UTF-8 multibyte content corrupted

manu
Posts: 1122
Joined: Wed Jun 04, 2008 12:05 pm
Location: St. Gallen - Schweiz

UTF-8 multibyte content corrupted

Post by manu » Mon Jan 18, 2010 11:55 am

There is a nasty bug in all CMSimple versions using UTF-8.
After inserting certain characters into the content, such as で (U+3067, bytes e3 81 a7, HIRAGANA LETTER DE), the content is corrupted after saving.

CMSimple XH 1.0 cms.php at line 367 should be corrected as follows:

Code: Select all

//EM~	$c = explode('§', preg_replace("/(<h[1-".$cf['menu']['levels']."][^>]*>)/i", "§\\1", str_replace('§', '&#167;', rf($pth['file']['content'])))); 
	$c = explode("\xC2\xA7", preg_replace("/(<h[1-".$cf['menu']['levels']."][^>]*>)/i", "\xC2\xA7\\1", str_replace("\xC2\xA7", '&sect;', rf($pth['file']['content']))));
Explanation: str_replace('§', ...) destroys multibyte Unicode characters and leaves corrupted content. IMHO the correction works for all character encodings/character sets. mb_ereg_replace is probably not backward compatible everywhere and didn't deliver solid results with corrupted content.
I hope this is the only spot causing trouble; otherwise, TBC.
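
A minimal sketch of the corruption described above. It assumes that cms.php is stored in a single-byte encoding (ANSI / ISO-8859-1), so that the literal '§' is the single byte 0xA7; that assumption is not stated in the post, and the byte is written explicitly as "\xA7" so the sketch does not depend on how this file is saved. The variable names are only for illustration.

```php
<?php
// U+3067 (HIRAGANA LETTER DE) is the UTF-8 byte sequence e3 81 a7.
$de = "\xE3\x81\xA7";
// A real section sign in UTF-8 is the two bytes c2 a7.
$section = "\xC2\xA7";

$content = '<h1>Title</h1><p>' . $de . ' ' . $section . ' 1</p>';

// Assumed legacy behaviour: '§' in a single-byte source file is just 0xA7.
// str_replace() works on bytes, so it also hits the last byte of e3 81 a7.
$legacy = str_replace("\xA7", '&#167;', $content);

// Patched behaviour: only the full two-byte UTF-8 sequence is replaced.
$patched = str_replace($section, '&sect;', $content);

// $legacy no longer contains the intact character; $patched still does.
var_dump(strpos($legacy, $de) !== false);   // character survived in legacy?
var_dump(strpos($patched, $de) !== false);  // character survived in patched?
```

The design point is that the replacement only matches a complete UTF-8 sequence, so a stray 0xA7 byte inside another character is left alone.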
Regards
manu

EDIT:

Code: Select all

	preg_match_all("/<h[1-".$cf['menu']['levels']."].*(?:(?=<h[1-".$cf['menu']['levels']."])|\z)/isU", rf($pth['file']['content']),$c);
	$c = $c[0];

I tried to solve the problem above with preg_match_all. It works, but the pattern looks pretty sophisticated, and I don't know about the performance yet. Perhaps someone might play around with it or find a more specific solution.
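
For anyone who wants to play around with it, here is a small standalone sketch of that pattern. $levels and $content are hypothetical stand-ins for $cf['menu']['levels'] and the content file; the pattern itself is the one from the post, only built with single quotes.

```php
<?php
// Hypothetical stand-ins for $cf['menu']['levels'] and the file content.
$levels = 3;
$de = "\xE3\x81\xA7"; // U+3067, the character that triggered the bug
$content = '<h1>Start</h1><p>' . $de . '</p>'
         . '<h2>Sub</h2><p>more text</p>';

// Each match starts at a heading and, because of the U (ungreedy) flag,
// stops right before the next heading or at the end of the string (\z).
$pattern = '/<h[1-' . $levels . '].*(?:(?=<h[1-' . $levels . '])|\z)/isU';
preg_match_all($pattern, $content, $matches);
$pages = $matches[0];

// No marker character is inserted, so no byte of the content is touched.
print_r($pages);
```

Note that anything before the first heading is not part of any match, so it is dropped from $pages.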
Last edited by manu on Wed Jan 20, 2010 1:14 pm, edited 3 times in total.

Holger
Site Admin
Posts: 3470
Joined: Mon May 19, 2008 7:10 pm
Location: Hessen, Germany

Re: UTF-8 multibyte content corrupted

Post by Holger » Tue Jan 19, 2010 9:37 am

manu wrote: There is a nasty bug in all cmsimple versions using utf-8.
Hi manu!

Have you seen this thread: http://www.cmsimpleforum.com/viewtopic.php?f=10&t=360 ?

Holger

manu
Posts: 1122
Joined: Wed Jun 04, 2008 12:05 pm
Location: St. Gallen - Schweiz

Re: UTF-8 multibyte content corrupted

Post by manu » Tue Jan 19, 2010 11:59 am

Hi Holger

No, I didn't, but...
Pardon me, but this is rather a weak workaround in the wrong place.
Why not fix the bug where it happens (and let the correction flow into the original code)?
Regards
manu

Holger
Site Admin
Posts: 3470
Joined: Mon May 19, 2008 7:10 pm
Location: Hessen, Germany

Re: UTF-8 multibyte content corrupted

Post by Holger » Tue Jan 19, 2010 12:24 pm

Yes, of course you're right.

But your solution seems so easy to me that I can't believe no one else (or Peter) found it, since this bug has been known for such a long time.
That's why I pointed you to the other thread with a link to the discussion in the archived forum.

Anyway, maybe some other users with UTF-8 installations can test it.

Holger

manu
Posts: 1122
Joined: Wed Jun 04, 2008 12:05 pm
Location: St. Gallen - Schweiz

Re: UTF-8 multibyte content corrupted

Post by manu » Tue Jan 19, 2010 12:37 pm

Simple-looking solutions don't prove simple problems ;) . As the problem occurs rather sporadically, it took me a few hours to find and fix it. I'm currently setting up a British site with Japanese as a second language (even though I can't read or write any Japanese): http://www.realondon.net/ja/. I hope someone else can confirm that it is correct, and I hope this is the only spot causing problems.
manu

Holger
Site Admin
Posts: 3470
Joined: Mon May 19, 2008 7:10 pm
Location: Hessen, Germany

Re: UTF-8 multibyte content corrupted

Post by Holger » Tue Jan 19, 2010 1:35 pm

manu wrote: simple looking solutions don't prove simple problems ;) .
Yep, I hope you're right. Maybe it's only my pessimistic phase at the moment :? .
manu wrote: Actually I'm setting up a british site with japanese as 2nd language (even I don't understand or read/write any japanese) http://www.realondon.net/ja/.
Wow, looks interesting. How did you manage to enter those nice characters? Copy and paste?

The source code looks fine too, and the plugin also seems to work with your patch.
manu wrote: IMHO the correction works for all character encodings/character sets.
So you have also tested your patch with ANSI / ISO encoded content?

Holger

manu
Posts: 1122
Joined: Wed Jun 04, 2008 12:05 pm
Location: St. Gallen - Schweiz

Re: UTF-8 multibyte content corrupted

Post by manu » Tue Jan 19, 2010 9:36 pm

Holger wrote: How did you manage to edit that nice characters? Copy'n paste?
Luckily the customer enters the content, with a hiragana keyboard.
Holger wrote: So you have made some tests with ANSI / ISO encoded content together with your patch too?
No, since I only have the current website on an XH version. But logically it should work. Ask Murphy.

But it is turning into a little nightmare...
cms.php, function h():

return trim(strip_tags(preg_replace("/(<h[1-".$cf['menu']['levels']."][^>]*>([^§]*?)<\/h[1-".$cf['menu']['levels']."]>)?[^§]*/i", "\\2", $c[$n])));

Can anybody tell me what the exclusion of § is for? Again, since that byte can be part of a Unicode multibyte sequence, I have to leave it untouched, but then the h() function doesn't run properly anymore.

Modified, ~line 402:

Code: Select all

	return trim(strip_tags(preg_replace("/(<h[1-".$cf['menu']['levels']."][^>]*>(.*?)<\/h[1-".$cf['menu']['levels']."]>)?.*/is", "\\2", $c[$n])));
The same in l().
Original code, ~line 408:

if (isset($c[$n]))return preg_replace("/<h([1-".$cf['menu']['levels']."])[^>]*>[^§]*/i", "\\1", $c[$n]);

Modified:

Code: Select all

	if (isset($c[$n]))return preg_replace("/<h([1-".$cf['menu']['levels']."])[^>]*>.*/is", "\\1", $c[$n]);
With these modifications the Japanese website seems to run properly.
I'll keep on testing but hope that this is it.
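
A small sketch of why the [^§] class trips over such headings. It again assumes a cms.php stored in a single-byte encoding, where § is the byte 0xA7 (written as \xA7 here so the sketch does not depend on the file's encoding). In a UTF-8 source file without the u modifier the class would exclude the bytes c2 and a7 individually, which hits the same character. The function names and $levels are made up for the example.

```php
<?php
$levels = 3;                       // stand-in for $cf['menu']['levels']
$de = "\xE3\x81\xA7";              // U+3067, ends in the byte a7
$page = '<h1>' . $de . '</h1><p>text</p>';

// Original idea of h(): everything except the marker byte.
function heading_old($page, $levels) {
    $p = '/(<h[1-' . $levels . '][^>]*>([^\xA7]*?)<\/h[1-' . $levels . ']>)?[^\xA7]*/i';
    return trim(strip_tags(preg_replace($p, '\\2', $page)));
}

// Modified h() from this post: any character, with the s (dotall) flag.
function heading_new($page, $levels) {
    $p = '/(<h[1-' . $levels . '][^>]*>(.*?)<\/h[1-' . $levels . ']>)?.*/is';
    return trim(strip_tags(preg_replace($p, '\\2', $page)));
}

$old = heading_old($page, $levels);
$new = heading_new($page, $levels);

// Compare the raw bytes of both results with the expected heading.
echo bin2hex($old), "\n", bin2hex($new), "\n", bin2hex($de), "\n";
```

The old class cannot cross the a7 byte, so the optional heading group never matches and the heading text is lost; the modified pattern returns the heading unchanged.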
Regards
manu

Martin
Posts: 346
Joined: Thu Oct 23, 2008 11:57 am

Re: UTF-8 multibyte content corrupted

Post by Martin » Wed Jan 20, 2010 3:48 pm

Hi Manu,

I was about to offer another solution for discussion, and now I see that you edited your first post and came up with the same idea as me (preg_match_all() instead of inserting a silly token "§" to explode on afterwards). That really should be the better approach! (And I am pretty sure that we can strip this damn "§" from l() and h() as well.)

My proposal was a complete rewrite of rfc() that does not use l() and h() anymore:

Code: Select all

function rfc(){
    global $c, $cl, $h, $u, $l, $su, $s, $pth, $tx, $edit, $adm, $cf;

    $c = array();
    $h = array();
    $u = array();
    $l = array();
    $empty = 0;
    $duplicate = 0;
    
    $content = file_get_contents($pth['file']['content']);
    $stop = $cf['menu']['levels'];
    $pattern = '/(<h([1-'.$stop.'])[^>]*>(.*)<\/h[1-'.$stop.'](.+))(?=(<(h[1-'.$stop.']|\/body).*>))/isU';
    preg_match_all($pattern, $content, $pages);

    $c = $pages[1];
    $cl = count($c);

    if ($cl == 0){
        $c[] = '<h1>'.$tx['toc']['newpage'].'</h1>';
        $h[] = trim(strip_tags($tx['toc']['newpage']));
        $u[] = uenc($h[0]);
        $l[] = 1;
        $s = 0;
        return;
    }

    $l = $pages[2];
    $ancestors = array();  /* just a helper for the "url" construction:
                            * will be filled like this [0] => "Page"
                            *                          [1] => "Subpage"
                            *                          [2] => "Sub_Subpage" etc.
                            */

    foreach($pages[3] as $i => $heading){
        $temp = trim(strip_tags($heading));
        if($temp == ''){
            $empty++;
            $temp = $tx['toc']['empty']. ' '. $empty;
        }
        $h[] = $temp;
        $ancestors[$l[$i]-1] = uenc($temp);
        $ancestors = array_slice($ancestors,0, $l[$i]);
        $url = implode($cf['uri']['seperator'], $ancestors);
        $u[] = substr($url, 0, $cf['uri']['length']);
    }

    foreach($u as $i => $url){
        if ($su == $u[$i]){$s = $i;} // get index of selected page

        for($j = $i + 1; $j < $cl; $j++){   //check for duplicate "urls"
            if($u[$j] == $u[$i]){
                $duplicate++;
                $h[$j] = $tx['toc']['dupl'].' '.$duplicate;
                $u[$j] = uenc($h[$j]);
            }
        }
    }
    if(!($edit && $adm)){
        foreach($c as $i => $j) {
            if (cmscript('remove', $j)){$c[$i] = '#CMSimple hide#';}
        }
    }
}
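
To show what the pattern in rfc() captures, here is a standalone sketch. $stop and $content are hypothetical stand-ins for $cf['menu']['levels'] and the content file; the pattern is copied from the function above.

```php
<?php
$stop = 3; // stand-in for $cf['menu']['levels']
$content = '<html><body>'
         . '<h1>Home</h1><p>Welcome</p>'
         . '<h2 class="x">About <em>us</em></h2><p>Text</p>'
         . '</body></html>';

$pattern = '/(<h([1-'.$stop.'])[^>]*>(.*)<\/h[1-'.$stop.'](.+))(?=(<(h[1-'.$stop.']|\/body).*>))/isU';
preg_match_all($pattern, $content, $pages);

// $pages[1]: whole page (heading plus body), later used as $c
// $pages[2]: heading level, later used as $l
// $pages[3]: raw heading markup, later run through strip_tags() for $h
foreach ($pages[1] as $i => $page) {
    echo $pages[2][$i], ' | ', trim(strip_tags($pages[3][$i])), ' | ', $page, "\n";
}
```

Because of the lookahead, every page has to be followed by another heading or by a closing body tag, so as far as I can see a content file without </body> would lose its last page.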
 
EDIT, 29 June 2010: Changed one line for proper URL construction: "array_slice($ancestors,0, $l[$i]);" => "$ancestors = array_slice($ancestors,0, $l[$i]);"
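
The effect of that one-line change can be seen in a reduced sketch of the URL loop. build_urls() is a made-up helper, rawurlencode() only stands in for CMSimple's uenc(), ':' is an example separator, and the type hints need a newer PHP than the one this code was written for.

```php
<?php
// Reduced version of the URL loop in rfc(). With $trim = false it behaves
// like the code before the edit (result of array_slice() thrown away).
function build_urls(array $levels, array $headings, string $separator, bool $trim = true): array
{
    $ancestors = [];
    $urls = [];
    foreach ($headings as $i => $heading) {
        // Put the current page at its own depth ...
        $ancestors[$levels[$i] - 1] = rawurlencode($heading);
        // ... and cut off deeper entries left over from earlier pages.
        $sliced = array_slice($ancestors, 0, $levels[$i]);
        if ($trim) {
            $ancestors = $sliced;
        }
        $urls[] = implode($separator, $ancestors);
    }
    return $urls;
}

$levels   = [1, 2, 3, 2, 1];
$headings = ['Home', 'Products', 'Widgets', 'Contact', 'Imprint'];

$fixed  = build_urls($levels, $headings, ':');
$before = build_urls($levels, $headings, ':', false);

print_r($fixed);   // 'Contact' becomes Home:Contact
print_r($before);  // 'Contact' still drags the stale 'Widgets' entry along
```

Without the assignment, a page that comes after a deeper page keeps the deeper page's name at the end of its URL.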

I don't know whether l() and h() are used in any plugin. But to be safe, they could be simplified to

Code: Select all

function h($n) {
    global $h;
    return $h[$n];
}
function l($n) {
    global $l;
    return $l[$n];
} 
Would you be so kind as to have a look at it?

Martin
Last edited by Martin on Tue Jun 29, 2010 1:34 pm, edited 1 time in total.

manu
Posts: 1122
Joined: Wed Jun 04, 2008 12:05 pm
Location: St. Gallen - Schweiz

Re: UTF-8 multibyte content corrupted

Post by manu » Wed Jan 20, 2010 4:11 pm

Hi Martin

Looks brilliant to me at first glance.
As soon as I have a proper testing environment I'll include and test it.

Regards
manu
