XH 1.6.1 : Malformed UTF-8 detected!

A place to report and discuss bugs - please mention CMSimple-version, server, platform and browser version
Bob
Posts: 120
Joined: Sat Jun 14, 2008 8:30 am
Location: France

XH 1.6.1 : Malformed UTF-8 detected!

Post by Bob » Fri Feb 14, 2014 11:00 am

Hello
I got this message from a customer who uses a French website running XH 1.6.1 (which I made; HTML5, correctly UTF-8 encoded). I cannot reproduce the error, but I know that it existed in XH 1.5.4. Should I look for the problem in the customer's browser, or does the problem still exist in the latest version? :roll:
Bob

cmb
Posts: 14225
Joined: Tue Jun 21, 2011 11:04 am
Location: Bingen, RLP, DE

Re: XH 1.6.1 : Malformed UTF-8 detected!

Post by cmb » Fri Feb 14, 2014 11:33 am

This message is triggered by the check that the input variables are valid UTF-8. Now that I rethink your problem, it seems the check (or at least the handling) is too strict. For instance, there could be problems with websites that were formerly encoded as ANSI, if someone has bookmarked a deep link to a page "Téléchargement". Encoded as ISO 8859-1 the link is ?T%E9l%E9chargement, which will result in said message under XH 1.6/1.6.1 (but not under XH < 1.6). I'm not sure whether there are other cases where this message will show up, unless the user has an old or even misconfigured browser.
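
To illustrate the deep-link case, here is a minimal sketch, not code from CMSimple_XH. The helper looks_like_utf8() is a hypothetical name, and the UTF-8 form of the link (T%C3%A9l%C3%A9chargement) is my assumption of what a current browser would send:

```php
<?php
// Hypothetical helper for illustration; not part of CMSimple_XH.
function looks_like_utf8(string $bytes): bool
{
    // With the /u modifier, preg_match() fails (returns false) when the
    // subject is not valid UTF-8, so only a result of 1 counts as valid.
    return preg_match('//u', $bytes) === 1;
}

// The same page name, percent-encoded from ISO 8859-1 and from UTF-8.
$legacyLink = rawurldecode('T%E9l%E9chargement');
$utf8Link   = rawurldecode('T%C3%A9l%C3%A9chargement');

var_dump(looks_like_utf8($legacyLink)); // bool(false)
var_dump(looks_like_utf8($utf8Link));   // bool(true)
```

The byte 0xE9 starts a three-byte sequence in UTF-8, but the next byte is a plain "l", so the old bookmark cannot pass a strict validity check.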

Anyway, we have to reconsider the check and maybe drop it altogether. Actually, it is not necessary for security per se; rather, it is meant to suppress security issues with some (mostly old) plugins which were not written for UTF-8 encoding or for newer PHP versions.

As a quick workaround for your client's installation, just remove the following from cmsimple/cms.php (line 330ff):

XH_checkValidUtf8(
    array($_GET, $_POST, $_SERVER, array_keys($_GET), array_keys($_POST))
);
Christoph M. Becker – Plugins for CMSimple_XH

Bob
Posts: 120
Joined: Sat Jun 14, 2008 8:30 am
Location: France

Re: XH 1.6.1 : Malformed UTF-8 detected!

Post by Bob » Fri Feb 14, 2014 4:50 pm

Thank you for this tip Christoph.
I'm going to test this solution. I checked the links in the content and did not see any strangely encoded text; all texts and variables are UTF-8 encoded. I verified this with the PHP function mb_check_encoding(). Note that only the customer sees this error (5 PCs at home and many browsers, caches and cookies emptied).

cmb
Posts: 14225
Joined: Tue Jun 21, 2011 11:04 am
Location: Bingen, RLP, DE

Re: XH 1.6.1 : Malformed UTF-8 detected!

Post by cmb » Tue Apr 15, 2014 5:19 pm

I had completely forgotten this issue. As it seems to be a bug, I've put it on the 1.6.2 roadmap. I suggest that we simply revert to the less restrictive XH 1.5.x check for now.

PS: cf. http://cmsimpleforum.com/viewtopic.php?f=29&t=7127
Last edited by cmb on Tue Apr 15, 2014 10:11 pm, edited 1 time in total.
Reason: added PS
Christoph M. Becker – Plugins for CMSimple_XH

svasti
Posts: 1651
Joined: Wed Dec 17, 2008 5:08 pm

Re: XH 1.6.1 : Malformed UTF-8 detected!

Post by svasti » Tue Apr 15, 2014 8:25 pm

cmb wrote:I suggest that we simply revert to the less restrictive XH 1.5.x check for now.
+1

manu
Posts: 1086
Joined: Wed Jun 04, 2008 12:05 pm
Location: St. Gallen - Schweiz

Re: XH 1.6.1 : Malformed UTF-8 detected!

Post by manu » Mon Apr 21, 2014 1:18 pm

cmb wrote:I suggest that we simply revert to the less restrictive XH 1.5.x check for now.
As the check seems reasonable, why not just omit the check of array_keys($_GET)?
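
A small sketch of why the keys matter, assuming the page is selected by a bare query string such as ?T%E9l%E9chargement (as in the example earlier in this thread). The closure $isUtf8 is a hypothetical stand-in, not the XH check itself:

```php
<?php
// A bare page name in the query string, as in the ISO 8859-1 deep link.
// parse_str() is used here to mimic how PHP would populate $_GET.
parse_str('T%E9l%E9chargement', $get);

// The page name ends up as an array key; its value is an empty string.
$keys   = array_keys($get);
$values = array_values($get);

// Hypothetical stand-in for a UTF-8 validity check.
$isUtf8 = function (string $s): bool {
    return preg_match('//u', $s) === 1;
};

var_dump($isUtf8($keys[0]));   // bool(false): the key carries the raw 0xE9 bytes
var_dump($isUtf8($values[0])); // bool(true): the empty value is trivially valid
```

Under that assumption, only the check of array_keys($_GET) would trip over such a link; the values of $_GET would still be checked.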

cmb
Posts: 14225
Joined: Tue Jun 21, 2011 11:04 am
Location: Bingen, RLP, DE

Re: XH 1.6.1 : Malformed UTF-8 detected!

Post by cmb » Mon Apr 21, 2014 2:29 pm

manu wrote:As the check seems reasonable, why not just omit the check of array_keys($_GET)?
Might be the best option.
Christoph M. Becker – Plugins for CMSimple_XH

cmb
Posts: 14225
Joined: Tue Jun 21, 2011 11:04 am
Location: Bingen, RLP, DE

Re: XH 1.6.1 : Malformed UTF-8 detected!

Post by cmb » Mon Apr 21, 2014 9:35 pm

FWIW: I ran some quick benchmarks of the UTF-8 check with the following command:

ab -n 1000 -c 10 http://localhost/xh161e/
Results:
  • Plain XH 1.6.1 (i.e. full checking):
    Time per request:       96.486 [ms] (mean)
  • No checks:
    Time per request:       68.764 [ms] (mean)
  • Checks as with XH 1.5.10:
    Time per request:       67.204 [ms] (mean)
  • Full checking, except $_SERVER:
    Time per request:       65.764 [ms] (mean)
Checking $_SERVER in particular seems to make a huge difference (96.486 ms with it vs. 65.764 ms without). Further investigation showed that count($_SERVER) == 43, with a total string length of less than 2 kB. Obviously, the underlying utf8_is_valid() is terribly slow, which is not surprising when one inspects its sources in plugins/utf8/utils/validation.php. Replacing utf8_is_valid() with utf8_compliant() while keeping full checking gave the following result:

Time per request:       66.564 [ms] (mean)
Christoph M. Becker – Plugins for CMSimple_XH

cmb
Posts: 14225
Joined: Tue Jun 21, 2011 11:04 am
Location: Bingen, RLP, DE

Re: XH 1.6.1 : Malformed UTF-8 detected!

Post by cmb » Fri Apr 25, 2014 11:34 am

I had a closer look at the utf8_is_valid() vs. utf8_compliant() issue. The sources (plugins/utf8/utils/validation.php) point to a comment of the original author of the utf8 library:
PCRE regards five and six octet UTF-8 character sequences as valid (both in patterns and the subject string) but these are not supported in Unicode
As this comment was made eight years ago, I double-checked it. Apparently the behavior has changed in newer PCRE versions: since PHP 4.4.9 and PHP 5.2.5 (standard builds), such five and six octet sequences are no longer regarded as valid UTF-8 by PCRE.

Further investigation showed that the relevant behavior changed with PCRE 7.3 (2007-08-28), which is documented in the PCRE changelog as item 15:
Updated the test for a valid UTF-8 string to conform to the later RFC 3629. This restricts code points to be within the range 0 to 0x10FFFF, excluding the "low surrogate" sequence 0xD800 to 0xDFFF. Previously, PCRE allowed the full range 0 to 0x7FFFFFFF, as defined by RFC 2279.
So I can safely update utf8_is_valid() to make use of the much faster check when an appropriate PCRE version is installed.
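
Roughly, the idea could be sketched like this. This is not the code committed to XH or the utf8 library. All function names are hypothetical, and the byte-range fallback is only a stand-in for the existing slow check:

```php
<?php
// Hypothetical sketch; none of these names belong to the utf8 library.

/** True if the installed PCRE applies the RFC 3629 rules (PCRE >= 7.3). */
function pcre_follows_rfc3629(): bool
{
    // PCRE_VERSION looks like "8.45 2021-06-15"; compare the numeric part.
    $version = explode(' ', PCRE_VERSION)[0];
    return version_compare($version, '7.3', '>=');
}

/** Slow path (stand-in): match against the well-formed UTF-8 byte ranges. */
function is_valid_utf8_bytewise(string $str): bool
{
    $pattern = '/\A(?:[\x00-\x7F]'             // ASCII
        . '|[\xC2-\xDF][\x80-\xBF]'            // 2 octets, no overlongs
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'        // 3 octets, no overlongs
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}' // 3 octets
        . '|\xED[\x80-\x9F][\x80-\xBF]'        // 3 octets, no surrogates
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'     // 4 octets, no overlongs
        . '|[\xF1-\xF3][\x80-\xBF]{3}'         // 4 octets
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2}'     // 4 octets, up to U+10FFFF
        . ')*+\z/';
    return preg_match($pattern, $str) === 1;
}

/** Use the fast /u check where PCRE can be trusted, else the slow path. */
function is_valid_utf8_sketch(string $str): bool
{
    if (pcre_follows_rfc3629()) {
        // preg_match() returns false for subjects that are not valid UTF-8.
        return preg_match('//u', $str) === 1;
    }
    return is_valid_utf8_bytewise($str);
}

var_dump(is_valid_utf8_sketch("T\xC3\xA9l\xC3\xA9chargement")); // bool(true)
var_dump(is_valid_utf8_sketch("T\xE9l\xE9chargement"));         // bool(false)
```

On a PCRE of 7.3 or later, both paths should agree on surrogates and on five octet sequences, which is what makes the fast path a safe replacement.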
manu wrote:As the check seems reasonable, why not just omit the check of array_keys($_GET)?
Considering the above: +1
Christoph M. Becker – Plugins for CMSimple_XH

cmb
Posts: 14225
Joined: Tue Jun 21, 2011 11:04 am
Location: Bingen, RLP, DE

Re: XH 1.6.1 : Malformed UTF-8 detected!

Post by cmb » Wed May 21, 2014 7:25 pm

Done (r1300+r1301).
Christoph M. Becker – Plugins for CMSimple_XH
