PHP and UTF-8


Postby cmb » Thu Jun 14, 2012 5:04 pm

Hello Developers,

For quite a while I've been aware that PHP's string routines are not UTF-8 safe. But now I've noticed that even trim() fails to strip UTF-8 non-breaking spaces. :( I'm also aware that CMSimple_XH still uses several of these string routines in places that are prone to error. Consider e.g.
Code:
ucfirst($tx['action']['save'])

This results in garbage if the first letter of $tx['action']['save'] is a non-ASCII letter. PHP's mbstring extension offers several alternative functions that are suitable for handling UTF-8. Unfortunately, some of them are available only in newer PHP versions and, even worse, several string functions are not implemented at all. I'm not sure whether mbstring is available on all servers, but I guess there's no other way than to rely on this extension to make CMSimple_XH UTF-8 safe. Or should we wait for PHP6? ;)
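To illustrate the two problems above, here is a minimal sketch of mbstring-based replacements for ucfirst() and trim(). The function names sketch_utf8_ucfirst() and sketch_utf8_trim() are hypothetical and not part of CMSimple_XH; the sketch assumes the mbstring extension and PCRE with UTF-8 support are available, and it has not been checked against old PHP versions.

```php
<?php
// Hypothetical helpers for illustration only; not part of CMSimple_XH.

// Upper-cases the first character (not the first byte) of a UTF-8 string.
function sketch_utf8_ucfirst($string)
{
    if (!function_exists('mb_substr')) {
        // Assumption: without mbstring it is safer to return the string
        // unchanged than to let ucfirst() touch a multibyte sequence.
        return $string;
    }
    $length = mb_strlen($string, 'UTF-8');
    $first = mb_substr($string, 0, 1, 'UTF-8');
    $rest = mb_substr($string, 1, $length, 'UTF-8');
    return mb_strtoupper($first, 'UTF-8') . $rest;
}

// Strips whitespace, including the non-breaking space U+00A0, from both ends.
function sketch_utf8_trim($string)
{
    $result = preg_replace('/^[\s\x{00A0}]+|[\s\x{00A0}]+$/u', '', $string);
    // preg_replace() returns null if the subject is not valid UTF-8.
    return $result === null ? $string : $result;
}
```

With these helpers, sketch_utf8_ucfirst($tx['action']['save']) could stand in for the ucfirst() call shown above.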

Christoph
Christoph M. Becker---Plugins for CMSimple_XH
cmb
 
Posts: 5484
Joined: Tue Jun 21, 2011 11:04 am
Location: Germany

Re: PHP and UTF-8

Postby svasti » Thu Jun 14, 2012 5:24 pm

php and utf-8 = :oops:
probably we'll need a workaround...
svasti
 
Posts: 687
Joined: Wed Dec 17, 2008 5:08 pm
Location: Bielefeld, Germany

Re: PHP and UTF-8

Postby cmb » Thu Jun 14, 2012 8:34 pm

svasti wrote: probably we'll need a workaround...

Maybe I've found more than that: PHP UTF-8 :)

PS: This fork won't fit well with CMSimple_XH, as it requires PHP 5.3 (namespaces). But the original PHP UTF-8 seems to run well under PHP 4! The project seems to be dead, but the code is well written, and it might be easier to reuse this library than to start from scratch. It has to be tested thoroughly, though (a test suite is already included).

Re: PHP and UTF-8

Postby svasti » Thu Jun 14, 2012 10:09 pm

Sounds promising, as even DokuWiki seems to use it:
https://github.com/FSX/php-utf8 wrote: There is a fair degree of collaboration/exchange of ideas and code between Dokuwiki's UTF-8 library and phputf8.
That may be the best we can get. I also thought we would have to rewrite some functions to tackle the UTF-8 issues.

Re: PHP and UTF-8

Postby Holger » Fri Jun 15, 2012 10:52 am

Hi,

I thought that there were not so many flaws left in the core regarding UTF-8.
And even though I think it's time to stop sticking to PHP 4, it may be a good idea to have a closer look at PHP UTF-8.

KR,
Holger
Holger
Site Admin
 
Posts: 2582
Joined: Mon May 19, 2008 7:10 pm
Location: Hessen, Germany

Re: PHP and UTF-8

Postby cmb » Fri Jun 15, 2012 11:46 am

Hi Holger,

Holger wrote: I thought that there were not so many flaws left in the core regarding UTF-8.

I'm not sure how many flaws are left in the core regarding UTF-8, but there are at least some (probably minor ones). Such a library might be useful not only for the core but for plugins too. Using mbstring might be an option, but I have no clue how widely available it is (even on PHP 5). Even the core's search function provides a fallback for mb_strtolower(); this fallback depends on the xml extension and is likely to fail for non-"western" languages.
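The feature-detection idea can be sketched as follows. This is not the core's actual fallback; sketch_utf8_strtolower() is a hypothetical name, and the ASCII-only fallback is an assumption chosen because it cannot corrupt multibyte sequences (bytes of ASCII letters never occur inside a UTF-8 multibyte character).

```php
<?php
// Hypothetical helper for illustration only; not the core's implementation.
function sketch_utf8_strtolower($string)
{
    if (function_exists('mb_strtolower')) {
        // Preferred path: mbstring handles non-ASCII letters as well.
        return mb_strtolower($string, 'UTF-8');
    }
    // Fallback (assumption): lower-case ASCII letters only and leave every
    // other byte untouched, so the string stays valid UTF-8.
    return strtr(
        $string,
        'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
        'abcdefghijklmnopqrstuvwxyz'
    );
}
```

The fallback gives incomplete results for non-ASCII letters, but unlike a conversion through a single-byte encoding it never produces garbage.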

Holger wrote: And even though I think it's time to stop sticking to PHP 4

Quite a while ago, I was strongly advocating support for PHP 4. I'm not so sure about that anymore. However, as long as the core doesn't need PHP 5 features (I'm not talking about simple functions such as array_combine() and file_put_contents(), but about general features such as exception handling and "real" object orientation), we might as well stick with PHP 4.3 compatibility. Namespaces would be really nice for plugins, but requiring PHP 5.3 seems to be too much (many shared hosters offer PHP 5.2.x only).

Christoph

Re: PHP and UTF-8

Postby cmb » Sat Jun 16, 2012 12:10 pm

I've had a look at the test suite. After making the necessary changes due to the slightly changed interface of SimpleTest, the tests run quite well. 2 failures were due to false expectations (invalid UTF-8 sequences were asserted to be utf8_compliant()). I found a small inconsistency between the native and the mbstring-based implementations: utf8_strpos() returns different results for invalid UTF-8 sequences (IMO nothing to worry about). All other 302 test cases passed.

It's probably best to make it available as a plugin (without index.php and admin.php, though). So Utf8_XH should be renamed to Utf8check_XH or similar. The new Utf8 plugin can then be used from other plugins and from the core as desired (any plugin using it should clearly state the dependency, as is done with jQuery4CMSimple).
Last edited by cmb on Wed Jul 18, 2012 8:53 pm, edited 1 time in total.

Re: PHP and UTF-8

Postby cmb » Thu Jun 21, 2012 11:11 am

I just (re)read http://www.phpwact.org/php/i18n/utf-8. IMO important reading for PHP developers who have to handle UTF-8.

Short summary: probably quite something to do to make CMSimple_XH absolutely UTF-8 safe.

Re: PHP and UTF-8

Postby cmb » Sat Jul 28, 2012 5:28 pm

I found another minor issue related to UTF-8: html_entity_decode() doesn't work with multibyte encodings before PHP 5.0.0. It's used only once in the base distribution, namely in search.php to convert html entities in the content.

IMO it's no longer necessary to have HTML entities in the content (except for &, < and >) when using UTF-8. tinyMCE is configured accordingly (entity_encoding: 'raw'), as is CKEditor (entities: false). So a general html_entity_decode() would only be necessary for content that was created with other editors and/or with ANSI encoding. I'm planning to add this conversion to Utf8migrator, so after converting the old content once, only the 3 mentioned entities would have to be converted in the search function, which might be done by:
Code:
$pagexyz = str_replace(array('&amp;', '&lt;', '&gt;'), array('&', '<', '>'), $pagexyz);

(BTW: we might consider changing $pagexyz to $temp)
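
One detail that may be worth checking with this approach: str_replace() applies the search/replace pairs one after another, so decoding &amp; first can decode text such as &amp;lt; twice. A minimal sketch that decodes &amp; last (sketch_decode_basic_entities() is a hypothetical name, not an existing core function):

```php
<?php
// Hypothetical helper for illustration only.
// Decodes only the three entities that remain in UTF-8 content.
function sketch_decode_basic_entities($html)
{
    // '&amp;' comes last so that '&amp;lt;' becomes '&lt;' and not '<'.
    return str_replace(
        array('&lt;', '&gt;', '&amp;'),
        array('<', '>', '&'),
        $html
    );
}
```

Whether this matters for the search function depends on whether the content can contain doubly encoded entities at all.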
Last edited by cmb on Fri Aug 17, 2012 8:55 pm, edited 1 time in total.

Re: PHP and UTF-8

Postby cmb » Tue Aug 07, 2012 10:39 pm

I've updated PHPUTF8 to work with the latest versions of PHPDocumentor and SimpleTest, added utf8_wordwrap() and fixed 2 bugs. Interested developers can download it from http://sourceforge.net/projects/phputf8lib/. Please note that this is a general developer version, not one intended specifically for CMSimple_XH.

I suggest adding a production version[1] as a utility plugin to the base distribution. This way we can fix the remaining minor issues with ucfirst(), substr() et al., and plugin authors can use it for any UTF-8 string processing.

In addition we might consider checking all GPC data for UTF-8 validity to avoid potential security risks by malformed input.
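
Such a validity check could look roughly like the following sketch. It relies on preg_match() with the /u modifier failing for malformed UTF-8; sketch_is_valid_utf8() is a hypothetical name, and how the core should react to invalid input is left open here.

```php
<?php
// Hypothetical helper for illustration only.
// Returns true if a string, or every key and value of a (nested) array,
// is valid UTF-8.
function sketch_is_valid_utf8($value)
{
    if (is_array($value)) {
        foreach ($value as $key => $item) {
            if (!sketch_is_valid_utf8((string) $key)
                || !sketch_is_valid_utf8($item)
            ) {
                return false;
            }
        }
        return true;
    }
    // With /u, preg_match() returns false if the subject is malformed UTF-8;
    // the empty pattern matches any valid subject and returns 1.
    return preg_match('//u', (string) $value) === 1;
}
```

It could be applied to $_GET, $_POST and $_COOKIE early in the request, before any of the data is used.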

----
[1] i.e. a version containing only the actual library files (no docs, tests, etc.); it will take about <EDIT>100k</EDIT> uncompressed.

PS: added PHPUTF8 (r252), reviewed string functions (r254) and regexps (r248)
Last edited by cmb on Fri Aug 17, 2012 9:22 pm, edited 2 times in total.
