
Search function and Unicode equivalence

Posted: Tue May 13, 2014 12:52 pm
by cmb
Hello Community,

it seems reasonable to improve the search function with regard to Unicode equivalence, which has traditionally been completely neglected.

PHP's intl extension offers grapheme_strpos(), which could be used instead of the current strpos() if available (otherwise we'd have to fall back to strpos() anyway).

Furthermore, we should consider using a case-insensitive comparison, i.e. grapheme_stripos(), and otherwise fall back to the current algorithm that uses utf8_strtolower() and strpos(). This step in particular might provide a performance improvement and better results.

Christoph

Re: Search function and Unicode equivalence

Posted: Tue May 13, 2014 1:25 pm
by manu
+1

Re: Search function and Unicode equivalence

Posted: Mon Aug 18, 2014 6:46 pm
by cmb
cmb wrote:
it seems reasonable to improve the search function with regard to Unicode equivalence, which has traditionally been completely neglected.

PHP's intl extension offers grapheme_strpos(), which could be used instead of the current strpos() if available (otherwise we'd have to fall back to strpos() anyway).
As I found out, grapheme_strpos() doesn't cater for Unicode equivalence; it merely reports the position of the needle within the haystack in grapheme units rather than bytes, which is not helpful for our purpose.

However, there is Normalizer::normalize() which is also part of the intl extension. I have used this instead.
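
To illustrate the idea, here is a minimal sketch of a normalizing search. It is not the code committed to the repository: the helper names sketch_normalize() and sketch_contains() are made up for this example, and NFC is assumed as the normalization form. If the intl extension is missing, the sketch simply compares the strings unnormalized.

```php
<?php
// Hypothetical helper (not the committed code): bring a UTF-8 string into
// Unicode Normalization Form C if the intl extension is available.
function sketch_normalize($text)
{
    if (class_exists('Normalizer')) {
        $normalized = Normalizer::normalize($text, Normalizer::FORM_C);
        // normalize() returns false on failure (e.g. invalid UTF-8).
        if ($normalized !== false) {
            return $normalized;
        }
    }
    // No intl or normalization failed: use the text unchanged.
    return $text;
}

// Hypothetical helper: byte-wise substring test on the normalized strings.
function sketch_contains($haystack, $needle)
{
    if ($needle === '') {
        return false;
    }
    return strpos(sketch_normalize($haystack), sketch_normalize($needle)) !== false;
}

$precomposed = "caf\xC3\xA9";  // "cafe" ending in U+00E9
$decomposed  = "cafe\xCC\x81"; // "cafe" followed by U+0301 (combining acute)

var_dump(sketch_contains("Un $decomposed noir", $precomposed));
```

With intl loaded, both spellings normalize to the same byte sequence, so the needle is found; without intl, the plain strpos() comparison does not match them.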

Furthermore, I have added utf8_stripos() to Utf8_XH, which tries to use mb_stripos() and falls back to utf8_strtolower() and strpos(). I have used this new function for the search.
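
Roughly, the fallback chain looks like the following sketch. This is a simplified illustration rather than the actual Utf8_XH code: the function is named sketch_utf8_stripos() to keep it apart from the real one, and since utf8_strtolower() is not defined in a standalone script, plain strtolower() stands in for it when neither mbstring nor Utf8_XH is loaded (an assumption that only handles ASCII case folding correctly).

```php
<?php
// Hypothetical sketch of a case-insensitive UTF-8 search with fallbacks.
// Returns a position or false. Note that mb_stripos() counts characters,
// whereas the strpos() fallback counts bytes; for a plain "found or not"
// search this difference does not matter.
function sketch_utf8_stripos($haystack, $needle)
{
    if ($needle === '') {
        return false;
    }
    if (function_exists('mb_stripos')) {
        return mb_stripos($haystack, $needle, 0, 'UTF-8');
    }
    // Fallback: lowercase both strings, then do a byte-wise search.
    // utf8_strtolower() is used if it is defined; strtolower() is only
    // an ASCII-safe stand-in for this standalone example.
    $lower = function_exists('utf8_strtolower') ? 'utf8_strtolower' : 'strtolower';
    return strpos($lower($haystack), $lower($needle));
}

var_dump(sketch_utf8_stripos("Hello World", "WORLD") !== false);
```

Checking mb_stripos() first avoids lowercasing the whole haystack in userland PHP whenever mbstring is available.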

(r1349-r1352)