Addon: Cleaner URLs
Posted: Sun Dec 02, 2012 2:28 pm
Hello Community,
this is not about so-called clean URLs, i.e. URLs without the ? at the beginning. If you want those, you can try the respective tip in the CMSimple Wiki. Good luck!
IMO such clean URLs are overrated. The ? is no problem for search engines, and IMO only a minor issue for humans. What matters is the following:
- The URL shouldn't contain any URL-encoded characters (such as umlauts), as one cannot recognize or remember these easily: http://www.example.com/?Fahrvergn%FCgen. I'm well aware that modern browsers can handle many special characters in the URL, but IMO it's better to avoid them altogether.
- The underscore should be avoided, as many users are not familiar with this character, and it is hard to spot when a link is underlined: http://www.example.com/?Hard_to_spot. phpBB handles this quite okay here, but see my signature for an underscore that is really hard to spot.
- The slash should be used as delimiter of the page headings of different levels, which is quite common.
- The URL shouldn't contain mixed case characters, as these are hard to remember. It's quite common to have lower case letters only.
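To illustrate the first point, here is a small sketch of what PHP's rawurlencode() does to such headings. The strings are written as raw bytes, so the result doesn't depend on how the file is saved; the sample headings are just the ones from the list above.

```php
<?php
// ü is the single byte 0xFC in ISO-8859-1 and the two bytes 0xC3 0xBC in UTF-8.
echo rawurlencode("Fahrvergn\xFCgen"), "\n";     // Fahrvergn%FCgen
echo rawurlencode("Fahrvergn\xC3\xBCgen"), "\n"; // Fahrvergn%C3%BCgen
// A blank is encoded as well.
echo rawurlencode("Hard to spot"), "\n";         // Hard%20to%20spot
```

With UTF-8, every non-ASCII character turns into two or more %XX groups, which makes the URL even harder to read.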
The first point can be solved by using urichar_org/new. But that requires either catering for all potential special characters in headings (which is practically impossible) or adjusting urichar_org/new whenever a new special character is used in a heading. Note that it's not possible to replace a comma with urichar_org/new.
So I've thought about an automatic solution, which works as follows:
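A rough sketch of the manual approach, under the assumption that urichar_org and urichar_new are comma-separated lists of characters and their replacements. The helper name replace_uri_chars() and the sample values are hypothetical and only meant for illustration; this is not CMSimple's actual code. The file has to be saved as UTF-8.

```php
<?php
// Hypothetical stand-ins for the two settings (assumed: comma-separated lists).
$urichar_org = 'ä,ö,ü,ß';
$urichar_new = 'ae,oe,ue,ss';

// Hypothetical helper: replace each listed character with its counterpart.
function replace_uri_chars($heading, $org, $new)
{
    return str_replace(explode(',', $org), explode(',', $new), $heading);
}

echo replace_uri_chars('Fahrvergnügen', $urichar_org, $urichar_new), "\n"; // Fahrvergnuegen
echo replace_uri_chars('Café', $urichar_org, $urichar_new), "\n";          // Café (é is not listed)
```

If the lists are indeed split at the comma, this would also explain why the comma itself can't be one of the replaced characters.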
- replace all HTML entities
- apply some kind of transliteration (e.g. é -> e, ä -> ae)
- replace all characters that would be URL encoded with a minus sign
- replace all occurrences of more than one consecutive minus sign with a single minus sign
- convert the characters to lower case
Code:
function uenc($s)
{
    // optionally replace with a better transliteration library
    require_once UTF8 . '/utils/ascii.php';

    // decode HTML entities (e.g. &auml; -> ä)
    $s = html_entity_decode($s, ENT_QUOTES, 'UTF-8');
    // transliterate; optionally replace with a better transliteration function
    $s = utf8_accents_to_ascii($s);
    // replace everything that would be URL encoded with a minus sign
    $s = rawurlencode($s);
    $s = preg_replace('/%[a-f0-9]{2}/i', '-', $s);
    // collapse consecutive minus signs into a single one
    $s = preg_replace('/-+/', '-', $s);
    // convert to lower case
    return strtolower($s);
}
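To try the steps outside of CMSimple, here is a minimal standalone sketch. translit_demo() and uenc_demo() are hypothetical names; translit_demo() is only a tiny stand-in for utf8_accents_to_ascii() and knows just a handful of characters. The file has to be saved as UTF-8.

```php
<?php
// Hypothetical stand-in for the transliteration step (a few characters only).
function translit_demo($s)
{
    return strtr($s, array(
        'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue', 'ß' => 'ss',
        'Ä' => 'Ae', 'Ö' => 'Oe', 'Ü' => 'Ue', 'é' => 'e',
    ));
}

// Same steps as uenc(), but without the CMSimple specific include.
function uenc_demo($s)
{
    $s = html_entity_decode($s, ENT_QUOTES, 'UTF-8');
    $s = translit_demo($s);
    $s = rawurlencode($s);
    $s = preg_replace('/%[a-f0-9]{2}/i', '-', $s);
    $s = preg_replace('/-+/', '-', $s);
    return strtolower($s);
}

echo uenc_demo('Fahrvergn&uuml;gen'), "\n";    // fahrvergnuegen
echo uenc_demo('Hard to spot'), "\n";          // hard-to-spot
echo uenc_demo('Caf&eacute; &amp; Bar'), "\n"; // cafe-bar
echo uenc_demo('What now?'), "\n";             // what-now-
```

The last example shows that a special character at the end of a heading leaves a trailing minus sign; if that is unwanted, trim($s, '-') before returning would remove it. Also note that rawurlencode() doesn't encode the underscore, so an underscore that is typed into a heading stays in the URL.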
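Please note that this function doesn't use urichar_org/new at all, and that the transliteration only caters for Western and Central European languages, so you can't use it for e.g. Russian or Chinese. If there is a better or more appropriate transliteration library available, you can use it instead of Utf8_XH (see the comments in the code).
And please note that this is only roughly tested, and that it will change your URLs, so existing backlinks and bookmarks might break.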
Christoph