XH 1.6: Linkchecker overhaul

cmb
Posts: 13273
Joined: Tue Jun 21, 2011 11:04 am
Location: Mü-Sa, RLP, DE

XH 1.6: Linkchecker overhaul

Post by cmb » Sat Nov 24, 2012 6:35 pm

Hello Community,

in classic CMSimple the link checker was very simple: gather all links found in the content, send a GET request for each, and check the HTTP response status code. As these checks were done one after the other, the link check was very slow if there were many links in the content.

So in CMSimple_XH this simple check was changed to distinguish between internal and external links: external links are handled as before, but internal links are checked directly against the URL array of the pages or the file system. As the differentiation between internal and external links, and particularly the checking of internal links, is much more complex than the simple check of classic CMSimple, some bugs were introduced. Over time the link checker was improved (the last time for XH 1.5), but there are still some unsolved issues. Some were reported lately by snafu, and several others were reported by myself in the internal forum.
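To illustrate the internal/external distinction described above, here is a minimal sketch in Python (not the PHP code of CMSimple_XH). The function name `classify_link` and the parameter `site_host` are hypothetical, and the rule "same host or no host means internal" is an assumption made for the example.

```python
from urllib.parse import urlsplit


def classify_link(href: str, site_host: str) -> str:
    """Return 'internal', 'external' or 'other' for a link found in the content.

    Assumption: a link is internal if it has no host part or if its host
    equals site_host; ports and user info are not handled in this sketch.
    """
    parts = urlsplit(href)
    # Schemes that cannot be checked with an HTTP request at all.
    if parts.scheme in ("mailto", "tel", "javascript"):
        return "other"
    if parts.netloc and parts.netloc.lower() != site_host.lower():
        return "external"
    return "internal"
```

With this rule, `?Register` and `./?&mailform` count as internal, so any internal check still has to know about such special pages.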

So I suggest overhauling the link checker for XH 1.6. Either we stick with the current distinction between internal and external links and "just" solve the remaining issues, or we might consider going back to the "classic" solution but sending the GET requests asynchronously (i.e. in parallel), which could be done with Ajax. Of course it's possible to combine both solutions.

Relying on curl_multi_*() is probably not an option, as this would require the curl extension (not sure if that's available "everywhere") and PHP 5. But this could still be offered by a plugin.

Christoph
Christoph M. Becker – Plugins for CMSimple_XH

manu
Posts: 722
Joined: Wed Jun 04, 2008 12:05 pm
Location: St. Gallen - Schweiz

Re: XH 1.6: Linkchecker overhaul

Post by manu » Sun Dec 30, 2012 1:43 pm

cmb wrote: Relying on curl_multi_*() is probably not an option, as this would require the curl extension (not sure if that's available "everywhere") and PHP 5. But this could still be offered by a plugin.
How about a fallback to the classic method if the curl extension is not installed?
A test: less than 30 seconds response time for 58 checked links; I could live with that.
But an overhaul is reasonable; system functions ("?mailform") should be interpreted correctly.


Re: XH 1.6: Linkchecker overhaul

Post by cmb » Sun Dec 30, 2012 2:33 pm

manu wrote: How about a fallback to the classic method if the curl extension is not installed?
That might be a very good idea (depending on the percentage of installations which offer curl).
manu wrote: A test: less than 30 seconds response time for 58 checked links; I could live with that.
The problem: if the external URLs are checked one at a time, a delay from just one or two URLs might cause a timeout in the link checker.


Re: XH 1.6: Linkchecker overhaul

Post by cmb » Wed Jan 16, 2013 12:55 am

Hello Community,
cmb wrote: or we might consider going back to the "classic" solution but sending the GET requests asynchronously (i.e. in parallel), which could be done with Ajax
This solution might have some advantages. For demonstration purposes and as a proof of concept, I've implemented a very simple link checker as a plugin. I'm quite confident that it doesn't do any harm, but don't rely on its results ;).

I'm looking forward to your feedback. I'm particularly interested in the performance under real-world conditions in comparison to the built-in link checker.

Christoph


Re: XH 1.6: Linkchecker overhaul

Post by manu » Wed Jan 16, 2013 5:34 pm

Pretty cool:
Plugin Linkchecker: Total: 54 – OK: 36 – Warn: 3 – Fail: 15 – ToDo: 0 ::33.7secs
Built in Link check: 58 links were checked. :: 31.4secs

So no big difference?
Plugin Linkchecker seems to wait quite long until it starts to count down.
regards
manu

EDIT: BTW, what's this link in the plugin directory: insert-link Symbole, Gratissymbole in Human-O2 , (Symbol-Suchmachine).url ??


Re: XH 1.6: Linkchecker overhaul

Post by cmb » Wed Jan 16, 2013 6:30 pm

Hi manu,

thanks for testing.
manu wrote: Plugin Linkchecker seems to wait quite long until it starts to count down.
The procedure is quite simple:
  1. gather all links in the content and deliver the list to the browser
  2. for each link: call CMSimple with a respective GET parameter which contains the URL to check
  3. trigger a HEAD request to the passed URL and return the first line of the response
The bottleneck is (2). This is at least necessary for external URLs due to the same origin policy restrictions. Internal URLs could be checked by triggering the HEAD request directly. But even then I don't expect Linkchecker_XH to be much faster than the current solution in the core (which might be improved much more); but there is the advantage of the immediate and ongoing visual feedback for the user.
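Step (3) of the procedure above can be sketched as follows. This is an illustration in Python using the standard library's `http.client`, not the plugin's PHP code; the function name `head_status_line` and the default timeout are assumptions.

```python
import http.client
from urllib.parse import urlsplit


def head_status_line(url: str, timeout: float = 5.0) -> str:
    """Send a HEAD request to url and return the first line of the response.

    Sketch only: absolute http/https URLs without user info are assumed.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        resp = conn.getresponse()
        # resp.version is 10 or 11; rebuild the status line from its parts.
        return f"HTTP/{resp.version / 10:.1f} {resp.status} {resp.reason}"
    finally:
        conn.close()
```

Called once per link, this is the sequential round trip that makes step (2) the bottleneck: every link costs one request to CMSimple plus one request to the target.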

The link in the directory points to a potential plugin icon, upon which I've stumbled.

Christoph


Re: XH 1.6: Linkchecker overhaul

Post by cmb » Thu Apr 25, 2013 11:50 am

cmb wrote: in classic CMSimple the link checker was very simple: just gather all links found in the content, and send a GET request for each and check the HTTP response status code.
That's not correct. I had a closer look at the linkcheck code of CMSimple 3.4, and all links without an explicit "http://" scheme are only checked for whether they point to an internal page of the current language.

However, I tried to improve my former draft to check internal links directly from the browser. Basically that would work fine, but in admin mode links to non-existent pages just return 200 and do not provide any easily checkable evidence that the page does not exist (the contents area is just blank in normal mode). Either we fix this behavior (i.e. report URLs to non-existent resources as 404 even in admin mode) or we stick with the current behavior of the link check and try to fix the remaining issues. The former may not be solvable without introducing incompatibilities with existing extensions; the latter is likely to require even more fiddling around with special cases, and definitely won't catch all of them (such as links to special pages introduced by plugins, e.g. ?Register).

Anyway, a solution to avoid the time-out or missing-feedback issue may not be too hard to implement: trigger the actual checking of links from the browser in chunks (maybe page by page). If necessary, state information could be stored in a session variable.
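A minimal sketch of the chunking idea, in Python for illustration. The function `next_chunk` and the fixed chunk size are assumptions; the returned offset stands in for the state that could be kept in a session variable between two requests from the browser.

```python
from typing import List, Optional, Tuple


def next_chunk(links: List[str], offset: int, size: int) -> Tuple[List[str], Optional[int]]:
    """Return the links to check in this request and the offset for the next one.

    The second value is None once all links have been handed out.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    chunk = links[offset:offset + size]
    new_offset = offset + len(chunk)
    return chunk, (new_offset if new_offset < len(links) else None)
```

Each browser request would then check one chunk and report its results at once, so no single request runs long enough to time out.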

About the performance of the link check: this may be improved one way or the other, but it's probably wise not to trigger too many requests to a domain simultaneously, as this may slow down the server for other requests and could even be regarded as some kind of DoS attack (e.g. gxSecurity may prevent such request floods).
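One way to limit simultaneous requests per domain is to split the URLs into "waves" that are checked one after another. The following Python sketch is only an illustration; the name `throttle_waves` and the default limit of two requests per host are assumptions.

```python
from collections import defaultdict
from typing import Dict, List
from urllib.parse import urlsplit


def throttle_waves(urls: List[str], per_host: int = 2) -> List[List[str]]:
    """Group urls into waves with at most per_host URLs per host in each wave."""
    if per_host < 1:
        raise ValueError("per_host must be at least 1")
    by_host: Dict[str, List[str]] = defaultdict(list)
    for url in urls:
        by_host[urlsplit(url).netloc.lower()].append(url)
    waves: List[List[str]] = []
    while any(by_host.values()):
        wave: List[str] = []
        for queue in by_host.values():
            # Take the next few URLs of this host and drop them from its queue.
            wave.extend(queue[:per_host])
            del queue[:per_host]
        waves.append(wave)
    return waves
```

All URLs of one wave can be requested in parallel, while no single server ever sees more than `per_host` requests at a time.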

OTOH I have some doubts that the link checker is a useful tool. It checks neither links in the template nor links in plugin-generated content (such as guestbooks). There are several external tools which do so, and which seem to be quite mature. Maybe it's better to point users to such tools. :?

oldnema
Posts: 265
Joined: Wed Jan 21, 2009 5:15 pm
Location: Czech Republic

Re: XH 1.6: Linkchecker overhaul

Post by oldnema » Thu Apr 25, 2013 12:45 pm

cmb wrote: OTOH I have some doubts that the link checker is a useful tool. It checks neither links in the template nor links in plugin-generated content (such as guestbooks). There are several external tools which do so, and which seem to be quite mature. Maybe it's better to point users to such tools. :?
Yes, I have the same opinion.

Josef
Nobody knows how much time he has left ...
http://oldnema.compsys.cz/en/?Demo_templates


Re: XH 1.6: Linkchecker overhaul

Post by cmb » Thu Jul 11, 2013 9:23 pm

Hello Community,
oldnema wrote:
cmb wrote: OTOH I have some doubts that the link checker is a useful tool. It checks neither links in the template nor links in plugin-generated content (such as guestbooks). There are several external tools which do so, and which seem to be quite mature. Maybe it's better to point users to such tools.
Yes, I have the same opinion.
Still, an internal link checker might be quite useful. In particular, the fact that it doesn't check the links in plugin-generated content can be a nice advantage; consider for instance Forum_XH, which has a lot of links that will be checked by an external tool, even though they are generated by the plugin and so cannot be wrong (assuming the plugin has no bug in this regard). So if the internal link checker is not too much work, we might stick with it.

I have thought quite a while about this issue. I came to the following conclusions:
  • Trying to check internal links "directly" (i.e. without actually requesting the URL) is far too hard to get right. For instance, in its current state the link checker does not even accept ?mailform, let alone special pages of plugins (such as ?Register), but it does accept unpublished pages or pages available only to members (the latter may not be considered bad).
  • To reach an acceptable performance the checking of the links has to be done asynchronously. This may not help too much if all links point to the same domain, but this is usually not the case.
  • Checking all links from the browser is too wasteful and not fast enough.
  • Checking the links partially client-side and partially server-side (to circumvent the Same-Origin-Policy) is too expensive to develop and to maintain (very similar algorithms would have to be deployed in PHP and JavaScript).
  • Relying on curl_multi_*() is too restrictive, as this may not be available on many hosts. Providing a fallback makes the algorithm too complex.
  • Checking if the fragment (# part) actually has a counterpart (<a name> or some id) is probably too wasteful (instead of making a HEAD request a GET would be required).
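To show why the fragment check in the last point needs the full page (and thus a GET request), here is a minimal Python sketch that looks for a matching `id` or `<a name>` in already downloaded HTML. The names `has_fragment_target` and `_AnchorCollector` are hypothetical.

```python
from html.parser import HTMLParser


class _AnchorCollector(HTMLParser):
    """Collect every id attribute and every <a name="..."> value."""

    def __init__(self) -> None:
        super().__init__()
        self.anchors = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and (name == "id" or (tag == "a" and name == "name")):
                self.anchors.add(value)


def has_fragment_target(html: str, fragment: str) -> bool:
    """Return True if the page body contains a target for #fragment."""
    collector = _AnchorCollector()
    collector.feed(html)
    collector.close()
    return fragment in collector.anchors
```

The whole response body has to be transferred and parsed for each such link, which is the cost the point above refers to.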
Fortunately I've stumbled upon stream_select(), which I expect to be more widely available than curl_multi_*(). This allows asynchronous server-side link checking. :) I have written a small plugin as PoC, which mostly consists of a single class that could be used as a replacement for the current link checker in the core. A test on my localhost with the default content of CMSimple_XH 1.5.7 (18 links) took only roughly a third of the time of the built-in link checker and had the same results. Checking the 185 links on my website was roughly 6 (!) times faster (4sec) than with the built-in link checker (25sec), and it didn't report that ./?&mailform is a "faulty internal Link, page does not exist".
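The select-based approach can be sketched like this. It is a Python illustration using the `selectors` module (the same select()-style multiplexing that PHP exposes as stream_select()), not the code of the plugin. Assumptions: plain HTTP over IPv4, HTTP/1.0 requests, targets given as hypothetical (host, port, path) tuples, and a single overall timeout.

```python
import selectors
import socket
import time
from typing import Dict, List, Optional, Tuple

Target = Tuple[str, int, str]  # (host, port, path); hypothetical shape


def check_concurrently(targets: List[Target], timeout: float = 5.0) -> Dict[Target, Optional[str]]:
    """Send HEAD requests to all targets at once; return each status line or None."""
    sel = selectors.DefaultSelector()
    results: Dict[Target, Optional[str]] = {t: None for t in targets}
    for target in targets:
        host, port, path = target
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setblocking(False)
        sock.connect_ex((host, port))  # returns at once; connect completes later
        request = f"HEAD {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")
        state = {"target": target, "out": request, "in": b""}
        sel.register(sock, selectors.EVENT_WRITE, state)
    deadline = time.monotonic() + timeout
    while sel.get_map() and time.monotonic() < deadline:
        remaining = max(0.0, deadline - time.monotonic())
        for key, events in sel.select(timeout=remaining):
            sock, state = key.fileobj, key.data
            try:
                if events & selectors.EVENT_WRITE:
                    sent = sock.send(state["out"])
                    state["out"] = state["out"][sent:]
                    if not state["out"]:
                        # Request fully sent; now wait for the response.
                        sel.modify(sock, selectors.EVENT_READ, state)
                    continue
                chunk = sock.recv(4096)
            except BlockingIOError:
                continue  # not ready after all; wait for the next event
            except OSError:
                chunk = b""  # connection failed or was reset
            state["in"] += chunk
            if b"\r\n" in state["in"] or not chunk:
                line = state["in"].split(b"\r\n", 1)[0].decode("latin-1")
                results[state["target"]] = line or None
                sel.unregister(sock)
                sock.close()
    # Anything still registered has run into the timeout.
    for key in list(sel.get_map().values()):
        sel.unregister(key.fileobj)
        key.fileobj.close()
    sel.close()
    return results
```

All connections are open at the same time, so the total duration is roughly that of the slowest target instead of the sum of all of them. Note that name resolution in `connect_ex()` still blocks in this sketch.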

As the link "profile" may be very different on other CMSimple installations, I want to ask you to try the plugin on your website and compare the performance and results with the built-in link checker, and to report back. TIA.

Download LinkCheck_XH

Christoph

Holger
Site Admin
Posts: 3098
Joined: Mon May 19, 2008 7:10 pm
Location: Hessen, Germany

Re: XH 1.6: Linkchecker overhaul

Post by Holger » Thu Jul 18, 2013 9:04 pm

Hi Christoph,

I've made some tests and indeed it's really faster than the internal LinkCheck.

But I've noticed that redirected pages are now reported for the user to check.
For example, say you have a sub-navigation with a list of internal links in a newsbox.
The links are redirected to pages like "Impressum"...

The internal check does not report anything, but the plugin gives a hint:
Hints:

Link: Impressum
Link target: ?sidenav:Impressum
Error: Linked page is redirected, please check the link.
HTTP status code: 302
Holger
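The difference reported above comes down to how status codes are mapped to result categories. A minimal Python sketch of one possible mapping follows; the function name `classify_status` and the category names are assumptions, and treating every 3xx as a hint is just one choice (an internal redirect such as the 302 above could equally be counted as OK).

```python
from typing import Optional


def classify_status(code: Optional[int]) -> str:
    """Map an HTTP status code (or None for 'no response') to a result category."""
    if code is None:
        return "fail"
    if 200 <= code < 300:
        return "ok"
    if 300 <= code < 400:
        return "hint"  # redirected; the user is asked to check the link
    return "fail"
```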
