Actually, thinking about this some more, what we really want in a link
checker, rather than something like MD5 (and maybe Andy's keyword/exception
list), is some kind of difference metric. Something like (for example)
number of lines changed divided by the total number of lines in the
document (or maybe words for finer granularity). So if no lines (or
words) changed between checks the metric would be 0 and if they all
changed it would be 1. Then have the link checker configurable to ignore
changes less than (say) 0.2, otherwise flag them for someone to look at.
This would allow a website to make minor changes, etc. over time without
firing off warnings, but should pick up a total change, like the HTML
validator site turning into an adult entertainment site, pretty easily. This
would only really work for HTML (and maybe plain text) documents, but hey,
how many PDF or PostScript based porno/spam sites are there out there?
(That's not a challenge to find out, by the way, folks!)
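
To make the idea a bit more concrete, here is a rough sketch of the metric
in Python. It is only an illustration, not anything from the ROADS code: the
function names (change_ratio, needs_review) and the choice of Python's
difflib to decide which lines count as "unchanged" are my own assumptions,
and the 0.2 threshold is just the (say) figure from above. The total is
taken as the longer of the two versions so the result stays between 0 and 1.

```python
from difflib import SequenceMatcher


def change_ratio(old_text, new_text, by_words=False):
    """Return a difference metric between 0.0 (identical) and 1.0 (all changed).

    Hypothetical helper: units that difflib cannot match between the two
    versions count as changed, and the total is the longer version's length.
    """
    if by_words:
        # Finer granularity: compare whitespace-separated words.
        old_units, new_units = old_text.split(), new_text.split()
    else:
        old_units, new_units = old_text.splitlines(), new_text.splitlines()

    total = max(len(old_units), len(new_units))
    if total == 0:
        return 0.0  # two empty documents: nothing changed

    matcher = SequenceMatcher(None, old_units, new_units, autojunk=False)
    unchanged = sum(block.size for block in matcher.get_matching_blocks())
    return (total - unchanged) / total


def needs_review(old_text, new_text, threshold=0.2):
    """Flag the link for a human when the change reaches the threshold."""
    return change_ratio(old_text, new_text) >= threshold
```

A link checker would keep the text it fetched last time, call needs_review()
with the old and new copies on each run, and only raise a warning when it
returns True. Stripping the HTML tags before comparing would probably be
sensible, but that is left out here.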
Hmm, I can feel a ROADS lc.pl hack coming on... ;-)
Tatty bye,
Jim'll