Ok folks, as promised I had a cuppa or two and banged together a quick
_proof_of_concept_ Perl script for extracting my favoured "dot kludge"
version of the embedded metadata. I've attached the script to this
email. It's far from perfect (for instance, it assumes each META element
sits on a single line of the HTML document) but it does prove that
parsing the dot-separated SCHEME, TYPE, etc. sub-element info from the
schema and element information is pretty trivial. The bit that does that
is a whopping four lines of Perl. So the complexity of parsing is one con
that we can knock off the list for the "dot kludge".
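To make the "four lines" claim concrete without opening the attachment, here is a minimal standalone sketch of the dot splitting. The helper name parse_dotted_name is made up for illustration and is not part of the attached script; it assumes names of the form SCHEMA.ELEMENT followed by zero or more SUBELEMENT.VALUE pairs.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# parse_dotted_name is a hypothetical helper, not part of dcgrab.pl.
# It splits a META NAME such as "DC.author.SCHEME.name" into the schema,
# the element, and a list of [subelement, value] pairs.
sub parse_dotted_name {
    my ($name) = @_;
    my ($schema, $element, @rest) = split /\./, $name;
    my @pairs;
    while (@rest) {
        my $subelement = shift @rest;
        # A dangling sub-element with no value gets an empty string.
        my $value = @rest ? shift @rest : '';
        push @pairs, [$subelement, $value];
    }
    return ($schema, $element, @pairs);
}

my ($schema, $element, @pairs) = parse_dotted_name('DC.author.SCHEME.name');
print "$schema $element ", join(' ', map { "$_->[0]='$_->[1]'" } @pairs), "\n";
```

A plain name such as DC.title simply comes back with an empty list of pairs.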
The standard output of the script is a DCES SGML DTD conforming document (I
hope!). You feed it your HTML with the embedded metadata on standard input
- there are no command line options or flash bits (it's 2.35am and I've
been here for the last 17 hours... :-) ). As an example, the ROADS
Software home page (<URL:http://www.roads.lut.ac.uk/>) comes out as:
<!DOCTYPE dublinCore PUBLIC '-//OCLC//DTD Dublin core v.1//EN'>
<dublinCore>
<title>ROADS Project Software Distribution</title>
<author SCHEME='name'>Martin Hamilton</author>
<author SCHEME='e-mail'>[log in to unmask]</author>
<author SCHEME='name'>Jon Knight</author>
<author SCHEME='e-mail'>[log in to unmask]</author>
<subject SCHEME='abstract'>ROADS Software and Technical
Information</subject>
<subject SCHEME='keywords'>ROADS, software, technical</subject>
<identifier>http://www.roads.lut.ac.uk/</identifier>
<objecttype>Official Project Document</objecttype>
<form SCHEME='IMT'>text/html</form>
<language SCHEME='ISO639'>en</language>
</dublinCore>
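For reference, an output line such as the <form> one above would come from a META element along the lines of the one in this sketch. The input line is an assumption worked backwards from the script's logic; the actual markup of the ROADS page is not reproduced in this mail. The pattern has the same shape as the one in the script, with /i added so lower-case tags also match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative input line only; not copied from the real ROADS home page.
my $line = '<META NAME="DC.form.SCHEME.IMT" CONTENT="text/html">';

# Capture the dotted NAME and the CONTENT value from a single-line META element.
my ($name, $content) =
    $line =~ /<META\s+NAME\s*=\s*"?([a-zA-Z0-9.\-]+)"?\s+CONTENT\s*=\s*"?([^"]*)"?\s*>/i;

print "name=$name content=$content\n" if defined $name;
```

Here the name carries the schema (DC), the element (form) and one sub-element pair (SCHEME, IMT), which is what ends up as <form SCHEME='IMT'>.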
Note that this is less than a page of Perl, and about a quarter of that is
my comments at the top. If people like the idea, I might tart it up, make
it more robust and stick it on CPAN as a Perl module. However, coding it
in C, Visual Basic or Java is left as an exercise for the reader... :-)
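On the robustness point, one possible way to lift the single-line restriction is to read the whole document into one string and match against that, since \s also matches newlines. This is only a sketch under assumptions: extract_meta is a hypothetical helper that is not in the attached script, and it requires CONTENT values to be double-quoted.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# extract_meta is a hypothetical helper, not part of dcgrab.pl.
# It takes a whole HTML document as one string and returns [name, content]
# pairs for each META element before </HEAD>, even when a tag spans lines.
sub extract_meta {
    my ($html) = @_;
    # Keep only the text before </HEAD>; use everything if it is missing.
    my ($head) = split /<\/HEAD>/i, $html, 2;
    $head = '' unless defined $head;
    my @found;
    while ($head =~ /<META\s+NAME\s*=\s*"?([a-zA-Z0-9.\-]+)"?\s+CONTENT\s*=\s*"([^"]*)"\s*>/gi) {
        push @found, [$1, $2];
    }
    return @found;
}

# Illustrative input with a META element broken over two lines.
my $html = join "\n",
    '<HEAD>',
    '<META NAME="DC.language.SCHEME.ISO639"',
    '      CONTENT="en">',
    '</HEAD>',
    '';

for my $pair (extract_meta($html)) {
    print "$pair->[0] => $pair->[1]\n";
}
```

To use it on real input you would slurp standard input first, for example with a localised $/ set to undef before reading <STDIN>.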
Tatty bye,
Jim'll
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Jon "Jim'll" Knight, Researcher, Sysop and General Dogsbody, Dept. Computer
Studies, Loughborough University of Technology, Leics., ENGLAND. LE11 3TU.
* I've found I now dream in Perl. More worryingly, I enjoy those dreams. *
#!/usr/bin/perl
#
# dcgrab.pl : Script to grab embedded DCES elements from the HEAD of an
# HTML document and spit them out as a DCES SGML document
#
# Author: [log in to unmask]
#
# $Id$
#
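$notseenendhead=1;
$nodtd=1;
while($notseenendhead) {
  $line=<STDIN>;
  # Stop at end of input so a document with no </HEAD> cannot loop forever.
  last if(!defined($line));
  $notseenendhead=0 if($line =~ /<\/HEAD>/i);
  next if(!($line=~/<META/i));
  # Skip META lines that do not match, rather than reusing a stale $1 and $2.
  next if(!($line=~/<META\s+NAME\s*=\s*"?([a-zA-Z0-9\.\-]+)"?\s+CONTENT\s*=\s*"?([^"]*)"?\s*./i));
  $name = $1;
  $content = $2;
  next if(!($name =~ /\./));
  # Dot kludge: the name is SCHEMA.ELEMENT followed by SUBELEMENT.VALUE pairs.
  ($schema,$identifier,$rest)=split(/\./,$name,3);
  if($schema eq "DC") {
    if($nodtd) {
      # Emit the DOCTYPE and root element once, before the first DC element.
      print STDOUT <<"SGMLHEAD";
<!DOCTYPE dublinCore PUBLIC '-//OCLC//DTD Dublin core v.1//EN'>
<dublinCore>
SGMLHEAD
      $nodtd=0;
    }
    print STDOUT " <$identifier";
    # Each SUBELEMENT.VALUE pair becomes an attribute on the element.
    while(defined($rest) && $rest ne "") {
      ($subelement,$value,$rest)=split(/\./,$rest,3);
      print STDOUT " $subelement='$value'";
    }
    print STDOUT ">$content</$identifier>\n";
  }
}
print STDOUT "</dublinCore>\n" if(!$nodtd);
exit(0);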