Hi all,
As I see things, one of the problems we've got with Dublin Core is
getting a large base of DC metadata "out there". We've already decided
that one of the biggest bits of "out there" at the moment is HTML
documents. Therefore I've written a _very_ simple-minded Perl script
that lets you "inject" some DC metadata into an existing HTML document.
The script pulls the HTML document straight from the web (it uses the
libwww Perl module, available at a CPAN archive near you) and writes a
new HTML document to its standard output with HTML 2.0 style <META>
elements in the <HEAD>. It automatically inserts the
TITLE into DC.title, the URL into DC.identifier and the text/html IMT
into DC.form. You can also specify the author name, author email,
keywords and language on the command line using the -a, -e, -k and -l
arguments respectively.
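
To give a feel for the output, here is a minimal sketch of how those
values end up as <META> elements. The helper dc_meta_lines and the sample
values are hypothetical and only for illustration; the script itself emits
its block from a here-document and also writes <LINK> schema references,
which this sketch leaves out.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of dc_inject.pl): turn the collected
# values into the DC <META> lines, one string per element.
sub dc_meta_lines {
    my (%f) = @_;
    return (
        qq{<META NAME="DC.title" CONTENT="$f{title}">},
        qq{<META NAME="DC.author" CONTENT="(type=name) $f{author}">},
        qq{<META NAME="DC.author" CONTENT="(type=email) $f{email}">},
        qq{<META NAME="DC.subject" CONTENT="(scheme=keywords) $f{keywords}">},
        qq{<META NAME="DC.form" CONTENT="(scheme=IMT) text/html">},
        qq{<META NAME="DC.identifier" CONTENT="(scheme=URL) $f{url}">},
        qq{<META NAME="DC.language" CONTENT="(scheme=ISO.639) $f{language}">},
    );
}

# Sample values, made up for this example.
print "$_\n" for dc_meta_lines(
    title    => 'Example page',
    author   => 'A. N. Author',
    email    => 'author@localhost',
    keywords => 'metadata, example',
    url      => 'http://localhost/example.html',
    language => 'en',
);
```

Note that the values are interpolated as they are, so a double quote in a
title would need escaping before it went into a CONTENT attribute.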
The intention with this script wasn't to create a general-purpose DC
processing tool. Instead the idea was to have something _now_ that
injects at least a little basic metadata into existing HTML 2.0 documents
nice and easily. If you've got lots of documents it might be a handy way
of inserting some basic DC metadata into all of them.
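
For a batch of documents, something along these lines could drive the
script once per URL. This is only a dry-run sketch under a few assumptions:
inject_command and output_name are hypothetical helpers, dc_inject.pl is
assumed to sit in the current directory, and the dc_ output prefix is an
arbitrary choice. It prints the commands rather than running them.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build the argument list for one dc_inject.pl run.
sub inject_command {
    my ($url, %opt) = @_;
    my @cmd = ($^X, 'dc_inject.pl');    # run the script with the current perl
    push @cmd, '-a', $opt{author}   if defined $opt{author};
    push @cmd, '-e', $opt{email}    if defined $opt{email};
    push @cmd, '-k', $opt{keywords} if defined $opt{keywords};
    push @cmd, '-l', $opt{language} if defined $opt{language};
    push @cmd, $url;
    return @cmd;
}

# Hypothetical naming scheme: prefix the last path segment with "dc_".
sub output_name {
    my ($url) = @_;
    my ($name) = $url =~ m{([^/]+)$};
    $name = 'index.html' unless defined $name;
    return "dc_$name";
}

# Dry run: show one command per URL given on the command line.
for my $url (@ARGV) {
    my @cmd = inject_command($url, author => 'A. N. Author', language => 'en');
    my $shown = join ' ', map { /\s/ ? "'$_'" : $_ } @cmd;
    print "$shown > ", output_name($url), "\n";
}
```

Pass the URLs as arguments, and swap the print for a real call (for
instance via system or backticks) once the commands look right.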
Feel free to hack it about to your own ends. Feedback welcome.
Tatty bye,
Jim'll
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Jon "Jim'll" Knight, Researcher, Sysop and General Dogsbody, Dept. Computer
Studies, Loughborough University of Technology, Leics., ENGLAND. LE11 3TU.
* I've found I now dream in Perl. More worryingly, I enjoy those dreams. *
#!/usr/public/bin/perl
#
# dc_inject.pl - Script to inject Dublin Core metadata into an HTML document
#
# Author: [log in to unmask]
#
# $Id: dc_inject.pl,v 1.1 1996/09/25 04:11:56 jon Exp $
#
# needs libwww-perl - cf. <URL:http://www.oslonett.no/home/aas/perl/www/>
# and Perl 5.
use LWP::UserAgent;
use HTML::Parse;
use URI::URL;
use Getopt::Std;
require HTML::Element;
getopts("a:b:de:hk:l:p:P");
if($opt_h) {
print STDERR <<"EndOfHelp";
Usage: $0 [-a author_name] [-b url_base] [-d] [-e author_email] [-h]
[-k keywords] [-l language] [-p proxy] [-P] <url>
EndOfHelp
exit(1);
}
$AuthorName = $opt_a || $ENV{AUTHORNAME} || "Unknown";
$AuthorEmail = $opt_e || $ENV{AUTHOREMAIL} || $ENV{USER}."\@localhost";
$Keywords = $opt_k || $ENV{KEYWORDS} || "none";
$Language = $opt_l || $ENV{LANGUAGE} || "en";
$debug = $opt_d || 0;
$BASE = $opt_b || "http://localhost/";
$ua = new LWP::UserAgent;
$ua->env_proxy unless $opt_P;
$ua->proxy(['http', 'ftp', 'gopher', 'wais'], $opt_p) if $opt_p;
$ua->agent('dc_inject.pl libwww-perl/5.00');
# Get the URL we want to add the metadata to.
$url = $ARGV[0];
die "Usage: $0 [options] <url>\n" unless defined $url;
chomp($url);
print STDERR "Examining <URL:$url>...\n" if $debug;

# We're only interested in URLs of HTML documents (.htm or .html)
if ($url =~ /\.html?$/) {
    # Make relative URLs into absolute URLs.
    if ($url !~ /^http:/i) {
        $url = url($url, $BASE)->abs->as_string;
    }
    # Build a request to go and get the document
    $req = new HTTP::Request 'GET', $url;
    $req->header('Accept' => 'text/html');
    # send request
    $res = $ua->request($req);
    # check the outcome
    if ($res->is_success) {
        $html = parse_html($res->content);
        $intitle = 0;
        $html->traverse(\&callback, 0);
    } else {
        print STDERR "Error: " . $res->code . " " . $res->message . "\n";
    }
}
exit;
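# Called by traverse() for each node: on entry ($startflag true) and on
# exit ($startflag false) for elements, and once for each text node.
sub callback {
    my($node, $startflag, $depth) = @_;

    # Text nodes are plain strings rather than HTML::Element objects.
    if (!ref($node)) {
        print "$node\n";
        if ($intitle) {
            $title = $node;
            $intitle = 0;
        }
        return(1);
    }
    if (($startflag == 1) && ($node->tag() eq "title")) {
        $intitle = 1;
    }
    # Emit the metadata just before the closing </HEAD> tag.
    if (($startflag == 0) && ($node->tag() eq "head")) {
        print STDOUT <<"EndOfMETA";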
<!-- Dublin Core Metadata Package -->
<META NAME="beginpackage" CONTENT="Dublin Core">
<META NAME="DC.title" CONTENT="$title">
<LINK REL=SCHEMA.dc HREF="http://purl.org/metadata/dublin_core_elements#title">
<META NAME="DC.author" CONTENT="(type=name) $AuthorName">
<META NAME="DC.author" CONTENT="(type=email) $AuthorEmail">
<LINK REL=SCHEMA.dc HREF="http://purl.org/metadata/dublin_core_elements#author">
<META NAME="DC.subject" CONTENT="(scheme=keywords) $Keywords">
<LINK REL=SCHEMA.dc HREF="http://purl.org/metadata/dublin_core_elements#subject">
<META NAME="DC.form" CONTENT="(scheme=IMT) text/html">
<LINK REL=SCHEMA.dc HREF="http://purl.org/metadata/dublin_core_elements#form">
<LINK REL=SCHEMA.imt HREF="http://sunsite.auc.dk/RFC/rfc/rfc1521.html">
<META NAME="DC.identifier" CONTENT="(scheme=URL) $url">
<LINK REL=SCHEMA.dc HREF="http://purl.org/metadata/dublin_core_elements#identifier">
<META NAME="DC.language" CONTENT="(scheme=ISO.639) $Language">
<LINK REL=SCHEMA.dc HREF="http://purl.org/metadata/dublin_core_elements#language">
<META NAME="endpackage" CONTENT="Dublin Core">
EndOfMETA
    }
    if ($startflag == 1) {
        print $node->starttag();
    } else {
        print $node->endtag();
    }
    return(1);
}