On Wed, 25 Sep 1996, Jon Knight wrote:
:-(
Great minds think alike...! (or fools seldom...)
Rapidly taking shape at http://www.ncl.ac.uk/~napm1/dublin_core/ is a
Newcastle stab at exactly the same thing...!
I've been working on this for a couple of days, with a LOT of help from
Tony McDonald (DESIRE) and Donal Hannah (Netskills), and wasn't going to
announce it until I had finished the frills...
This is my first foray into Perl/CGI, and so it is an excuse to learn a
bit of Perl as much as anything else...
Anyway, as Jon has announced his utility, I'll let you know about mine
now too, even though it's not quite finished...
Feel free to take a look and comment, but bear in mind that:
- it's under development, and so I may have broken it in the 30 seconds you
  happen to be looking!
- I still need to fix:
  - date coding for the CURRENT date
  - the author's e-mail address (currently appears empty)
- I still need to add:
  - required fields that are actually REQUIRED
  - DC metadata actually ON each page
  - a log of those using the system
> As I see things, one of the problems we've got with Dublin Core is
> getting a large base of DC metadata "out there". We've already decided
> that one of the biggest bits of "out there" at the moment is HTML
> documents. Therefore I've written a _very_ simple minded Perl script
> that lets you "inject" some DC metadata into an existing HTML document.
Yup... I had hoped to get this working, and then hit various lists with
the URL, plus a link to some papers (those in Ariadne and D-Lib?) to
raise awareness...
> The HTML document can be pulled straight from the web (the Perl script
> uses the libwww Perl module available at a CPAN archive near you) and
> writes a new HTML document out on its standard output that contains HTML
> 2.0 style <META> elements in the <HEAD>. It automatically inserts the
> TITLE into DC.title, the URL into DC.identifier and the text/html IMT
> into DC.form. You can also specify the author name, author email,
> keywords and language on the command line using the -a, -e, -k and -l
> arguments respectively.
>
> The intention with this script wasn't to create a general purpose DC
> processing tool. Instead the idea was to have something _now_ that
> injects at least a little basic metadata into existing HTML 2.0 documents
> nice and easily. If you've got lots of documents it might be a handy way
> of inserting some basic DC metadata into all of them.
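
For illustration, the injection step Jon describes can be sketched without any network access, roughly as follows. This is not Jon's script: a plain regular-expression match on </HEAD> stands in for the HTML::Parse tree walk, the names dc_meta_block and inject_dc are made up for the example, and the sample page, URL and author are placeholders.

```perl
#!/usr/bin/perl
# Illustrative sketch only: dc_meta_block() and inject_dc() are hypothetical
# helper names, and a regex stands in for a real HTML parser.
use strict;
use warnings;

# Build the <META> lines for whichever fields were supplied, in a fixed order.
sub dc_meta_block {
    my (%f) = @_;
    my @order = qw(title author subject form identifier language);
    my $out = qq{<META NAME="beginpackage" CONTENT="Dublin Core">\n};
    for my $name (@order) {
        next unless defined $f{$name};
        (my $value = $f{$name}) =~ s/"/&quot;/g;   # escape double quotes
        $out .= qq{<META NAME="DC.$name" CONTENT="$value">\n};
    }
    $out .= qq{<META NAME="endpackage" CONTENT="Dublin Core">\n};
    return $out;
}

# Pick up the <TITLE> text and insert the block just before the first </HEAD>.
sub inject_dc {
    my ($html, $url, %extra) = @_;
    my ($title) = $html =~ m{<title[^>]*>\s*(.*?)\s*</title>}is;
    my $block = dc_meta_block(
        title      => $title,
        form       => '(scheme=IMT) text/html',
        identifier => "(scheme=URL) $url",
        %extra,
    );
    $html =~ s{(</head\s*>)}{$block$1}i;
    return $html;
}

# Placeholder input: a tiny page held in a string rather than fetched.
my $page = "<HTML><HEAD><TITLE>Test page</TITLE></HEAD><BODY>Hi</BODY></HTML>\n";
print inject_dc($page, 'http://localhost/test.html',
                author => '(type=name) A. N. Author');
```

Run as it stands, this prints the sample page with the metadata package sitting immediately before </HEAD>. A page with no </HEAD> at all comes back unchanged, which is one reason a real parser is the safer choice for anything beyond a quick experiment.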
Paul
/======================================================================\
| Paul Miller |
| Graphics & GIS Advisor, University Computing Service |
| University of Newcastle, Claremont Tower, Claremont Road, Newcastle |
| upon Tyne NE1 7RU. tel (0191) 222 8212/8039, fax (0191) 222 8765 |
| |
| e-mail [log in to unmask] WWW http://www.ncl.ac.uk/~napm1/ |
| [log in to unmask] http://www.ncl.ac.uk/~ngraphic/ |
\======================================================================/
#!/usr/public/bin/perl
#
# dc_inject.pl - Script to inject Dublin Core metadata into an HTML document
#
# Author: [log in to unmask]
#
# $Id: dc_inject.pl,v 1.1 1996/09/25 04:11:56 jon Exp $
#
# needs libwww-perl - cf. <URL:http://www.oslonett.no/home/aas/perl/www/>
# and Perl 5.
use LWP::UserAgent;
use HTML::Parse;
use URI::URL;
use Getopt::Std;
require HTML::Element;
getopts("a:b:de:hk:l:p:P");

if ($opt_h) {
    print STDERR <<"EndOfHelp";
Usage: $0 [-a author_name] [-b url_base] [-d] [-e author_email] [-h]
[-k keywords] [-l language] [-p proxy] [-P] <url>
EndOfHelp
    exit(1);
}
$AuthorName = $opt_a || $ENV{AUTHORNAME} || "Unknown";
$AuthorEmail = $opt_e || $ENV{AUTHOREMAIL} || $ENV{USER}."\@localhost";
$Keywords = $opt_k || $ENV{KEYWORDS} || "none";
$Language = $opt_l || $ENV{LANGUAGE} || "en";
$debug = $opt_d || 0;
$BASE = $opt_b || "http://localhost/";
$ua = new LWP::UserAgent;
$ua->env_proxy unless $opt_P;
$ua->proxy(['http', 'ftp', 'gopher', 'wais'], $opt_p) if $opt_p;
$ua->agent('dc_inject.pl libwww-perl/5.00');
# Get the URL we want to add the metadata to.
$url = $ARGV[0];
chomp($url);
print STDERR "Examining <URL:$url>...\n" if $debug;

# We're only interested in URLs of HTML documents (.htm or .html)
if ($url =~ /\.html?$/) {
    # Make relative URLs into absolute URLs.
    if ($url !~ /^http:/i) {
        $url = url($url, $BASE)->abs->as_string;
    }
    # Build a request to go and get the document
    $req = new HTTP::Request 'GET', $url;
    $req->header('Accept' => 'text/html');
    # send request
    $res = $ua->request($req);
    # check the outcome
    if ($res->is_success) {
        $html = parse_html($res->content);
        $intitle = 0;
        $html->traverse(\&callback, 0);
    } else {
        print STDERR "Error: " . $res->code . " " . $res->message . "\n";
    }
}
exit;
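
# Called by traverse() for every node in the parse tree. Text nodes arrive
# as plain strings; elements arrive as HTML::Element objects.
sub callback {
    my ($node, $startflag, $depth) = @_;

    # Plain text: echo it, and remember it if we are inside <TITLE>.
    if (!ref($node)) {
        print "$node\n";
        if ($intitle) {
            $title = $node;
            $intitle = 0;
        }
        return(1);
    }
    if (($startflag == 1) && ($node->tag() eq "title")) {
        $intitle = 1;
    }
    # Emit the metadata package just before the closing </HEAD> tag.
    if (($startflag == 0) && ($node->tag() eq "head")) {
        print STDOUT <<"EndOfMETA";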
<!-- Dublin Core Metadata Package -->
<META NAME="beginpackage" CONTENT="Dublin Core">
<META NAME="DC.title" CONTENT=" $title">
<LINK REL=SCHEMA.dc HREF="http://purl.org/metadata/dublin_core_elements#title">
<META NAME="DC.author" CONTENT="(type=name) $AuthorName">
<META NAME="DC.author" CONTENT="(type=email) $AuthorEmail">
<LINK REL=SCHEMA.dc HREF="http://purl.org/metadata/dublin_core_elements#author">
<META NAME="DC.subject" CONTENT="(scheme=keywords) $Keywords">
<LINK REL=SCHEMA.dc HREF="http://purl.org/metadata/dublin_core_elements#subject">
<META NAME="DC.form" CONTENT="(scheme=IMT) text/html">
<LINK REL=SCHEMA.dc HREF="http://purl.org/metadata/dublin_core_elements#form">
<LINK REL=SCHEMA.imt HREF="http://sunsite.auc.dk/RFC/rfc/rfc1521.html">
<META NAME="DC.identifier" CONTENT="(scheme=URL) $url">
<LINK REL=SCHEMA.dc HREF="http://purl.org/metadata/dublin_core_elements#identifier">
<META NAME="DC.language" CONTENT="(scheme=ISO.639) $Language">
<LINK REL=SCHEMA.dc HREF="http://purl.org/metadata/dublin_core_elements#language">
<META NAME="endpackage" CONTENT="Dublin Core">
EndOfMETA
    }
    if ($startflag == 1) {
        print $node->starttag();
    } else {
        print $node->endtag();
    }
    return(1);
}