On 3 Nov 2005, at 14:34, Pete Johnston wrote:
> Good stuff.
Thanks Pete. I would like to direct follow-ups to DC-GENERAL if
possible.
> I confess I haven't looked at the options for configuring your
> extractor, but I wondered whether you had considered adding support
> for GRDDL [1]?
It has been a case of first things first to date. It has been important
to set up a framework in which it is relatively easy to add handlers
for other data formats (more technical details below), and other
application profiles.
The majority of structured HTML metadata on the Web follows a
scheme that is compatible with the DC recommendation. We
particularly wanted to meet the UK e-GMS standard for metadata,
and that follows suit.
Though there has been skepticism about the implementation of the
e-GMS policy, if you look in the right places UK government Web
sites are a significant source of structured metadata these days.
The pity is that the search engines these sites use don't seem to
do justice to that data.
MKSearch is not intended to replace the type of free text searches
that general users have grown accustomed to, but should enable
those who have taken the trouble to catalogue their content to see
the fruits of their labour, and do so more precisely.
> That way you wouldn't be limited to extracting RDF data embedded
> according to the conventions described by the DC-in-X/HTML spec, but
> you could extract RDF data embedded according to any set of
> conventions that was identified by an HTML profile and was
> GRDDL-enabled (see e.g. the recent thread here on "Naked Metadata",
> especially Alan Cox' message [2], and Ian Davis' "Embedded HTML" [3]
> as an example) - and (when DCMI gets around to GRDDL-enabling it,
> which is in the pipeline) that would include the case of the
> DC-in-X/HTML spec.
>
> It seems to me GRDDL offers a very flexible approach to
> encoding/extracting RDF data in/from XHTML (and indeed from other XML
> formats).
Yes, I have been following these discussions. The beta release is
partly intended to gather suggestions like this and weigh up the
priorities. Our current project funding, from a Department of Trade
and Industry SMART award, expires in January, so we're looking for
commercial opportunities too!
Here's the science bit...
MKSearch indexing is provided by Java classes that implement the
Simple API for XML (SAX) ContentHandler interface. They must
also implement our extension to that interface, RdfContentHandler,
which covers tasks such as expanding DC.Title into a URI value.
https://svn.mkdoc.com/mksearch/doc/javadoc/com/mkdoc/sax/RdfContentHandler.html
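As a rough illustration of the DC.Title expansion, here is a minimal
sketch. It is not the MKSearch code and does not use the real
RdfContentHandler interface: the class DcMetaSketch and its methods
are hypothetical, and it assumes the "DC." prefix always maps to the
Dublin Core 1.1 elements namespace rather than reading a schema.DC
link from the page.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: collects <meta name="DC.x" content="..."/> pairs
// from well-formed XHTML and keys them by an expanded property URI.
class DcMetaSketch extends DefaultHandler {
    // Dublin Core Metadata Element Set 1.1 namespace (assumed mapping for "DC.")
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    final Map<String, String> properties = new LinkedHashMap<>();

    // Expands e.g. "DC.Title" to DC_NS + "title"; returns null for anything else.
    // Refinements such as "DC.Date.modified" are skipped in this sketch.
    static String expand(String name) {
        if (name == null || name.length() <= 3
                || !name.regionMatches(true, 0, "DC.", 0, 3)) {
            return null;
        }
        String rest = name.substring(3);
        if (rest.indexOf('.') >= 0) {
            return null;
        }
        return DC_NS + rest.toLowerCase(Locale.ROOT);
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        // The default parser is not namespace-aware, so match on qName.
        if (!"meta".equalsIgnoreCase(qName)) {
            return;
        }
        String property = expand(atts.getValue("name"));
        String content = atts.getValue("content");
        if (property != null && content != null) {
            properties.put(property, content);
        }
    }

    // Parses an XHTML string and returns the expanded property/value pairs.
    static Map<String, String> extract(String xhtml) {
        DcMetaSketch handler = new DcMetaSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return handler.properties;
    }

    public static void main(String[] args) {
        String page = "<html><head>"
                + "<meta name=\"DC.Title\" content=\"Road data\"/>"
                + "</head></html>";
        System.out.println(extract(page));
    }
}
```

Keeping the expansion in one small method means another application
profile could swap in its own prefix-to-namespace mapping without
touching the event handling.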
SAX content handlers are designed to process a stream of XML
parse events with content, but it is also possible to write parsers
that process other content types and emit SAX events. The UK
Information Asset Register [1] is a text-format implementation of
the e-GMS that could be indexed with MKSearch, for example.
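To make the idea of a non-XML parser emitting SAX events concrete,
here is a minimal sketch. It assumes a made-up "Field: value" line
format, which may well differ from the real IAR records, and the
class TextRecordSketch and its methods are hypothetical rather than
part of MKSearch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: turns "Field: value" text lines into SAX events,
// so that any ContentHandler can consume a plain-text record.
class TextRecordSketch {

    // Reads the text and reports each field as an element inside <record>.
    // Field names are passed through as-is; a real parser would need to
    // map them to valid XML names.
    static void parse(Reader source, ContentHandler handler)
            throws IOException, SAXException {
        BufferedReader in = new BufferedReader(source);
        AttributesImpl none = new AttributesImpl();
        handler.startDocument();
        handler.startElement("", "record", "record", none);
        String line;
        while ((line = in.readLine()) != null) {
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // not a field line in this assumed format
            }
            String field = line.substring(0, colon).trim();
            if (field.isEmpty()) {
                continue;
            }
            char[] value = line.substring(colon + 1).trim().toCharArray();
            handler.startElement("", field, field, none);
            handler.characters(value, 0, value.length);
            handler.endElement("", field, field);
        }
        handler.endElement("", "record", "record");
        handler.endDocument();
    }

    // Runs the parser against a handler that just records what it receives.
    static List<String> trace(String text) {
        final List<String> events = new ArrayList<>();
        DefaultHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                events.add("start:" + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                events.add("text:" + new String(ch, start, length));
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("end:" + qName);
            }
        };
        try {
            parse(new StringReader(text), recorder);
        } catch (IOException | SAXException e) {
            throw new IllegalStateException("could not parse record", e);
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(trace("Title: Road data\nCreator: Example Agency\n"));
    }
}
```

Because the handler only ever sees SAX events, the same indexing
class could sit behind an XML parser or a text parser like this one.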
Best regards,
Phil
[1] http://www.opsi.gov.uk/iar/ see example record:
http://www.highways.gov.uk/aboutus/iar/iar4.txt
--
MKSearch (beta)
http://www.mksearch.mkdoc.org/
Free, open source metadata search engine with RDF storage and query.