On Wed, 7 May 1997, Charles Wicksteed wrote:
> Underlying these questions is the assumption that the HTML file
> does not need to contain a complete DC record, as long as the robot
> can assemble a complete record for the search engine.
I don't think that there's any such thing as a "complete DC record" - all
of the elements in DC are optional. Of course you might have an
application that requires a fixed minimum set but that's a local thing.
> 1. In some cases the metadata relates to the body of the same HTML
> file, and in other cases the metadata is in an HTML file but refers
> to a separate PDF-format file. How should we indicate the difference
> between these to the search engine? The metadata for the PDF file
> would have the URL of the PDF file in the IDENTIFIER element.
> Perhaps the plain HTML files should have the IDENTIFIER element
> omitted to indicate that the metadata refers to the file itself.
> In this case, we do not have to tell the robot the URL, as it is
> simply the URL of the page it is currently processing.
I'd put the URL of the PDF file in there _and_ put an A element with an
HREF pointing to it in the body of your HTML document. The reason is that
if you hit a non-DC-aware search engine, end users will get your HTML
document returned, so you want to at least give them a way of getting to
the PDF document.
Having said that, I was under the impression that the embedded-in-HTML
format of DC was only intended for "self-describing" HTML documents and
not as a generic way of attaching DC to other object types. For other
object types such as PDF, PostScript, images, etc., I would use a different
format (such as PICS or a proper SGML DTD). Or is this just a sneaky way
of getting non-HTML resources indexed by HTML indexers?
> 2. In a similar vein, do the files all need to say that the FORMAT
> is text/html? The robot would not be reading them unless they were
> text/html anyway.
I'd put it in for completeness, but it's probably not required _if_ the
embedded DC in the HTML document is actually for that HTML document. You
seem to have metadata that refers to another resource with a different
type (application/pdf in this case).
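
To make the defaulting idea concrete, here is a rough sketch of what the
robot side might do. It is only an illustration under assumptions of mine:
that the DC elements are carried as META tags named "DC.<element>", and
that a missing IDENTIFIER or FORMAT means "this page itself". The names
DCMetaParser and extract_dc are made up for the example and are not part
of any real robot.

```python
from html.parser import HTMLParser


class DCMetaParser(HTMLParser):
    """Collect <META NAME="DC.xxx" CONTENT="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.record = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling us.
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = attr.get("content")
        # Assumed naming convention: NAME="DC.<element>".
        if name.startswith("dc.") and content is not None:
            # Elements may repeat, so keep a list of values per element.
            self.record.setdefault(name[3:], []).append(content)


def extract_dc(html_text, page_url):
    """Return the embedded DC record, filling in self-describing defaults."""
    parser = DCMetaParser()
    parser.feed(html_text)
    parser.close()
    record = parser.record
    # Assumption: no IDENTIFIER means the metadata describes this page.
    record.setdefault("identifier", [page_url])
    # Assumption: no FORMAT means the page's own type, text/html.
    record.setdefault("format", ["text/html"])
    return record
```

With this, a page carrying only a DC.title comes back with the page's own
URL and text/html filled in, while a page that names a PDF in IDENTIFIER
and application/pdf in FORMAT keeps those values untouched.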
> 3. And what about TITLE? Should the robot simply extract it from
> the <TITLE> ... </TITLE> part of the HTML? If not, should we insist
> that both titles should be the same?
This is the danger of overloading HTML to carry metadata for other
resources. I'd steer clear of it and use PICS, SGML, etc, etc.
Tatty bye,
Jim'll
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Jon "Jim'll" Knight, Researcher, Sysop and General Dogsbody, Dept. Computer
Studies, Loughborough University of Technology, Leics., ENGLAND. LE11 3TU.
* I've found I now dream in Perl. More worryingly, I enjoy those dreams. *