We are in the process of defining the metadata to be put into web
pages for processing by our search engine, and have come across a few
related cases where we are not sure what to say. We would be
interested to know what other people have done in these cases.
Once there is a consensus on this, it would be helpful to add some
guidelines on these points to the RFC.
Underlying these questions is the assumption that the HTML file
does not need to contain a complete DC record, as long as the robot
can assemble a complete record for the search engine.
1. In some cases the metadata relates to the body of the same HTML
file, and in other cases the metadata is in an HTML file but refers
to a separate PDF-format file. How should we indicate the difference
between these to the search engine? The metadata for the PDF file
would have the URL of the PDF file in the IDENTIFIER element.
Perhaps the plain HTML files should have the IDENTIFIER element
omitted to indicate that the metadata refers to the file itself.
In this case, we do not have to tell the robot the URL, as it is
simply the URL of the page it is currently processing.
2. In a similar vein, do the files all need to say that the FORMAT
is text/html? The robot would not be reading them unless they were
text/html anyway.
3. And what about TITLE? Should the robot simply extract it from
the <TITLE> ... </TITLE> part of the HTML? If not, should we insist
that the two titles be the same?
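To make the three questions concrete, here is a rough sketch of how a robot
might fill in the missing elements when it assembles the record. It is only
an illustration of the defaults suggested above, not a description of any
existing robot. It assumes the metadata is embedded as
<META NAME="DC.xxx" CONTENT="..."> elements, and the names DCMetaParser and
assemble_record are made up for this example.

```python
from html.parser import HTMLParser


class DCMetaParser(HTMLParser):
    """Collect DC.* <meta> elements and the <title> text from one page."""

    def __init__(self):
        super().__init__()
        self.dc = {}            # e.g. {"CREATOR": "A. Author"}
        self.html_title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Assumption: elements are named "DC.<element>", case-insensitive.
            name = (attrs.get("name") or "").upper()
            if name.startswith("DC."):
                self.dc[name[3:]] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.html_title += data


def assemble_record(page_url, html):
    """Build a complete record, defaulting elements the page leaves out."""
    parser = DCMetaParser()
    parser.feed(html)
    parser.close()
    record = dict(parser.dc)

    # Question 1: no IDENTIFIER means the metadata describes this page.
    record.setdefault("IDENTIFIER", page_url)

    # Only apply the remaining defaults when the record describes the page
    # itself, not a separate file such as a PDF.
    if record["IDENTIFIER"] == page_url:
        # Question 2: the robot only reads text/html, so FORMAT is implied.
        record.setdefault("FORMAT", "text/html")
        # Question 3: fall back to the <TITLE> element if DC.Title is absent.
        title = parser.html_title.strip()
        if title:
            record.setdefault("TITLE", title)
    return record
```

With these defaults, a page carrying only DC.Creator would still yield a
record with IDENTIFIER, FORMAT and TITLE filled in, while a page whose
IDENTIFIER points at a PDF file would get nothing beyond what it states
explicitly. An explicit DC.Title takes precedence over <TITLE> here, which
is one possible answer to question 3 rather than a settled one.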
Charles and Misha