As promised more than once to this list, the Dublin Core metadata User
Guide subgroup is ready to release a draft document suitable for review
by the meta2 list. The draft is at http://www.ckm.ucsf.edu/meta/mguide3.html
or, if you prefer, you can read the complete text included in this message.
There have been big changes. If you examined the Dublin Core minutely,
as the User Guide group did, you will have noticed that the patient
arrived at the Warwick workshop in critical condition. It had swallowed
elements without chewing, was making incoherent statements, repeating
itself, suffering from lack of structure, and hemorrhaging credibility.
The User Guide group is happy to announce that after significant surgery,
a healthier Dublin Core has been taken off the critical list, its condition
downgraded to serious. We managed to gut, clear, strengthen, and sharpen.
The 13 elements have been reduced to 10, with a net reduction of one qualifier.
Elements and qualifiers have short, long, and numeric names.
There's a great deal of work to be done, especially w.r.t. controlled
vocabularies, further simplification, and further restriction of the
problem space (eg, isn't it time to start _requiring_ some elements?)
Members of the User Guide group are
Tom Baker
Stu Weibel
Bemal Rajapatiran
Priscilla Caplan
Frank Roos
Rebecca Gunther
John Kunze
What follows are a few notes about the changes, then the text of the User
Guide itself (or you can use the URL above). Comments are solicited.
I'll have spotty e-mail contact for most of the next 3 weeks, so you may
wish to contact someone from the group if you need an earlier response.
-John
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
John A. Kunze                 +1 415-502-6660      Mgr, Advanced Tech Group
530 Parnassus Ave, Box 0840                        Library and Center for
San Francisco, CA 94143-0840  Fax: 415-476-4653    Knowledge Management
=-=-=-=-=-=-=-=-= University of California, San Francisco =-=-=-=-=-=-=-=-=
Some changes proposed for Dublin Core
-------------------------------------
1. eliminated Coverage
   too specialized, some functions covered by flags qualifier, less is more
2. renamed ObjectType to Type
   less jargon, why qualify this element (e.g., no ObjectAuthor?)
3. new element Contributor subsumes OtherAgent and Publisher
   less jargon, fewer similar elements
4. subsumed Source into Relation
   natural fits, less is more
5. renamed Identifier to Resource
   Why:
   (a) dunno any large legacy metadata systems that don't currently have
       an internal field called something like Identifier (eg, "internal
       identifier") to tag their own records, and expect they will lay
       claim (legitimately) to the name very early
   (b) URN/URL's will appear in many elements (e.g., Relation, misc. links
       to external data and other metadata); in a metadata record with lots
       of links, using "Identifier" is about as meaningful as using "Integer"
       as a variable name in a program with lots of numbers
   (c) leaving it as Resource, with the Content optionally _not_ a link
       opens some way cool possibilities, eg, we can now insert a resource
       inside its own metadata (see example section 3.4)
6. eliminated the qualifiers "type" and "identifier"; added "flags" qualifier
   - type (hopeless, never defined)
   - identifier (confusing, functionality easily subsumed)
   - flags (succinct set of element hints)
Added to User Guide...
-------------------
7. took a stab at problem statement and scope restriction (section 5)
8. took stab at conceiving a controlled vocabulary of not only Content,
but also element names, qualifiers, and qualifier values
========== Text of User Guide (by Warwick User Guide Group) ========
DRAFT DRAFT DRAFT
Guide to Creating Core Descriptive Metadata
http://www.ckm.ucsf.edu/meta/mguide3.html
1. Purpose of this Guide
This document is intended to help people who have no training in cataloging
or indexing to create simple descriptive records for information resources
(for example, electronic documents). Creators of these records include
authors, editors, and World-Wide Web (WWW) [1] site administrators. The
descriptive information about an information resource is called metadata,
which simply means data about data.
A metadata record consists of a handful of elements designed to make it easy
to find the resource during an internet search. This guide discusses the
layout and content of Core Metadata elements, and how to use them in
composing a record. The Core favors documents because traditional text
resources are fairly well-understood, and it applies well to other resources
in proportion to how closely their metadata resembles typical document
metadata. So this guide works best for resources with document-like
metadata.
Core Metadata is a small set of easily created elements that can be applied
to most kinds of information resources. Also known as the Dublin Core, this
set was conceived by an international interdisciplinary group of
professionals from librarianship, computer science, and text encoding in
order to improve internet resource discovery. The consensus they achieved is
the basis of an evolving extended vocabulary of metadata.
Your metadata is first put to work when your record is submitted to an
indexing program. This can happen passively, by just leaving it in a file
where a site administration or Web crawler program regularly checks. That
program builds indexes based on your and other metadata records so that
subsequent searches, by scanning the indexes, will possibly match on
metadata that you distributed among the record elements. Records
corresponding to matching metadata make up the search results.
Your metadata is put to work for a second time when it is used to display
results to the searcher, who then makes a final selection that is only as
well-informed as your metadata is accurate, complete, and consistent with
that of all the other indexed records. One goal of this document is to
encourage metadata consistency as a way to promote complete retrieval and
intelligible display across widely disparate sources of descriptive records.
With inconsistent metadata, search results tend either to omit desired
records or to drown them out in a flood of irrelevant records. Effectively,
inconsistent metadata hides desired records.
The following section gives examples illustrating how to write down elements
so that software can make use of them. The guide then steps through
instructions for completing each element in the Core. The remaining sections
list the rules for writing down elements (element syntax), sources for
up-to-date element set summaries, standardized terms for describing content,
and related guidelines for metadata preparation. Finally, before the
glossary and references, come a section describing the purpose of Core
metadata and a section on how to invent new elements.
2. Metadata Record Examples
In this section appear examples of the kinds of descriptive records that you
will be able to create using the instructions in this guide. They illustrate
general principles and teach something about element construction along the
way. There is more than one kind of metadata notation; the one used
throughout this guide is compatible with most Web software.
2.1. Example: Stand-Alone Metadata
The first example of a metadata record is contained in a computer file by
itself. It describes a photograph in another file that has a location given
by a Uniform Resource Locator (URL) [2]. The entire record file looks like
this.
<META name="dc.Title" content="Kita Yama (Japan)">
<META name="dc.Author" content="Kertesz, Andre">
<META name="dc.Date" content="1968">
<META name="dc.Type" content="e/photograph">
<META name="dc.Form" content="GIF">
<META name="dc.Resource" content="http://foo.bar.zaf/kertesz/kyama">
Each record element definition begins with ``<META'' and ends with ``>''. In
this example each definition happens to fit on one line, but in general a
definition can span several lines. You will notice that all element names in
this guide begin with the characters ``dc.'' to indicate their membership in
the Dublin Core. Alphabetic case is not significant except within element
Content (and even then usually only for display), so the first element
definition could have been written equivalently as
<meta NAME="DC.title" Content="Kita Yama (Japan)">
This record file is submitted to an indexing program by leaving it in a
conventionally agreed-upon area of a Web site. If that program is a
comprehensive Web crawler, it builds indexes based on your and other
metadata records drawn from all over the internet.
2.2. Example: Metadata Contained in a Resource
The next example of a metadata record is contained in a file alongside the
document that it describes. The document is a short poem expressed in HTML,
the Web's Hypertext Markup Language [3].
<HTML>
<HEAD>
<TITLE>Song of the Open Road</TITLE>
<META name="dc.Title" content="Song of the Open Road">
<META name="dc.Author" content="Nash, Ogden">
<META name="dc.Type" content="e/document">
<META name="dc.Date" content="1939">
<META name="dc.Form" content="HTML">
<META name="dc.Resource" content="$File">
</HEAD>
<BODY><PRE>
I think that I shall never see
A billboard lovely as a tree.
Indeed, unless the billboards fall
I'll never see a tree at all.
</PRE></BODY>
</HTML>
Indexing programs understand that the metadata record starts after the
``<TITLE'' line and ends before the ``</HEAD'' line, and are thus able to
extract metadata automatically. Meanwhile, the metadata in the same file
does not appear during normal document formatting or printing, and
metadata-aware Web browsers may even be able to exploit it.
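As a rough illustration of how an indexing program might pull such a record
out of an HTML file, here is a minimal sketch in Python. It relies on the
standard library's html.parser module; the class and function names are
invented for this example and are not part of the Core, and a real indexer
would certainly do more.

```python
from html.parser import HTMLParser


class DublinCoreExtractor(HTMLParser):
    """Collect (name, content) pairs from META tags whose name starts with 'dc.'."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag names and attribute names.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        content = attributes.get("content") or ""
        # Element names are case-insensitive; Content keeps its original case.
        if name.lower().startswith("dc."):
            self.elements.append((name.lower(), content))


def extract_core_metadata(html_text):
    """Return the Core elements found in html_text, in document order."""
    parser = DublinCoreExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.elements
```

Because the names are folded to lower case, ``dc.Title'' and ``DC.title''
end up as the same element, while the Content is passed through untouched
for display.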
Another point of interest is the Resource (dc.Resource) Content, which
normally contains a URL showing the document's Web address (as in the
previous example). In this case, the characters ``$File'' probably mean that
the Web site has been set up cleverly to replace those characters with a URL
for the current location before the indexing program comes to collect
metadata. This is a hypothetical site-specific solution to a common problem
of maintaining metadata when the file location changes, and support for such
a mechanism is outside the scope of this guide.
2.3. Example: Metadata for a Non-Electronic Resource
Electronic indexing and search systems work equally well for locating
non-electronic and electronic resources. In this example, the resource
described is a musical score that does not exist in electronic form at all.
As in the first example, this record is contained in a computer file by
itself.
<META name="dc.Title" content="Six Sonatas for Harpsichord">
<META name="dc.Author; Role=composer"
content="Honauer, Leontzi; Schobert, Johann">
<META name="dc.Date" content="1775">
<META name="dc.Type" content="p/score">
<META name="dc.Resource; Scheme=LCCN"
content="M219.H65P">
<META name="dc.Resource; Scheme=none"
content="UCB Music Lib., Case X">
This record illustrates element definitions that span more than one line,
element Qualifiers, and repeated elements. The Author element indicates a
specialization of the author Role for this work (the resource) to that of
composer. The Author content actually lists two composers' names separated
by a semi-colon, which is a shorthand for listing two separate Author
elements. The last two element definitions show the long way of repeating an
element. In fact, these two Resource elements are required to appear in
separate definitions because their Qualifiers differ (the Scheme value, in
this case).
A scheme qualifier specifies the format of the element content. If not
included, the default Scheme for the Resource element dictates that the
content contains a URL, so for non-electronic resources a Scheme must be
specified. In this example, the first Resource element contains a US Library
of Congress Call Number (LCCN) and the second contains human readable
location instructions.
In the Type element (``p/score''), the prefix ``p/'' indicates a primary
resource type of physical. By contrast, the prefix ``e/'' of the previous
examples indicated a primary type of electronic data. The secondary type of
the resource is always listed after the prefix. Aggregate types can also be
used to indicate a resource that is actually a collection of resources all
having a specific basic type; aggregate types may seem exotic, but in fact
their metadata is similar enough to document metadata that they are
straightforward to deal with from the resource discovery point of view.
Details are given later under the Type element description section.
3. The Core Elements, Version 0.2
This section lists each Core element by its full, short, and numeric names
(described in the section on Element names), and includes its default
qualifiers. Default Qualifiers are in effect unless overridden by explicitly
specified Qualifiers. For each element there are guidelines to assist you in
creating metadata Content, whether you do it ``from scratch'' or by
converting an existing record in another format. Here are the Core elements.
ELEMENTS                                 QUALIFIERS
author  (au, 1)   form        (fm, 6)    role   (ro, 30)
title   (ti, 2)   language    (la, 7)    scheme (sc, 31)
subject (su, 3)   resource    (rs, 8)    flags  (fl, 32)
type    (ty, 4)   contributor (co, 9)
date    (da, 5)   relation    (rn, 10)
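Software that accepts all three kinds of name will need some way to map them
to one canonical form. The following Python sketch shows one possible lookup
built from the table above. The function name is invented, and treating a
numeric name as plain decimal digits is an assumption made only for this
example.

```python
# (full name, short name, numeric name), copied from the table above.
CORE_NAMES = [
    ("author", "au", 1), ("title", "ti", 2), ("subject", "su", 3),
    ("type", "ty", 4), ("date", "da", 5), ("form", "fm", 6),
    ("language", "la", 7), ("resource", "rs", 8),
    ("contributor", "co", 9), ("relation", "rn", 10),
    ("role", "ro", 30), ("scheme", "sc", 31), ("flags", "fl", 32),
]

# Every alias (full, short, or numeric) points at the full name.
_LOOKUP = {}
for _full, _short, _number in CORE_NAMES:
    for _alias in (_full, _short, str(_number)):
        _LOOKUP[_alias] = _full


def full_name(name):
    """Return the full name for a full, short, or numeric name (case-insensitive)."""
    key = str(name).strip().lower()
    if key not in _LOOKUP:
        raise KeyError("unknown element or qualifier name: %r" % (name,))
    return _LOOKUP[key]
```

With this table, full_name("AU"), full_name("author"), and full_name(1) all
resolve to the same element.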
It is not an error to include elements that not all software or people
understand how to use. There is never a guarantee that all Web browsers, for
example, will recognize all the Core elements. On the other hand,
unrecognized elements do no harm and will generally be displayed and
indexed. A later section on Element Syntax describes how you can prevent
that, if need be, through the ``ds'' Qualifier flags. It also explains the
purpose of semi-controlled vocabularies, such as the Resource Metadata
Vocabulary (RMV) referred to in the following element descriptions.
Creators of metadata, also known as ``metaloguers'', will not find all the
answers in these guidelines. Ultimately they will let their decisions be
guided by the overall goal of maximizing discoverability of their own
resources. This guide documents version 0.2 of the Core, a descendant of the
original Dublin Core (version 0.1).
[ Note to Reviewers: The descriptions in the following sections
are still quite incomplete. ]
3.1. Author (au, 1)
Author: A creator of the work that is primarily responsible for its
intellectual content.
Default Qualifiers: role=writer, scheme=none, flags=p.
Authors may be names of people (default) or organizations (flags=-p). Names
of people that appear in Author are specified with family name first, then
given names, initials, honorifics, etc. If no comma appears in a name, the
family name is taken to be the first word; otherwise it includes all words up
to the first comma. For example,
<META name="dc.Author" content="Kennedy John F.">
<META name="dc.Author" content="Mao, Tse Tung">
<META name="dc.Author" content="de la Madrid, Carmen">
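The family-name rule just stated is simple enough to sketch in a few lines
of Python. The function name is hypothetical, and the sketch applies only
to personal names (the default); it is not meant for organization names.

```python
def split_personal_name(content):
    """Split a personal name from Author Content into (family_name, remainder).

    With a comma, the family name is everything before the first comma.
    Without one, the family name is taken to be the first word.
    """
    text = content.strip()
    if "," in text:
        family, rest = text.split(",", 1)
    else:
        parts = text.split(None, 1)
        family = parts[0] if parts else ""
        rest = parts[1] if len(parts) > 1 else ""
    return family.strip(), rest.strip()
```

Applied to the three examples above, this yields the family names
``Kennedy'', ``Mao'', and ``de la Madrid'' respectively. The last case
shows why the comma matters for multi-word family names.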
If the name of an organization appears in Author, use the `-p' flag
qualifier to prevent futile attempts to distinguish family names from given
names in metadata such as
<META name="dc.Author;flags=-p" content="Digital Equipment Corporation">
<META name="dc.Author;flags=-p" content="Apple Computer, Inc.">
<META name="dc.Author;flags=-p"
content="United States Government, National Institutes of Health">
If the work's primary creator is not an author in the traditional sense, you
may use the ``role'' qualifier, as in
<META name="dc.Author; role=photographer" content="Lange, Dorothea">
<META name="dc.Author; role=composer" content="Sullivan Sir Arthur S.">
<META name="dc.Author; role=librettist" content="Gilbert, Sir William">
Other contributors to the work (e.g., translators, editors, publishers) may
be specified in the Contributor element. Search systems are likely to take
the liberty of combining Author and Contributor elements in one Author-like
index, so don't be surprised if searches on either element end up checking
both for matches.
3.2. Date (da, 5)
Date: The date of publication, and other important dates.
Default Qualifiers: role=publication, scheme=YMD.
Use the date of original publication, regardless of when it was last
modified (see ``modified'' role below). If original publication meant making
the work available electronically, use the date it was put online (e.g., a
Web page). If the work is a digitization of an original, use the date of
original publication. If the work is a ``living document'' whose content is
regularly replaced (as opposed to merely updated), use the date of last
replacement.
[ Note to Reviewers: Dates are important and multiple scheme
support complicates support and documentation. How about
discouraging all but the YMD scheme described here? ]
The default scheme for Date is YMD, which specifies a day in the format
``YYYYMMDD''. For example,
<META name="dc.Date" content="19950506">
specifies May 6, 1995. The format may be shortened to ``YYYYMM'' to indicate
only the month, or to ``YYYY'' to indicate only the year. It may also be
lengthened if more precision is needed, the full form being
``YYYYMMDD:HHMMSS.N'', where the hours (24-hour clock), minutes, and seconds
part after the colon is optional, as is the string of fractional seconds (N)
following the period. This format provides software with an unambiguous date
that is simple to convert to a variety of display formats (e.g., 6 May 1995)
and easily ordered with respect to other dates (e.g., for range checking or
chronological sorting).
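To suggest how easily software can handle this format, here is a minimal
Python sketch of a YMD parser. The function name is invented, and two
choices are assumptions of this example rather than rules of the Core:
omitted parts default to the earliest moment (January, day 1, midnight),
and fractional seconds are accepted only when a time is present.

```python
import re
from datetime import datetime

# YYYY, optionally MM, optionally DD, optionally :HHMMSS, optionally .N
_YMD = re.compile(
    r"^(?P<year>\d{4})"
    r"(?:(?P<month>\d{2})"
    r"(?:(?P<day>\d{2})"
    r"(?::(?P<hms>\d{6})(?:\.(?P<frac>\d+))?)?"
    r")?)?$"
)


def parse_ymd(content):
    """Convert Date Content in the YMD scheme to a datetime object."""
    match = _YMD.match(content.strip())
    if match is None:
        raise ValueError("not a YMD date: %r" % (content,))
    hms = match["hms"] or "000000"
    frac = match["frac"] or ""
    # Keep at most microsecond precision from the fractional seconds.
    microsecond = int(frac[:6].ljust(6, "0")) if frac else 0
    return datetime(
        int(match["year"]),
        int(match["month"] or 1),
        int(match["day"] or 1),
        int(hms[0:2]),
        int(hms[2:4]),
        int(hms[4:6]),
        microsecond,
    )
```

Once parsed, dates can be compared or subtracted directly, which is all
that range checking and chronological sorting require.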
The role Qualifier defaults to date of publication, as defined by the
publisher. Several other roles may be relevant, such as
* role=created, the time of creation or capture (e.g., when a photograph
was taken, when data from an automatic sensor was gathered)
* role=modified, the last time the resource was changed
* role=verified, the last time the resource was reviewed, whether changed
or not
You may express a time range using the Flags Qualifier, as in
<META name="dc.Date; role=created; flags=b" content="19950506:091500">
<META name="dc.Date; role=created; flags=e" content="19950506:141500">
which specifies a five-hour period beginning at 9:15am and ending at 2:15pm.
3.3. Form (fm, 6)
Form: The electronic data format of the resource, used to identify the
software and possibly hardware needed to display or operate the resource.
Default Qualifiers: role=best1.0, scheme=rmv.
If the resource is not in electronic form, specify the content as ``none''.
If the resource is in an electronic form that is unknown, omit the element.
Some Content designations from the RMV scheme (which borrows heavily from
MIME [8] XXX) are listed here.
aiff     etx      man      pgm      rtf      tar      wav
avi      gif      me       pnm      rtx      tcl      xbm
bcpio    gtar     movie    ppm      sh       tex      xpm
bin      hdf      mpeg     ps       shar     texinfo  xwd
cdf      html     ms       qt       snd      tiff     zip
cpio     ief      oda      ras      src      tsv
csh      jpeg     pbm      rgb      sv4cpio  txt
dvi      latex    pdf      roff     sv4crc   ustar
The default role Qualifier, ``best1.0'', indicates that this Form of the
resource is the best of the available Forms. There are exactly eleven such
role Qualifiers listed in the RMV: best0.0, best0.1, ..., best0.9, best1.0.
In a system reminiscent of Web content negotiation, they allow the
information provider to decide which Form of a resource is optimally rich in
terms of information conveyed. For example, if a document is available in
either HTML or plain text (which lacks the structural tagging of HTML), the
HTML Form would be assigned a role of ``best1.0'' while the plain text Form
might get ``best0.7'' to indicate a subjective degradation of 30% from the
optimum.
If there are multiple Resource URLs in the same record, you can associate a
different Form with each by pairing element definitions in the order they
appear in the record. For example, in the element list below, the second
Form pairs with the second Resource to indicate that it is judged to be 30%
poorer in information (though it may be cheaper, faster to download, or less
burdensome for the receiver) than the first.
<META name="dc.Resource" content="http://foo.bar.zaf/fancy">
<META name="dc.Resource" content="http://foo.bar.zaf/plain">
<META name="dc.Form; role=best1.0" content="html">
<META name="dc.Form; role=best0.7" content="txt">
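The pairing rule can be sketched in Python as follows. This is only an
illustration: the function names are invented, the record is assumed to
have been read already into a list of (name, content) pairs in record
order, and every Form role is assumed to be one of the ``bestN.N'' values.

```python
def _split_name(name):
    """Return (element, qualifiers) from a Name such as 'dc.Form; role=best0.7'."""
    parts = [part.strip() for part in name.split(";")]
    qualifiers = {}
    for part in parts[1:]:
        key, _, value = part.partition("=")
        qualifiers[key.strip().lower()] = value.strip().lower()
    return parts[0].lower(), qualifiers


def rank_forms(elements):
    """Pair the Nth Form with the Nth Resource and sort by declining quality."""
    resources = []
    forms = []
    for name, content in elements:
        element, qualifiers = _split_name(name)
        if element == "dc.resource":
            resources.append(content)
        elif element == "dc.form":
            role = qualifiers.get("role", "best1.0")  # default role for Form
            forms.append((content, float(role[len("best"):])))
    # zip stops at the shorter list, so unpaired elements are ignored here.
    paired = [(url, form, quality) for url, (form, quality) in zip(resources, forms)]
    return sorted(paired, key=lambda item: item[2], reverse=True)
```

A client that prefers the richest Form would take the first entry of the
result; one that prefers a cheaper download might walk the list from the
other end.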
3.4. Resource (rs, 8)
Resource: A string of characters that identifies a resource.
Default Qualifiers: role=whole, scheme=uri, flags=l.
The default scheme Qualifier is URI (e.g., a URL) with the default flag
indicating a link to the resource. Other schemes available in the RMV
include ISBN, ISSN, and SICI. By default, the role value of ``whole''
indicates that the Content identifies the entire resource. Other roles
available include ``abstract'', ``table of contents'', and ``outline''.
Some creators of metadata will wish to maintain links by automatic processes
that make it possible to move files around without changing the URIs
embedded in metadata. By convention, a Resource link Content of ``dummy''
may be used as a signal to those processes to replace the Content with a
correct URI.
For certain resources, such as very small documents, it may be useful to
include the entire document in the descriptive record. In this case, use the
``flags=-l'' Qualifier to indicate that the resource identified is available
immediately (or inline) as the Resource element Content. For example, here
is a tiny poem represented inside one of its own metadata elements:
<META name="dc.Resource; flags=-l"
content="Candy/Is dandy/But liquor/Is quicker.">
3.5. Language (la, 7)
Language: The human or computer language in which the intellectual content
is expressed.
Default Qualifiers: role=resource, scheme=rmv.
If language is not relevant, omit. By default, the Content specifies the
language of the resource. A role Qualifier of ``metadata'' means that the
language of metadata element Content (e.g., Title, Subject) is being
identified instead of the language of the resource (the default).
Language is specified via a one- to three-letter code. Here is a partial
list of language codes in the RMV. The human language codes are based on ISO
639 [XXX].
ab Abkhazian        fr French            lt Lithuanian      sd Sindhi
aa Afar             fy Frisian           mk Macedonian      si Singhalese
af Afrikaans        gl Galician          mg Malagasy        ss Siswati
sq Albanian         ka Georgian          ms Malay           sk Slovak
am Amharic          de German            ml Malayalam       sl Slovenian
ar Arabic           el Greek             mt Maltese         so Somali
hy Armenian         kl Greenlandic       mi Maori           es Spanish
as Assamese         gn Guarani           mr Marathi         su Sundanese
ay Aymara           gu Gujarati          mo Moldavian       sw Swahili
az Azerbaijani      ha Hausa             mn Mongolian       sv Swedish
ba Bashkir          iw Hebrew            na Nauru           tl Tagalog
eu Basque           hi Hindi             ne Nepali          tg Tajik
bn Bengali, Bangla  hu Hungarian         no Norwegian       ta Tamil
dz Bhutani          is Icelandic         oc Occitan         tt Tatar
bh Bihari           in Indonesian        or Oriya           te Telugu
bi Bislama          ia Interlingua       om (Afan) Oromo    th Thai
br Breton           ie Interlingue       ps Pashto, Pushto  bo Tibetan
bg Bulgarian        ik Inupiak           fa Persian         ti Tigrinya
my Burmese          ga Irish             pl Polish          to Tonga
be Byelorussian     it Italian           pt Portuguese      ts Tsonga
km Cambodian        ja Japanese          pa Punjabi         tr Turkish
ca Catalan          jw Javanese          qu Quechua         tk Turkmen
zh Chinese          kn Kannada           rm Rhaeto-Romance  tw Twi
co Corsican         ks Kashmiri          ro Romanian        uk Ukrainian
hr Croatian         kk Kazakh            ru Russian         ur Urdu
cs Czech            rw Kinyarwanda       sm Samoan          uz Uzbek
da Danish           ky Kirghiz           sg Sangho          vi Vietnamese
nl Dutch            rn Kirundi           sa Sanskrit        vo Volapuk
en English          ko Korean            gd Scots Gaelic    cy Welsh
eo Esperanto        ku Kurdish           sr Serbian         wo Wolof
et Estonian         lo Laothian          sh Serbo-Croatian  xh Xhosa
fo Faeroese         la Latin             st Sesotho         ji Yiddish
fj Fiji             lv Latvian, Lettish  tn Setswana        yo Yoruba
fi Finnish          ln Lingala           sn Shona           zu Zulu
bas Basic           for FORTRAN          .....
c   C               jav Java
cpp C++             tcl TCL
3.6. Type (ty, 4)
Type: The abstract category of the resource, such as poem, film, database,
etc.
Default Qualifiers: role=none, scheme=rmv.
The Type element delineates broad abstract categories of resource usage
designed to assist resource discovery. Note that this is different from the
Form element, which identifies a concrete electronic data format. The Type
consists of a primary type and a secondary type separated by a `/'. The
three primary base types are given by letters:
Primary Base Type: e=electronic
Electronic data suitable for download and simple read or playback. Most
resources on the Web are plain HTML, text, or bitmap images fitting
into this primary Type. Video and audio clips that require only simple
playback also fit in this Type, as do data newly generated by server
programs upon each access (e.g., weather reports, randomly selected
proverbs, stills from a moving camera).
Primary Base Type: a=active
Active resource or program operated by the consumer, such as a database
search or a game. If accessing a resource involves significant user
operation (other than scroll or, for video, fast forward, rewind,
etc.), then specify this as the primary Type. For example, any
telnet or mailto URL identifies a resource that is considered
``active''. An important class of examples is HTML forms that provide
input services such as database search, user feedback, or goods
purchase. This is not to say that any HTML document that allows for
input is necessarily of this Type. It is the metaloguer's job to assign
a Type that maximizes the resource's chances of being found by a
typical search for resources of this kind. Therefore, an HTML form
containing 1500 lines of prose and a small standardized input area
(e.g., for submitting reader comments) would not be considered an
``active'' resource for the purposes of resource discovery.
Primary Base Type: p=physical
Physical object or non-electronic resource. This includes books, art
works, artifacts, merchandise, etc.
A base type may optionally be preceded by an aggregate type when the
resource consists of a collection of resources of that base type. The two
aggregate types are also given by letters:
Aggregate Type: cX (c=collection)
Collection of resources of primary base type X. Examples include a
bibliographic database and a set of photographs.
Aggregate Type: gX (g=graph)
Graph or hierarchy of resources of primary base type X. One example is
a Web site of type ``ge/document'', which is an entry point into a set
of interlinked resources of base type ``e/document''. Another example
is a stream-encoded filesystem hierarchy of type ``ge/software
source'', such as a shareware ZIP [XXX] archive. Note that while a very
large number of Web documents are interlinked, and hence could be
viewed as entry points into an aggregate of documents, it is up to the
metaloguer to decide if that view actually strengthens discoverability
of the resource.
The following four Type definitions correspond, respectively, to an
electronic map, a collection of offline books, a set of interlinked
documents, and a database of records:
<META name="dc.Type" content="e/map">
<META name="dc.Type" content="cp/book">
<META name="dc.Type" content="ge/document">
<META name="dc.Type" content="ce/record">
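For software, taking a Type apart is mostly a matter of reading the prefix
letters. The Python sketch below is one possible reading of the rules
above; the function name and the form of the returned tuple are invented
for this example.

```python
PRIMARY_TYPES = {"e": "electronic", "a": "active", "p": "physical"}
AGGREGATE_TYPES = {"c": "collection", "g": "graph"}


def parse_type(content):
    """Split Type Content into (aggregate, primary, secondary).

    The aggregate part is None when no aggregate letter precedes the base type.
    """
    prefix, slash, secondary = content.strip().partition("/")
    if not slash:
        raise ValueError("Type needs a primary/secondary pair: %r" % (content,))
    prefix = prefix.lower()
    aggregate = None
    if len(prefix) == 2:
        # An aggregate letter may precede the primary base type letter.
        if prefix[0] not in AGGREGATE_TYPES:
            raise ValueError("unknown aggregate type in %r" % (content,))
        aggregate = AGGREGATE_TYPES[prefix[0]]
        prefix = prefix[1]
    if prefix not in PRIMARY_TYPES:
        raise ValueError("unknown primary base type in %r" % (content,))
    return aggregate, PRIMARY_TYPES[prefix], secondary.strip()
```

For instance, ``cp/book'' comes apart as a collection of physical resources
whose secondary type is book.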
The secondary Type takes terms from the controlled vocabulary RMV unless
another vocabulary is given with the scheme Qualifier. If no Type element is
defined in a record, ``e/document'' is assumed. Some common secondary types
from the RMV are listed here.
Secondary Type: document
Examples include articles, reports, essays, guides, journals, memos,
dissertations, poems, and stories.
Secondary Type: book
[XXX how do you say in any meaningful way how this differs from a
document?]
Secondary Type: video
Simple video recordings.
Secondary Type: audio
Simple audio recordings.
Secondary Type: record
A group or row of related data elements.
Secondary Type: map
A drawing conveying physical locations of objects.
Secondary Type: photograph
A photograph that is more important as an image than for the subject
it captures.
Secondary Type: artwork
Includes photographic images of designs, drawings, and paintings.
Secondary Type: score
Images of compositions in musical notation.
Secondary Type: software source
Source code for a program.
Secondary Type: software executable
An executable program.
A role Qualifier of ``genre'' is used to specify a tertiary resource
category but this time without any prefixes. [XXX can we borrow some library
cataloging lists?]
3.7. Contributor (co, 9)
Contributor: A person or agency, other than Author, who has made noteworthy
secondary contributions to the resource.
Default Qualifiers: role=publisher, scheme=rmv.
This is similar to Author, but indicates a level of participation
subordinate to that of an Author. Contributor is likely to be indexed along
with Author. Use the role Qualifier (publisher by default) to specify
different kinds of contributors. Examples of roles from the RMV include
composer editor librettist photographer translator
distributor illustrator mirror publisher
For the role of publisher, the Content can specify a publisher in the
traditional sense or the electronic publisher. The emphasis is on who
originally made the intellectual content of the resource available to an
audience. For example, if the work is a digitization of a formally published
book, give the formal publisher. You may specify non-original publishers
using roles such as ``distributor'' or ``mirror''.
3.8. Relation (rn, 10)
Relation: An important relationship between the resource being described and
another resource.
Default Qualifiers: role="derived from", scheme=uri, flags=l.
The relationship is given by the role Qualifier and defaults to ``derived
from'' if omitted. The other resource is identified by the Content in a
manner similar to the Resource element: the default scheme Qualifier is URI
(e.g., a URL) with the default flag indicating a link to the resource. Other
schemes available in the RMV include ISBN, ISSN, and SICI.
For the role Qualifier you may specify relationship values from the RMV,
some of which are given below. To understand the sense, construct a sentence
with the resource being described first, then the relationship, and then the
second resource.
contained in derived from precedes supplement for translated from
contains source for follows supplemented by translated to
3.9. Subject (su, 3)
[ Note to Reviewers: Subject is very important and very difficult.
The RMV is imagined to provide part of the foundation for it. ]
Subject: The topic or topics of the resource being described.
Default Qualifiers: role=none, scheme=rmv.
Without a qualifier, a subject is assumed to contain uncontrolled
descriptive terms (phrases). If at all possible, use a controlled or
semi-controlled vocabulary; in this case, give the appropriate scheme
Qualifier, for example
<META name="dc.Subject; scheme=MeSH" content="heart">
To put an abstract or summary in a Subject, use the Qualifier
``role=abstract''. Use semi-colons as a shorthand to include more than one
subject phrase per element Name, as in
<META name="dc.Subject" content="Metadata; Resource Description">
3.10. Title (ti, 2)
Title: The name of the work being described.
Default Qualifiers: role=none, scheme=none.
If there is a title or an obvious substitute, use it. Sometimes a
descriptive first line or sentence is a reasonable substitute. If the work
is in HTML and has a Title tag, its content may be useful. In cases where
there is no obvious substitute, omit the Title. Repeat the element if there
is more than one title. Examples:
<META name="dc.Title" content="OCLC/NCSA Metadata Workshop Report">
<META name="dc.Title" content="The Jumping Frog of Calaveras County">
4. Basic Principles of Descriptive Elements
The notation (one of several) described in this guide is based on the HTML
META tag. The character set assumed is standard UNICODE [XXX] with the UTF-8
[XXX] encoding. This allows for a very wide range of writing systems while
remaining compatible with the traditional ASCII [XXX] character set. All
element names begin with the three characters ``dc.'' to indicate their
membership in the Dublin Core.
4.1. Element Parts and Syntax
Each descriptive element definition has a Name part and a Content part, as
in
<META name="dc.Author" content="Browning, Elizabeth">
Sometimes the Name contains one or more Qualifier parts, each separated from
the next by a semi-colon (`;'). A Qualifier part consists of a Qualifier and
Value with an equals sign (`=') in between, as in
<META name="dc.Author; role=composer" content="Sullivan, Arthur">
In summary, an element definition looks like
<META name="dc.NAME; QUALIFIER=VALUE" content="CONTENT">
where the ``; QUALIFIER=VALUE'' part is optional and may be repeated. Spaces
around the `;' are optional.
Any metadata element may be omitted or repeated. As a shorthand for repeated
elements, one element can share its Name part (including Qualifiers, if any)
with a second element by appending a `;' (semi-colon) and the Content of
the second to the Content of the first. So the first two lines below are
equivalent to the third line.
<META name="dc.Author" content="Marx, K">
<META name="dc.Author" content="Engels, F">
<META name="dc.Author" content="Marx, K; Engels, F">
To repeat elements with different Qualifiers, list each element definition
separately; the shorthand cannot be used. Use a backslash, as in ``\;'', if
you want to include a literal semi-colon in the Content. To include a
literal double-quote (`"') in this metadata notation, write it as %22 [XXX].
Within a descriptive record, element order has no shared meaning, but for
repeated elements the relative order (e.g., among multiple authors) should
be preserved by display software.
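The syntax rules of this section can be sketched in a few lines of Python. This is only an illustration, not part of the Core: the function names `split_unescaped` and `parse_element` are hypothetical, and lowercasing Qualifier names and trimming spaces around each Content value are assumptions made here for convenience.

```python
def split_unescaped(text, sep=";"):
    """Split text on sep, treating a backslash-escaped sep as a literal."""
    parts, current, i = [], [], 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text) and text[i + 1] == sep:
            current.append(sep)          # ``\;'' stands for a literal semi-colon
            i += 2
        elif text[i] == sep:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(text[i])
            i += 1
    parts.append("".join(current))
    return parts


def parse_element(name, content):
    """Return (element, qualifiers, values) for one META name/content pair."""
    pieces = [p.strip() for p in name.split(";")]
    head = pieces[0]
    if not head.lower().startswith("dc."):
        raise ValueError("not a Dublin Core element name: " + name)
    element = head[3:]
    qualifiers = {}
    for piece in pieces[1:]:
        key, _, value = piece.partition("=")
        # Assumption: store Qualifier names in lowercase (names are case-insensitive).
        qualifiers[key.strip().lower()] = value.strip()
    # The shorthand: one Content string may carry several repeated values.
    # %22 is this notation's way of writing a literal double-quote.
    values = [v.strip().replace("%22", '"') for v in split_unescaped(content)]
    return element, qualifiers, values
```

For example, `parse_element("dc.Author", "Marx, K; Engels, F")` yields the element `Author`, no Qualifiers, and the two values `Marx, K` and `Engels, F` in their original order.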
4.2. Element and Qualifier Names
[ Note to Reviewers: Big changes here. Besides the full name,
there is now a short name and a numeric name (explanation below).
]
Each element and qualifier has three names, all case-insensitive: full,
short, and numeric. A full name may be any number of letters and digits
beginning with a letter. Full names, which appear in all the examples so
far, are pronounceable and suggestive of function. They are designed to be
easy to use in conversation and clear to read in texts and examples.
The short name is from one to four letters beginning with a letter. It
suggests the element function but takes less space in stored records than
the full name. Short names are also convenient ways to refer to search
indexes in queries entered via a keyboard. Such queries can be stored in
readable text files and run non-interactively (e.g., from email robots, user
profiles, and automatic periodic search agents).
The third name is actually a number that is decimal (base 10) by default, or
hexadecimal (base 16) if it begins with ``0x''. It provides a compact,
unambiguous, language-independent way of identifying an element while hiding
any obvious element function. Numeric names may range from 1 to
2,147,483,647 (each number fitting in 32 signed bits). There is no
particular significance to the numeric values assigned to names; for
example, clustered values do not imply related elements, low (high) values
do not convey special status, etc.
The following are all equivalent ways of referring to ``Relation'':
Relation RELATION reLAtIOn rn rN Rn 10 0xa 010 0XA
At the time of writing, these are the Core element and qualifier names.
   ELEMENTS                                        QUALIFIERS
   author  (au, 1)     form        (fm, 6)         role   (ro, 30)
   title   (ti, 2)     language    (la, 7)         scheme (sc, 31)
   subject (su, 3)     resource    (rs, 8)         flags  (fl, 32)
   type    (ty, 4)     contributor (co, 9)
   date    (da, 5)     relation    (rn, 10)
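As a rough illustration of how software might treat the three kinds of name as equivalent, here is a minimal Python sketch. The names `NAMES`, `LOOKUP`, and `canonical_name` are hypothetical; the triples are copied from the table in this section, and returning `None` for an unknown name is an assumption.

```python
import re

# (full, short, numeric) names copied from the table in this section.
NAMES = [
    ("author", "au", 1), ("title", "ti", 2), ("subject", "su", 3),
    ("type", "ty", 4), ("date", "da", 5), ("form", "fm", 6),
    ("language", "la", 7), ("resource", "rs", 8),
    ("contributor", "co", 9), ("relation", "rn", 10),
    ("role", "ro", 30), ("scheme", "sc", 31), ("flags", "fl", 32),
]

# One table keyed by full name, short name, and number alike.
LOOKUP = {}
for full, short, number in NAMES:
    LOOKUP[full] = full
    LOOKUP[short] = full
    LOOKUP[number] = full


def canonical_name(name):
    """Map a full, short, or numeric name to its full name, or None if unknown."""
    text = name.strip().lower()              # all names are case-insensitive
    if re.fullmatch(r"0x[0-9a-f]+", text):   # hexadecimal if it begins with 0x
        key = int(text, 16)
    elif re.fullmatch(r"[0-9]+", text):      # otherwise decimal by default
        key = int(text, 10)
    else:
        key = text
    return LOOKUP.get(key)
```

With this table, `canonical_name("RELATION")`, `canonical_name("10")`, `canonical_name("0xa")`, and `canonical_name("010")` all return `relation`.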
Because hyphens and punctuation are not allowed, robust name
recognition is possible even if names are altered in transcription. For
example, suppose a literal-minded user ``cuts and pastes'' from quotation
mark to quotation mark in following this hypothetical instruction fragment:
... in the input box labeled Index enter the word "Au-
thor." Then select the box labeled Term and enter ...
The recipient (a program or person) of the ``pasted'' element name would
unfortunately also receive quotation marks, a hyphen, a newline, spaces, and
a period. Knowing what characters are not allowed is critical to correct
interpretation of such names.
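A recipient program might recover the intended name simply by discarding every character that cannot occur in a name. The sketch below assumes that names consist only of ASCII letters and digits, as described earlier in this section; the function name `clean_name` is hypothetical.

```python
import re


def clean_name(pasted):
    """Drop every character that cannot appear in an element name."""
    # Quotation marks, hyphens, newlines, spaces, and periods all disappear;
    # the result is lowercased because names are case-insensitive.
    return re.sub(r"[^A-Za-z0-9]", "", pasted).lower()
```

Applied to the pasted fragment from the example, quotation marks, hyphen, line break, and period included, this returns `author`.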
4.3. Element Content and Semi-Controlled Vocabularies
Most Content data is created as visible ASCII characters (letters,
digits, punctuation). Upon upload into indexing systems, alphabetic case
sensitivity in Content is usually lost, and sequences of punctuation and
whitespace (e.g., spaces, tabs) in text are often collapsed into one space
character. Case and punctuation are usually preserved upon output by display
software.
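The kind of normalization an indexing system typically applies can be sketched as follows. This is an assumption about common practice, not a requirement of the Core; the name `index_key` is hypothetical, and trimming leading and trailing spaces is an added convenience.

```python
import re


def index_key(content):
    """Fold case and collapse runs of punctuation and whitespace to one space."""
    # [\W_]+ matches any run of characters that are not letters or digits.
    return re.sub(r"[\W_]+", " ", content.lower()).strip()
```

Display software would keep the original Content string alongside such a key, so that case and punctuation survive on output.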
Content data may be restricted to a ``controlled vocabulary'', which is a
limited set of consistently used and carefully defined terms. This can
dramatically improve search results because computers are good at matching
words character by character but weak at understanding the way people refer
to one concept using different words (having very different characters).
Without basic terminology control, inconsistent or incorrect metadata can
profoundly degrade the quality of search results.
One cost of a controlled vocabulary is in needing an administrative body to
review, update, and disseminate the vocabulary. For example, the US Library
of Congress Subject Headings (LCSH) [4] and the US National Library of
Medicine Medical Subject Headings (MeSH) [5] are formal vocabularies
indispensable for searching rigorously cataloged collections and both
require significant support organizations. Another cost is in having to
train searchers and creators of metadata so that they know when using MeSH,
for example, to enter ``myocardial infarction'' instead of the more
colloquial ``heart attack''.
Basic terminology control for internet resource discovery can be had through
use of a ``semi-controlled vocabulary'', which consists of a stable base of
officially approved terms augmented by an informally evolving set of
commonly used terms [XXX what about stuff like automatically generated
co-occurrence lists?]. For example, an agency choosing to administer a
semi-controlled vocabulary might use the LCSH as a stable base, and augment
it with a separate database of terms that is publicly writable in order to
keep the administrative burden low. To make such an effort succeed, creators
and searchers must have ready access to all up-to-date terms and the agency
must be responsive to vocabulary changes called for by the community.
In this guide there is a semi-controlled vocabulary called the Resource
Metadata Vocabulary (RMV) [6] [XXX] that is put to a slightly expanded
purpose. Instead of specifying just Content terms, it also provides a
comprehensive list of all metadata element Names, Qualifiers, and Values.
You may query the RMV for the most current term lists and definitions at any
time by accessing the URL http://www.ckm.ucsf.edu/rmv.html [XXX not yet].
4.4. Kinds of Element Name Qualifiers
As mentioned, an element Name part may sometimes include one or more
Qualifier parts separated by semi-colons, each having the form
QUALIFIER=VALUE. Qualifiers and Values belong to the semi-controlled
vocabulary RMV, just as element Names do. Here is a summary of the base
Qualifier set with some common Values indicated. Every Qualifier has a
default value that depends on the element that it qualifies, but not every
Qualifier makes sense with each element.
Element Qualifier: role (ro, 30)
Distinguishes a subtype of the element. Examples: writer, composer,
photographer, librettist, translator, editor, publisher, illustrator.
Element Qualifier: scheme (sc, 31)
The syntax or controlled vocabulary used to convey the metadata
Content, which is different from the resource content (which is
described by the Form element). For most elements, scheme defaults to
``none'', meaning uncontrolled displayable, indexable data. Examples:
HTML 1.0 RMV LCCN LCSH MESH SICI URI
Element Qualifier: flags (fl, 32)
This Qualifier succinctly encapsulates a number of advisory hints
useful to display and indexing software. It consists of one or more
letters, each of which flags a basic aspect of the element content. A
letter is optionally preceded by a `-' in order to turn off the hint. A
preceding `+' turns it on, as if the letter appeared all by itself. In
the absence of an explicit flag letter, default meanings apply to all
elements as indicated below, and any additional element-specific
defaults are cited in the element description. The letters may be used
in any order; for example, ``flags=-d-s+l'' is equivalent to
``flags=l-s-d''. Letters and their meanings follow.
Flag=d
Unless the display software knows otherwise, it is advised not to
display the Content (e.g., it may be an obscure administrative
number that would only confuse a user). By default, all elements
are assumed to be displayable.
Flag=s
Unless the indexing software knows otherwise, it is advised not to
index the Content for searching (e.g., it may be an icon bitmap
that makes little sense to search). By default, all elements are
assumed to be indexable for searching.
Flag=l
The element Content specifies a URL or URN [XXX] that links to
data that is the real content. In most cases, element Content is
immediately available in the record without requiring a link to be
followed. Two important exceptions are the Resource and Relation
elements, for which the Content part contains a link by default.
To make a resource immediately available (that is, within the
metadata record itself), specify ``flags=-l'' with these elements.
Flag=p
Indexing software is advised to interpret the element Content as
the name of a person following the format described later in the
Author element section (that uses a comma to delineate the family
name). By default, elements are not assumed to contain names of
persons; for the Author element in particular, however, this
assumption is reversed. You may override the prevailing assumption
for an element by explicitly turning this flag on or off.
Flag=b
The element Content specifies the beginning of a range associated
with the resource being described (e.g., the start Date of a set
of measurements). By default, an element is not associated with a
range.
Flag=e
The element Content specifies the end of a range associated with
the resource being described (e.g., the stop Date of a set of
measurements). By default, an element is not associated with a
range.
[ Note to Reviewers: The `g' flag is weak and ripe for
elimination. It may yet indicate a way to allow for some
geospatial tagging. ]
Flag=g
The element Content specifies a geospatial region associated with
the resource being described. By default, elements are not
associated with geospatial regions. For example, here's a record
for a map that has a conventional title as well as a second title that
is rather more useful for searching than for display (to the
average searcher).
<META name="dc.Title" content="Great Barrier Reef Marine Park">
<META name="dc.Title; flags=g; scheme=FGDC"
content="E1800000, W1800000, N900000, S900000">
<META name="dc.Date" content="19960912">
<META name="dc.Type" content="e/map">
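The flags notation described in this section (letters optionally preceded by `+' or `-') can be read with a short routine like the one below. It is a sketch only: the name `parse_flags` is hypothetical, it reports just the letters that appear explicitly, and it leaves the per-element defaults to the caller.

```python
def parse_flags(value):
    """Return {letter: True/False} for the letters named in a flags Value."""
    settings = {}
    state = True                 # a bare letter means "on", same as `+'
    for ch in value.strip():
        if ch == "+":
            state = True
        elif ch == "-":
            state = False
        elif ch.isalpha():
            settings[ch] = state
            state = True         # a sign applies only to the next letter
        else:
            raise ValueError("unexpected character in flags: " + repr(ch))
    return settings
```

Because the result is a mapping, letter order does not matter: `parse_flags("-d-s+l")` and `parse_flags("l-s-d")` compare equal.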
5. Purpose of Core Metadata: Resource Discovery
Resource discovery is the problem of finding just the right resources to
answer a particular question. Ideally, this means finding not only all the
relevant answers, but also not having them diluted by lots of irrelevant
answers. Answer sets rarely consist of the resources themselves (they would
get too big), but rather of metadata records that link to resources. There
are two main steps in discovery.
(a) Ask questions (search) until the answer (result) set is small
enough.
(b) Visually scan the answers (records) to select the desired
resources.
The first step demands that certain record elements be easily gathered and
used by automatic indexing programs (such as Web crawlers) in order to build
indexes needed by search programs. As already mentioned, a metadata record
that you create is submitted to an indexing program, perhaps passively, by
just leaving it in an agreed-upon file. That program builds indexes based on
your and other metadata records so that subsequent searches scanning the
indexes will possibly match on your metadata and include your record in the
answer set. The indexing and search programs, whether local or
internet-wide, are not otherwise discussed here (some internet-wide examples
are AltaVista[2], Lycos[3], Yahoo[4], etc.).
The second step imposes a separate requirement that some metadata elements
be displayable and coherent enough to inform the searcher to discard or to
select each resource. People usually prefer to make final selection
decisions themselves because it's too hard to tell the computer exactly what
the criteria will be before seeing the answer set. My decision might, for
example, be in reaction to a note you added to your metadata record that was
too anomalous to be exploited by the indexing and search system.
[ Note to Reviewers: The next sentence makes a strong statement
that would allow us to limit the scope of the DC effort. It may
not be strong enough; however, it would be liberating to adopt a
more limited approach. ]
There are other uses of metadata and other aspects of resource discovery,
but Core Metadata was designed only to support the two steps above.
Moreover, the Core provides a foundation for generic but uniform
internet-wide resource discovery rather than for rich, comprehensive
resource descriptions. As a result, this guide does not discuss metadata for
such things as cost accounting, conditions of use, or discipline-specific
descriptive information. The next section briefly explains where to find out
about non-Core elements and how to publicize your own experimental elements.
6. Non-Core Metadata and Creating Your Own Element Names
[ Note to Reviewers: A good search interface to a metadata
dictionary is vital to the success of metadata. It should list
terms, find similar terms, allow easy informal registration of new
elements -- everything that was defined for a semi-controlled
vocabulary. The hardest part is a process of blessing new terms as
official. This entire section is pure fantasy right now. ]
Sometimes you will have metadata but no appropriate Core element Name,
Qualifier, or Value. As a last resort you can simply create and use your own
experimental Names, Qualifiers, and Values -- as mentioned earlier it is not
an error to include elements that not all software or people understand how
to use. The drawback is that no one else understands these parts of your
metadata records, which defeats the goal of internet resource discovery. In
general, before creating your own terms you will want to check if any other
terms in the RMV (Core or non-Core) meet your needs. One way to do so is to
access the URL http://www.ckm.ucsf.edu/rmv.html [XXX not yet] to search
through its database of existing terms.
The RMV contains Core, non-Core, and shared experimental terms (i.e.,
publicized interim terms). Non-Core terms look just like Core terms, but
they cover many fields of knowledge and identify more than simple
descriptive aspects of resources. When two terms are spelled identically,
they are distinguished in the vocabulary by a suffix such as ``.biology'' or
``.music''.
Experimental metadata terms are either Shared Experimental or Private
Experimental. Private Experimental terms may be created and entered into the
RMV database at will by any organization that wishes to sponsor them. The
terms all start with ``x.ORG.'', where ORG is replaced by the organization's
abbreviation. For example, ``x.ucsf.library'' and ``x.ucsf.cafeteria'' might
be campus locations sponsored by UCSF. When a Private term has achieved
sufficient community acceptance, it becomes Shared Experimental, in which
case it loses the organization abbreviation, as in ``x.library'' and
``x.cafeteria''.
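The naming convention for experimental terms could be handled mechanically, as in this sketch. Since this section describes a registry that does not yet exist, everything here is hypothetical, including the function name `promote_to_shared` and the decision to reject names that lack an organization part.

```python
def promote_to_shared(private_term):
    """Turn a Private Experimental name (x.ORG.term) into its Shared form (x.term)."""
    parts = private_term.split(".", 2)
    if len(parts) != 3 or parts[0].lower() != "x" or not parts[1] or not parts[2]:
        raise ValueError("not a Private Experimental term: " + private_term)
    # Drop the organization abbreviation, keep the ``x.'' prefix.
    return "x." + parts[2]
```

For instance, `promote_to_shared("x.ucsf.library")` returns `x.library`.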
7. Glossary
XXX
8. References
* [1] Web http://www.w3.org/
* [2] AltaVista
* [3] Lycos
* [4] Yahoo
* [2] URL http://www.w3.org/pub/WWW/Addressing/
* [3] HTML http://www.w3.org/pub/WWW/MarkUp/
* [4] US Library of Congress Subject Headings (LCSH)
* [5] Medical Subject Headings (MeSH)
* [6] Resource Metadata Vocabulary (RMV) XXX
* [7] OCLC/NCSA Metadata Workshop Report
* [8] MIME
* [9] ISO 639 The Registration Authority for ISO 639 is Infoterm,
Österreichisches Normungsinstitut (ON), Postfach 130, A-1021 Vienna,
Austria.
* [X] UNICODE
* [X] UTF-8
* [X] ASCII