Dear all,
Many thanks to Philip for opening a discussion on the use of numbers
as tokens that represent metadata elements in a language-neutral way
(see email below). I have been looking back through old email and
meeting notes to remind myself what we have said about this in the past.
There are many dimensions to this question; where to begin?
Let me first put this into context. The issue
in question is Point 5 of the position paper at
http://purl.org/dc/documents/working_drafts/wd-i18n-current.htm, which
currently reads:
5. As defined in the RFC for Unqualified Dublin Core [RFC],
Dublin Core elements consist of a descriptive name (eg, "Author
or Creator"), a single-word label or "token" for use in encoding
schemes (eg, "Creator"), and element definitions. The descriptive
names and element definitions are meant to be read primarily
by humans, the tokens primarily by machines. The tokens look
like English words but stand for universal elements. Universal
elements can have interchangeable names and definitions in
multiple languages.
In other words, we felt that the machine-readable tokens for Dublin Core
metadata should be words like "Title" and "Creator". We also considered
using numbers to represent these concepts, but we realized that using
numbers implied the creation and maintenance of a controlled list of
numbers. Since the numbers would not be as readable as English words,
we felt that they could introduce another source of potential errors.
This was based on the assumption, now open to reconsideration, that at
least some of the metadata would be read and debugged by hand.
In projects at Washington State and with the Russian Federation, Philip
has been advocating the use of "Z numbers" that stand for the attributes
of the Bib-1 attribute set for Z39.50 (see below). He hesitates to
recommend that his Russian colleagues, for example, mix English tokens
with Cyrillic -- e.g. using "Creator" for top-level concepts, say, and
"Fotograf" (in Cyrillic) for more specific qualifiers or elements.
(I trust Philip will correct me if this is a bad example.)
If I understand his position correctly, Philip is recommending that words
in English or Russian be used as local tags and that universal numbers
for a global set of semantics be layered on top of (i.e., in addition to)
these local terms. He wonders whether we might reevaluate the position
as stated above and come to a more general agreement about the use of
numbers as tokens that stand for element semantics across languages.
I believe he pictures something like a master namespace, which would
have a defined list of primary data element concepts (DECs), each with
an assigned token. Metadata server administrators would map their local
elements to these DECs.
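To make the layering concrete, here is a minimal sketch in Python of how an administrator might map local element names onto shared numeric tokens. It is only an illustration under stated assumptions: the table name LOCAL_TO_TOKEN, the helper functions, and the Russian labels are hypothetical. The token 4 for "title" follows the Z="4" example in Philip's message below; the token 1003 for "creator" is a placeholder I assume for illustration and would need to be checked against the registered attribute set.

```python
# Hypothetical sketch: local element names layered over shared numeric tokens.
# Token 4 (title) follows the Z="4" example quoted later in this message.
# Token 1003 (creator) is an assumed placeholder, not a verified assignment.
LOCAL_TO_TOKEN = {
    "en": {"Title": 4, "Creator": 1003},
    "ru": {"Заглавие": 4, "Создатель": 1003},  # illustrative labels only
}


def to_token(lang: str, local_name: str) -> int:
    """Return the shared numeric token for a local element name."""
    return LOCAL_TO_TOKEN[lang][local_name]


def translate(local_name: str, source_lang: str, target_lang: str) -> str:
    """Find the target-language label that shares the same numeric token."""
    token = to_token(source_lang, local_name)
    for name, candidate in LOCAL_TO_TOKEN[target_lang].items():
        if candidate == token:
            return name
    raise KeyError(f"no {target_lang} label for token {token}")
```

With a table like this, a search front end could accept "Заглавие" from a Russian user and "Title" from an English user and issue the same query on token 4; neither side would need to see the other's labels.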
This touches on an issue that goes all the way back to our first
discussions at the DC4 workshop in Canberra, where we pictured that
local tokens in a local language would be paralleled by universal
global tokens: "For a Dublin Core in Thai, one would ideally like
to have a framework for defining two parallel sets of tokens for
qualifiers: a set of local tokens expressed in the Thai language
and alphabet, and a set of matching universal tokens for those
qualifiers that were shared with other Dublin Cores." [From:
http://purl.org/DC/groups/languages/wg-language-mr-19970303.htm]
The practical position that we later developed (motivated by concerns
of readability) acknowledged, but left unsolved, the question of how
the use of English words would scale to hundreds of concepts. We also
recognized, but did not answer, the question of how developers of schemas
customized for particular disciplinary or language communities would
retrospectively bring their local elements into alignment with a growing
set of internationally approved elements. I very much agree with Philip
that this is a good time to reconsider the possibilities.
In a posting to dc-general on 27 April, Bernard Eversberg points out
one problem:
>Isn't it quite curious that all those tags are formulated in natural
>language, as if to enable human readers to understand them. That's not
>the primary objective. No one is supposed to write these labels by hand
>either, it is all supposed to happen through the interpretation of
>software front ends. As that happens, human-readability becomes
>meaningless, a liability rather than an asset.
>Those natural language labels not only make metadata ever uglier and
>bulkier (you don't see the data any more among all the tagging), they can
>even, as G.H. observes, become unintelligible for robots. I repeat what I
>said several times: Scrap all this, use abstract, well-defined notations
>like MARC.
...
>MARC is cryptic? It is not for human readers either. Intelligent front ends
>that already *exist* can conceal it from everybody. They can even, on
>the surface, put up labels like "DC.Creator.PersonalName" for you. But
>then store this as "100".
...
>And BTW: DC-Labels are not natural language. They are English. For us, this
>language is quite unnatural ;-) Numbers are neutral.
There are other multilingual systems that use numbers. For example,
I believe the interlingual index of the EuroWordNet project uses numbers
to stand for concepts that are shared by ontologies of Italian, English,
Dutch, and Spanish (e.g. "01461581" for "tiger cat; felis tigrina").
If we were to adopt numbers, I do not believe we should try to come up
with one master namespace. Rather, we should define multiple namespaces
for number sets, maintained by various communities for various purposes,
and specifically reference them from our metadata (as XML does for
RDF schemas). For example, the elements of Dublin Core are in theory
maintained only by the Dublin Core community; however, they have been
given numbers and semantic glosses in other systems (e.g. Z="1097"
in http://www.gils.net/elements.html). To me, the multiplication of
(in effect) canonical sources for Dublin Core semantics raises larger
issues of control and versioning.
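As a rough sketch of what namespace-qualified numbers might look like, the Python below treats a token as a (namespace, number) pair and resolves it against whichever registry that namespace names. Everything here is hypothetical: the example.org URIs, the registry contents, and the function names are assumptions made for illustration, not existing registries or assignments.

```python
from typing import NamedTuple


class Token(NamedTuple):
    namespace: str  # URI naming the community that maintains the number set
    number: int     # number assigned within that namespace


# Hypothetical registries: the same number may mean different things in
# different namespaces, and one concept may carry different numbers.
REGISTRIES = {
    "http://example.org/ns/set-a": {4: "title"},
    "http://example.org/ns/set-b": {4: "abstract", 17: "title"},
}


def resolve(token: Token) -> str:
    """Look up the concept a token stands for in its own namespace."""
    try:
        return REGISTRIES[token.namespace][token.number]
    except KeyError:
        raise LookupError(
            f"unknown token {token.number} in {token.namespace}"
        ) from None


def same_concept(a: Token, b: Token) -> bool:
    """True if two tokens from any namespaces resolve to the same concept."""
    return resolve(a) == resolve(b)
```

The point of the pairing is that a bare "4" says nothing until we know whose list it comes from; it also makes visible which community is responsible for a given number when versions diverge.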
The ultimate goal of our discussion now should be to come up with a
clearly stated position with which we can replace Point 5 above.
Tom
P.S. I have renamed this thread "POINT 5: Tokens" to avoid confusion
as we reconsider other points of the position paper.
------------
Philip Coombs wrote:
>The Washington State Library is exploring the use of numeric tokens to
>allow semantic interoperability of metadata. We have monitored the
>discussions on the dc-international listserv. We are interested in your
>reaction to our concepts.
>
>Washington State has invested in the harvested-metatag process of
>building a searchable index database. We have adopted attribute terms
>from both Z39.50 and Dublin Core. This mixed set presents a challenge
>to assigning a single scheme for our metadata. We do not wish to
>establish our attribute set as a unique registered namespace.
>
>The Washington State model has been adopted by a growing number of
>states in the US. The common issue is of interoperability. "Will the
>WAGILS attributes work with Dublin Core or other registered sets?"
>
>Since the underlying semantic meaning of our terms is equivalent and can
>be mapped to other registered sets, we looked for a way the terms could
>be "internally mapped" in the metatags. That is when we discovered the
>good effort your workgroup had made using tokens.
>
>Our concern was the mapping of terms to other languages, especially
>those using other than ASCII characters. For example, in June the
>Washington State Library will be assisting the Russian Federation with
>their search services and must address Cyrillic metadata.
>
>We propose using a numerical token rather than a textual value. Here
>are some examples in HTML and XML:
>
><META NAME="dc.title" CONTENT="The name of the object" Z="4">
>
><?xml version="1.0"?>
><!DOCTYPE report SYSTEM "report.dtd">
><report>
>  <description>
>    <title>The name of the object
>      <Z>4</Z></title>
>  </description>
>  <text_of_the_report>
>    XXXXXXXXXXXXXX
>  </text_of_the_report>
></report>
>
>(This declaration could probably also be expressed using attributes
>under each metadata element)
>
>The registered scheme we use for the "Z" numbers is the bib-1 attribute
>set. The bib-1 scheme includes the Dublin Core set as well as a few
>others. Reference: http://www.gils.net/elements.html and
>http://lcweb.loc.gov/z3950/agency/defns/bib1.html.
>
>Several vendors have expressed their interest in supporting tokens for
>metadata creation (e.g., embedded HTML metatag / XML metadata),
>harvesting (e.g., spider parsing), and fielded searching (e.g., GUI
>search using local terms but internally mapped to the token). Because
>the tag numbers from Z39.50 are used, it opens up possibilities of HTTP
>interoperability with Z39.50 query and retrieval rules.
>
>Obviously, much work remains before it could be operational. Your
>comments and recommendations are greatly appreciated.
>
>
>Philip Coombs, Project Director
>GILS-IMLS Project
>Washington State Library
>360.704.5279
_______________________________________________________________________________
Dr. Thomas Baker
GMD - German National Research Center for Information Technology GmbH
ERCIM - European Research Consortium for Informatics and Mathematics
DCML - Working Group on Dublin Core in Multiple Languages
http://purl.org/DC/groups/languages.htm
http://www.dlib.org/dlib/december98/12baker.html
http://www.ercim.org/publication/ws-proceedings/EU-NSF/metadata.html
Personal : c/o FES, GPO Box 2781, Bangkok 10501, Thailand
Work : c/o GMD, Schloss Birlinghoven, 53754 Sankt Augustin, Germany
Home (11-12 hrs ahead of USA) : +66-2-300-3434
Fax ("for Tom Baker"), Bangkok : +66-2-246-7030, voice: +66-2-246-7013
Office at GMD (August 1999+) : +49-2241-14-2566