On Mon, 4 Oct 2004, Douglas Campbell wrote:
> Andy,
>
>>>> [log in to unmask] 2/09/04 21:33:19 >>>
> I took an action at the last meeting of the DCMI Usage Board to write up
> some guidelines for assigning identifiers to metadata terms.
>
> I've just re-read this and realised you included "controlled vocabulary
> terms" in the definition of metadata terms. Are you meaning every
> single term in a vocabulary must use a URI?
Yes, that is what I meant - though perhaps using the word 'must' w.r.t.
identifiers for vocabulary terms is a bit strong? (Yes, I've just seen
Pete's response and agree with what he says there).
> The DC Abstract Model differentiates between "syntax encoding schemes"
> (eg. dcterms:W3CDTF) and "vocabulary encoding schemes" (eg.
> dcterms:IMT). Though I wonder about schemes that are a mixture of
> these. You could argue LCSH has a set list of terms (like a vocabulary)
> but those terms are combinable in a multitude of ways (more like in a
> syntax). Isn't that also the case with ISO639-2 - where a language
> value is a syntactic combination of terms from two lists - language
> codes and country codes? This makes it hard to decide which
> vocabularies _can_ have a URI for every term.
Well... I agree that LCSH is an interesting example! :-)
In passing, it is worth noting that in theory one could define a URI for
every date that is representable using W3CDTF (an infinite list) even
though one couldn't actually enumerate all of them. In the case of both
LCSH and ISO639-2, the problem isn't that bad, since the list of
combinations isn't infinite in either case.
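
As a rough illustration of the W3CDTF point, a URI can be assigned to any
date by rule rather than by enumeration. The sketch below is only that: the
base URI http://example.org/w3cdtf/ and the name date_uri are assumptions
made for the example, and DCMI defines nothing of the kind.

```python
import re
from urllib.parse import quote

# Hypothetical base URI, used purely for illustration.
BASE = "http://example.org/w3cdtf/"

# The W3CDTF profile: YYYY, YYYY-MM, YYYY-MM-DD, or a full date followed by
# a time (minutes, optional seconds and fraction) and a mandatory zone.
W3CDTF = re.compile(
    r"\d{4}"
    r"(-(0[1-9]|1[0-2])"
    r"(-(0[1-9]|[12]\d|3[01])"
    r"(T([01]\d|2[0-3]):[0-5]\d(:[0-5]\d(\.\d+)?)?"
    r"(Z|[+-]([01]\d|2[0-3]):[0-5]\d))?"
    r")?)?"
)


def date_uri(value: str) -> str:
    """Return a URI for a W3CDTF string, or raise ValueError if malformed.

    Only the lexical form is checked, so an impossible day such as
    2004-02-31 is not rejected.
    """
    if not W3CDTF.fullmatch(value):
        raise ValueError(f"not a W3CDTF value: {value!r}")
    # Percent-encode anything that is not an unreserved URI character.
    return BASE + quote(value, safe="")


print(date_uri("2004-10-04"))
print(date_uri("2004-10-04T21:33:19Z"))
```

Because fractional seconds may have any number of digits, the set of URIs
this rule can produce has no upper bound, yet no list is ever stored.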
The more interesting questions though (to my mind at least) are:
- In the specific context of LCSH, to what extent does the subject
heading function as an identifier, to what extent does it function as a
label and to what extent does it function as a parsable string? I guess
that this is the fundamental question that you are asking above. And, to
be honest, I'm not sure I know what the answer is.
- More generally, how do we transition our use of controlled vocabularies
from the pre-Internet age to the Internet age?
My personal view is that some of the fundamental design strategies that
have grown up for managing controlled vocabularies in the pre-Internet age
no longer apply - particularly the issue of whether terms should be
assigned 'dumb' or 'intelligent' identifiers.
Here's part of a part-formed email that I sent to an internal UKOLN list
recently, which touches on some of these issues.
--- cut ---
Dewey (I'm using Dewey as an example) comes from a pre-Internet age. As
such, it was reasonable to represent the concepts in Dewey using a label
(the caption) and a non-URI identifier (the class number). Clearly they
couldn't use URIs as their identifiers because URIs didn't exist! More
importantly, it was also sensible to design some level of intelligence
about the concepts into the identifier such that some of the relationships
between any two concepts could be determined just by looking at their
identifiers. The reason that this was sensible was because there was no
readily available mechanism for sharing that intelligence between users of
the concepts (software and people), other than by encoding it into the
identifier itself.
In the Internet age, I am highly skeptical that this is still a sensible
design strategy. In the Internet age, intelligence about the concept
(metadata) can be made available to an application by separately
'resolving' the identifier for the concept in some way (e.g. by obtaining
some RDF about the concept from a terminology service of some kind). The
concept identifiers can thus be completely 'dumb' - i.e. they don't need
to have any parsable structure. In addition of course, the identifiers
should also be URIs - because these are the identifiers of the Internet.
This clean separation between the *identification* of the concept and the
*description* of the concept is highly beneficial in terms of the
persistence of the concept identifiers. For example, as our view of the
relationships between concepts changes (i.e. as our knowledge grows), we
no longer need to change the concept identifier (the class number) as I
think we often do currently.
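
The separation described above can be sketched in a few lines of Python.
Everything here is an assumption made for illustration: the dictionary
stands in for a terminology service, the concept URIs are invented and
deliberately opaque, and the labels and class numbers are sample data, not
an authoritative extract from Dewey.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for a terminology service: 'dumb' URI -> description.
CONCEPTS: Dict[str, Dict[str, Optional[str]]] = {
    "http://example.org/concept/9f2c": {
        "label": "Social sciences", "notation": "300", "broader": None},
    "http://example.org/concept/4b7e": {
        "label": "Economics", "notation": "330",
        "broader": "http://example.org/concept/9f2c"},
    "http://example.org/concept/d015": {
        "label": "Labor economics", "notation": "331",
        "broader": "http://example.org/concept/4b7e"},
}


def describe(uri: str) -> Dict[str, Optional[str]]:
    """'Resolve' a concept URI to a copy of its description."""
    return dict(CONCEPTS[uri])


def broader_chain(uri: str) -> List[Optional[str]]:
    """Labels of all broader concepts, found by lookup rather than parsing."""
    chain: List[Optional[str]] = []
    parent = CONCEPTS[uri]["broader"]
    while parent is not None:
        chain.append(CONCEPTS[parent]["label"])
        parent = CONCEPTS[parent]["broader"]
    return chain


def reclassify(uri: str, new_broader: Optional[str], new_notation: str) -> None:
    """Change the description of a concept; its URI is left untouched."""
    CONCEPTS[uri]["broader"] = new_broader
    CONCEPTS[uri]["notation"] = new_notation


labor = "http://example.org/concept/d015"
print(broader_chain(labor))
# Move the concept directly under the top-level class (illustrative only).
reclassify(labor, "http://example.org/concept/9f2c", "305")
print(broader_chain(labor))
```

The hierarchy is read from the description, so moving a concept changes what
the lookup returns while records that cite the concept URI stay valid.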
If we lived in a world where concepts had 'dumb' URIs, labels and
separately available metadata (i.e. the semantic Web world!) there is no
conflict with the abstract model's use of 'value URIs' and 'value strings'
and no possible confusion with 'syntax encoding schemes' because a
parsable value string like '331.18ENG' *would never* be used.
--- cut ---
Unfortunately, the 'info' URI tends to promote the continued use of
'intelligent' URIs. This is partly because its original rationale was to
give us a way of encoding pre-Internet identifiers as URIs. But also
because the 'info' URI specification allows software to inspect an 'info'
URI and say "ah ha, this is the identifier of a term in Dewey" - at least
as far as I understand it.
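
A minimal sketch of that kind of inspection, assuming the general shape
info:<namespace>/<identifier>. The function name info_namespace is made up
for the example, and the 'ddc' namespace token and sample identifier should
be read as illustrative rather than checked against the 'info' registry.

```python
def info_namespace(uri: str) -> str:
    """Return the namespace component of an 'info' URI.

    Raises ValueError if the string does not look like
    info:<namespace>/<identifier>.
    """
    scheme, colon, rest = uri.partition(":")
    if not colon or scheme.lower() != "info":
        raise ValueError(f"not an 'info' URI: {uri!r}")
    namespace, slash, identifier = rest.partition("/")
    if not namespace or not slash or not identifier:
        raise ValueError(f"missing namespace or identifier: {uri!r}")
    return namespace.lower()


# The namespace is recoverable from the URI alone, without any lookup.
print(info_namespace("info:ddc/22/eng//004.678"))
```

This is exactly the 'intelligent identifier' behaviour: the software learns
something about the term by parsing the URI instead of resolving it.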
On balance, and in the current absence of any URIs for the subject
headings in LCSH (intelligent or otherwise), I think it is best to treat
the LCSH subject heading as a 'value string' and dcterms:LCSH as a
'vocabulary encoding scheme URI' in terms of the abstract model. This
doesn't prevent software applications that have knowledge of LCSH from
parsing the subject heading as they see fit - but doesn't endorse
such activity either.
> I note that vocabulary terms encoded using a URI appear in the Abstract
> Model as a "value URI". I'm not sure if you can still have an
> associated "encoding scheme URI"? If not, does this mean you can't
> distinguish whether a value provided as a value URI is of a particular
> encoding scheme (URI) unless you happen to know that the value URI
> belongs to that encoding scheme?
This has already been answered, but to re-iterate... yes, according to the
abstract model you can have both a value URI and a vocabulary encoding
scheme URI in a statement.
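
The two cases discussed in this message can be written down as a small data
structure. This is a sketch of a statement as described here, not a full
rendering of the abstract model; the class and field names are assumptions,
the subject heading string is illustrative and not checked against LCSH, and
the example.org URIs are invented.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

DC_SUBJECT = "http://purl.org/dc/elements/1.1/subject"
DCTERMS_LCSH = "http://purl.org/dc/terms/LCSH"


@dataclass(frozen=True)
class Statement:
    """One property/value pair; every part of the value is optional."""
    property_uri: str
    value_uri: Optional[str] = None
    vocabulary_encoding_scheme_uri: Optional[str] = None
    value_strings: Tuple[str, ...] = ()


def uses_scheme(stmt: Statement, scheme_uri: str) -> bool:
    """True if the statement explicitly names the given encoding scheme."""
    return stmt.vocabulary_encoding_scheme_uri == scheme_uri


# Case 1: no URI exists for the heading, so it travels as a value string
# qualified by the vocabulary encoding scheme URI.
lcsh_stmt = Statement(
    property_uri=DC_SUBJECT,
    vocabulary_encoding_scheme_uri=DCTERMS_LCSH,
    value_strings=("Metadata--Standards",),  # illustrative heading
)

# Case 2: a value URI and a vocabulary encoding scheme URI side by side.
# Both URIs below are hypothetical.
uri_stmt = Statement(
    property_uri=DC_SUBJECT,
    value_uri="http://example.org/concept/d015",
    vocabulary_encoding_scheme_uri="http://example.org/scheme/example",
    value_strings=("Labor economics",),
)

print(uses_scheme(lcsh_stmt, DCTERMS_LCSH))
print(uses_scheme(uri_stmt, DCTERMS_LCSH))
```

In the second case an application does not have to recognise the value URI
in order to know which scheme it belongs to, because the scheme is stated.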
Andy
--
Distributed Systems, UKOLN, University of Bath, Bath, BA2 7AY, UK
http://www.ukoln.ac.uk/ukoln/staff/a.powell +44 1225 383933
Resource Discovery Network http://www.rdn.ac.uk/