JISCMail - DC-ARCHITECTURE Archives

Subject: Re: Comments on DC-in-XML
From: Pete Johnston
Reply-To: DCMI Architecture Group
Date: Thu, 13 Jul 2006 18:15:22 +0100

Hi Thomas,

> only now I found the time to look closer at the suggestions 
> and the discussions around it.
> 
> Overall I found the DC-XML paper very clear and useful, I 
> think I could implement this syntax fairly easily. I still 
> have some questions and remarks, though:

Thanks for the comments!
 
> 1. Like Ann, I find the introduction of "vocabEncSchemeURI" 
> and "vocabEncSchemeQName" not convincing. Even considering 
> Pete's remarks from June 12, I think one could get away with 
> something like "use the full URI if any ambiguity could 
> arise". Otherwise I find it pretty odd to introduce two 
> elements which differ only in the encoding used.

I stick to my position that, if we are going to support the convention
of abbreviating URIs as XML QNames (or as some other qualified name
form), we must be clear about whether a piece of XML content is to be
interpreted as a QName (or as some other qualified name form) or as a
URI. We cannot have a format with a single attribute where an
application is left to guess which way the attribute value is to be
interpreted. As human readers, we are accustomed to reading a string
and, based on our expectations about the form of the string in some
context, deciding that it is a URI or a QName, but an application needs
that information explicitly.

Because there is an overlap between the "lexical space" of the anyURI
and QName datatypes, an application cannot obtain that information from
the string itself.

In my earlier reply to Ann, I gave the example of a value with no
prefix, e.g.

< .... dcx:someAtt="name" >

How does an application decide whether the value of that attribute is to
be interpreted as

(a) a relative URI to be resolved relative to the base URI in scope; or

(b) an unprefixed XML QName, which would map to an expanded name using
the namespace declaration for the default namespace (and from that to
a URI, if the DC-XML format specified that the mapping applied for that
QName)?

The two interpretations would result in completely different URIs, but -
unless we specify a single datatype for the attribute value - the
application can't know which to apply. 

When I made that reply, I struggled to come up with a real example of a
URI scheme where the absolute URI corresponded to an XML QName, but
consider

< .... dcx:someAtt="news:comp.infosystems.www.servers.unix" > 

How does an application decide whether the value of that attribute is to
be interpreted as

(a) a URI, using the news URI scheme; or

(b) an XML QName using the prefix "news" and the local part
"comp.infosystems.www.servers.unix"?

Again, without additional explicit information about the datatype, the
application can't know how to interpret that string.
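
To make the two cases above concrete, here is a minimal Python sketch
(an illustration only, not part of the DC-XML draft). It reads each
example value both ways: as a URI reference resolved with the standard
urllib.parse.urljoin, and as a QName mapped to a URI. The base URI, the
namespace declarations, the helper names as_uri and as_qname, the
simplified ASCII-only QName pattern, and the rule of concatenating
namespace name and local part are all assumptions made for the example.

```python
import re
from urllib.parse import urljoin

# Simplified, ASCII-only approximation of the XML NCName production
# (the real production also allows many non-ASCII characters).
NCNAME = r"[A-Za-z_][A-Za-z0-9_.\-]*"
QNAME_RE = re.compile(rf"(?:({NCNAME}):)?({NCNAME})")


def as_uri(value, base):
    """Read the value as a URI reference and resolve it against the base URI."""
    return urljoin(base, value)


def as_qname(value, namespaces):
    """Read the value as a QName and map it to a URI, or return None.

    Assumption for this sketch: the URI is the namespace name
    concatenated with the local part.
    """
    match = QNAME_RE.fullmatch(value)
    if match is None:
        return None
    prefix, local = match.group(1), match.group(2)
    namespace = namespaces.get(prefix or "")
    return None if namespace is None else namespace + local


# Hypothetical base URI and namespace declarations in scope.
BASE = "http://example.org/docs/record.xml"
NAMESPACES = {
    "": "http://example.org/default#",
    "news": "http://example.org/news#",
}

for value in ("name", "news:comp.infosystems.www.servers.unix"):
    print(value)
    print("  as URI:  ", as_uri(value, BASE))
    print("  as QName:", as_qname(value, NAMESPACES))
```

Both readings succeed for both strings and produce different URIs, so
nothing in the string itself tells the application which reading was
intended.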

[Aside: I still have some doubts whether such an abbreviation
convention for URIs is _necessary_ in DC-XML, and I think we could get
by fine with representing URIs as URIs (and using features of XML like
xml:base and XML entity references where necessary). But having said
that, I'm conscious that historically, syntaxes for DC metadata have
used such abbreviations and human readers/writers of DC metadata have
become accustomed to using QName (or other qualified name) forms for
URIs. So on that basis, I guess we should try to offer such features in
this syntax.]

> 2. I am wondering about the difference between attributes and 
> values. In my understanding, an attribute serves to interpret 
> or understand the meaning of the value, like an encoding 
> scheme or a language tag.

That is a convention sometimes used in the design of XML formats, but
there is nothing in the XML specification to support that, and it is not
followed in all XML formats. 

In the context of "document-oriented" XML formats, yes, there is often a
convention of using attribute values for data which is considered to be
"not part of the document content", particularly where the format is
being used as a "markup language", in the classic sense, i.e. to
"annotate" some pre-existing text. And e.g. in HTML that has extended to
the rule of thumb that the main content - the text the user sees
displayed - goes in element content, and attribute values are reserved
for some sort of "qualifying data" that doesn't get displayed, but
conditions the processing of the element content.

But as far as the XML InfoSet is concerned - and I really think that is
how we need to think about XML documents, rather than as streams of tags
and angle brackets (ideally we would write the DC-XML spec in terms of
the XML InfoSet, I think) - XML elements and XML attributes are just
nodes in a tree information structure. There is no fixed "semantic"
relationship between an attribute value and the content of the parent
XML element. 

In terms of representing a data structure, XML makes no "semantic"
distinction between

<dog>
<name>Rover</name>
<colour>Black</colour>
</dog>

Or

<dog name="Rover" colour="Black"/>

Or

<abcxyz type="dog" name="Rover" colour="Black"/>

So the sort of "rules of thumb" that might make sense for
"document-oriented" formats are much harder to maintain for
"data-oriented" XML formats, I think. Essentially, wherever a format
designer chose to use an XML attribute, an XML child element could have
been used. The reverse is not true, because attributes of the same name
cannot be repeated on a single XML element and because there is no
ordering of XML attributes (and obviously attribute values are "atomic",
so if the child element itself has child elements, that sub-tree
structure can't be captured in an attribute value).
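
As a rough illustration of that point, the Python sketch below (using
the standard xml.etree.ElementTree module) reads the three "dog" forms
into the same record, and then shows that a repeated attribute is
rejected by the parser as not well-formed. The three reader functions
and the record layout are invented for this example; each one encodes
an application-level convention, which is exactly what XML itself does
not supply.

```python
import xml.etree.ElementTree as ET


def record_from_children(xml_text):
    """Element name gives the type; child elements give the fields."""
    elem = ET.fromstring(xml_text)
    return {"type": elem.tag, **{child.tag: child.text for child in elem}}


def record_from_attributes(xml_text):
    """Element name gives the type; attributes give the fields."""
    elem = ET.fromstring(xml_text)
    return {"type": elem.tag, **elem.attrib}


def record_from_generic(xml_text):
    """Element name is arbitrary; attributes give the type and the fields."""
    return dict(ET.fromstring(xml_text).attrib)


a = record_from_children("<dog><name>Rover</name><colour>Black</colour></dog>")
b = record_from_attributes('<dog name="Rover" colour="Black"/>')
c = record_from_generic('<abcxyz type="dog" name="Rover" colour="Black"/>')
print(a == b == c)

# An attribute name cannot be repeated on one element: the document is
# not well-formed, so the parser refuses it.
duplicate_rejected = False
try:
    ET.fromstring('<dog name="Rover" name="Rex"/>')
except ET.ParseError as err:
    duplicate_rejected = True
    print("not well-formed:", err)
```

The same data comes out of all three forms only because each reader
applies its own mapping; the repeated-attribute case is the one place
where XML enforces a constraint for you.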

(Another factor that may be considered here is what capabilities are
available in different XML schema languages and/or query languages in
terms of what those languages allow you to say about elements and
attributes.) 

> I am unhappy if the actual property 
> is put into the attribute - but admittedly don't know any 
> rules against it.

OK, noted. And I am conscious that such a convention would be a change
from previous XML formats for representing DC metadata, but I think
there is also an argument for adopting greater consistency in
representing the different "classes" of URI in a DC metadata description
set. 
 
> This refers in particular to the use of "dcx:valueURI" as 
> opposed to "dcx:valueString": both conveying the same 
> meaning, one in an attribute, the other in a tag.

I would have been quite happy to make dcx:valueURI a child XML element
of the Statement Element rather than an attribute of the Statement
Element. The reasons for choosing to represent value URIs and value
strings in different ways were based on the different constraints in the
"structural model" of a DC metadata description set specified by the
DCMI Abstract Model (not by the DC-XML format). 

i.e. according to the DCAM:

(a) a single statement can have only a single value URI (whereas it can
have multiple value strings);
(b) each value string can be associated with a language tag or a syntax
encoding scheme URI.

So these constraints on the "abstract information structure" meant that
it was _possible_ to represent the value URI as an attribute of the
Statement Element (no need to repeat, no sub-structure), whereas value
strings could not be represented as attributes of a single XML element
(they have to be repeatable and they do have sub-structure (in the sense
that the string may be associated with a lang tag or a SES URI)).

But it isn't _necessary_ to represent the value URI as an attribute of
the Statement Element i.e. we could equally well choose to use

   <dc:publisher>
       <dcx:valueURI>http://example.org/agents/DCMI</dcx:valueURI>
       <dcx:valueString>Dublin Core Metadata Initiative</dcx:valueString>
       <dcx:valueString>DCMI</dcx:valueString>
   </dc:publisher>

i.e. it would be fine to represent both value URIs and value strings as
child elements of the Statement Element. 

And we could even extend that to property URIs

   <dcx:statement>
       <dcx:propertyURI>http://purl.org/dc/elements/1.1/publisher</dcx:propertyURI>
       <dcx:valueURI>http://example.org/agents/DCMI</dcx:valueURI>
       <dcx:valueString>Dublin Core Metadata Initiative</dcx:valueString>
       <dcx:valueString>DCMI</dcx:valueString>
   </dcx:statement>


> The 
> attribute approach has some additional disadvantages if 
> multiple values occur: do you repeat the attribute or the element?
> For example:
>   <dc:publisher dcx:valueURI="http://example.org/agents/DCMI">
>       <dcx:valueString>Dublin Core Metadata 
> Initiative</dcx:valueString>
>   </dc:publisher>
> If I want to give
> 	dcx:valueURI="http://dublincore.org/"
> as an additional value, where do I put it? Or mustn't I?

No, you mustn't. ;-)

A single statement has only one value URI. That is a constraint
specified by the DCAM, not introduced by DC-XML. So even if we moved to
a child element approach in DC-XML (as above), the format would still
allow only one dcx:valueURI child element, but multiple dcx:valueString
child elements. (Arguably that's a reason for sticking with
dcx:valueURI as an attribute: XML has that constraint built in for
attributes, if you like. For child elements you have to put it in a
schema or elsewhere.)
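
As a sketch of what "elsewhere" might look like for the child element
approach, the Python below checks the one-value-URI constraint in
application code instead of a schema. It is only an illustration: the
function name read_statement is invented, and the namespace name bound
to the dcx prefix is a placeholder, not the one used by the DC-XML
draft.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace name for the dcx prefix (an assumption for this sketch).
DCX = "http://example.org/dcx/"


def read_statement(stmt):
    """Return (value URI or None, list of value strings) for one Statement Element.

    Raises ValueError if more than one dcx:valueURI child is present.
    """
    uri_elems = stmt.findall(f"{{{DCX}}}valueURI")
    if len(uri_elems) > 1:
        raise ValueError("a statement may have at most one value URI")
    value_uri = (uri_elems[0].text or "").strip() if uri_elems else None
    value_strings = [e.text or "" for e in stmt.findall(f"{{{DCX}}}valueString")]
    return value_uri, value_strings


GOOD = f"""<dc:publisher xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcx="{DCX}">
  <dcx:valueURI>http://example.org/agents/DCMI</dcx:valueURI>
  <dcx:valueString>Dublin Core Metadata Initiative</dcx:valueString>
  <dcx:valueString>DCMI</dcx:valueString>
</dc:publisher>"""

# Same statement with a second dcx:valueURI child added.
BAD = GOOD.replace(
    "</dc:publisher>",
    "  <dcx:valueURI>http://dublincore.org/</dcx:valueURI>\n</dc:publisher>",
)

print(read_statement(ET.fromstring(GOOD)))

bad_rejected = False
try:
    read_statement(ET.fromstring(BAD))
except ValueError as err:
    bad_rejected = True
    print("rejected:", err)
```

With the attribute form, the second case could never reach the
application at all, because the parser would reject the repeated
attribute.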

You could make a second statement using the dc:publisher property and
the value URI http://dublincore.org/ . 

That in itself would not establish that the URI
http://example.org/agents/DCMI and the URI http://dublincore.org/ both
identified the same agent.

Alternatively you could create a "related description" of the agent and
specify there that both were identifiers for the same agent.
 
> Furthermore, if I want to express that some valueURI is to be 
> interpreted according to a particular encoding scheme, like
> 	dcx:valueURI="http://example.org/standards/DDC/500"
> 	dcx:vocabEncSchemeURI="http://purl.org/dc/terms/DDC"
> I end up with two attributes, one referring to the other. 
> This can become quite ambiguous. Or is this ruled out and 
> supposed to be encoded (somehow?) in the valueURI?

I don't see any ambiguity here ;-)

That XML structure has to be interpreted in terms of the DC-XML
specification, and in terms of the DCMI Abstract Model on which the
DC-XML specification is based.

The DC-XML specification tells me how to interpret a DC-XML document as a
"DC description set": it tells me to interpret those XML attributes as
providing the "value URI" and the "vocabulary encoding scheme URI" for a
single "statement" (in which the "property URI" is obtained from the
name of the XML element).

The DC-XML specification doesn't tell me what a "value URI" or a
"vocabulary encoding scheme URI" or a "statement" is, or "means". That's
the job of the DCAM document. The DCAM document tells me what those
constructs "say" about things in the world, i.e. that the statement
expresses a relationship between two resources, that the value URI
identifies one of those resources, and that the vocabulary encoding
scheme URI tells me about the type of the value resource.
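
A minimal Python sketch of that first step, reading the two attributes
as parts of one statement, might look like this. Several things here
are assumptions for the sake of the example: the choice of dc:subject
as the Statement Element, the placeholder namespace name for the dcx
prefix, the function name statement_from_element, and the rule that the
property URI is the element's namespace name concatenated with its
local name.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace name for the dcx prefix (an assumption for this sketch).
DCX = "http://example.org/dcx/"

DOC = f"""
<dc:subject xmlns:dc="http://purl.org/dc/elements/1.1/"
            xmlns:dcx="{DCX}"
            dcx:valueURI="http://example.org/standards/DDC/500"
            dcx:vocabEncSchemeURI="http://purl.org/dc/terms/DDC"/>
""".strip()


def statement_from_element(elem):
    """Interpret one Statement Element as a flat description of a statement."""
    # ElementTree reports a namespaced element name as "{namespace}local".
    namespace, _, local = elem.tag[1:].partition("}")
    return {
        "property_uri": namespace + local,
        "value_uri": elem.get(f"{{{DCX}}}valueURI"),
        "vocab_encoding_scheme_uri": elem.get(f"{{{DCX}}}vocabEncSchemeURI"),
    }


statement = statement_from_element(ET.fromstring(DOC))
for key, value in statement.items():
    print(f"{key}: {value}")
```

Neither attribute refers to the other: both are read independently as
components of the same statement, and what those components mean is
left to the DCAM.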
 
> Similar problems arise when I want to give additional 
> information for a binaryRepresentation 
> <dcx:binaryRepresentation 
> dcx:representationURI="http://example.org/imgs/img.png" /> , 
> like a MIME type. Where to put it?
> The paper states "vocabulary encoding scheme URI ... is 
> represented as the value of an XML attribute of the Statement 
> Element", but this again would lead to attributes referring 
> to one another.
> I would like something like
> 
> <dcx:binaryRepresentation>
>   <dcx:representationURI dcx:vocabEncSchemeQName="MIME:image/png">
>     http://example.org/imgs/img.png"
>   </dcx:representationURI>
> </dcx:binaryRepresentation>

According to the DCAM, a vocabulary encoding scheme URI identifies the
type of the value, so it would not be used to provide the MIME type for
a rich representation.

The DCAM does not currently support the notion that a rich
representation should be associated with a MIME type. I have argued that
that is probably an omission in the DCAM and that we should consider
amending the DCAM to include it. See

http://dublincore.org/architecturewiki/AMIssues

In an earlier draft of DC-XML, I included a construct to do exactly
this, but I removed it from the draft that was circulated because it had
no mapping to a DCAM construct. 

So I think really this is an issue for the DCAM, rather than for this
format. Essentially the DCAM leaves the specification of a MIME type for
a rich representation outside the scope of a DC metadata description
set. 
 
> (Actually, the simple juxtaposition in DC-Text may also 
> become ambiguous. I would prefer something like
>     Statement (
>       PropertyURI ( dc:subject )
>       ValueString ( "Information technology"
>         VocabularyEncodingSchemeURI ( dcterms:LCSH )
> 	)
>     )
> over
>     Statement (
>       PropertyURI ( dc:subject )
>       VocabularyEncodingSchemeURI ( dcterms:LCSH )
>       ValueString ( "Information technology")
>     )
> Sorry for the overload of parentheses!)

Ah, no. Your example here associates the Vocabulary Encoding Scheme URI
with a single Value String. 

But in the DCAM, the Vocabulary Encoding Scheme URI is _not_ associated
with a single Value String. It is associated with the Statement as a
whole, and it provides the type of the Value: it does not provide an
interpretation for any particular Value String.
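
One way to picture where each piece attaches is the small Python data
model below. It is my own sketch of the structure described in this
message, not a normative rendering of the DCAM, and the class and field
names are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ValueString:
    """A value string, with the qualifiers that belong to the string itself."""
    text: str
    language: Optional[str] = None
    syntax_encoding_scheme_uri: Optional[str] = None


@dataclass
class Statement:
    """A statement: one property URI, at most one value URI, any number of strings."""
    property_uri: str
    value_uri: Optional[str] = None
    # Attached to the statement as a whole: it gives the type of the value.
    vocab_encoding_scheme_uri: Optional[str] = None
    value_strings: List[ValueString] = field(default_factory=list)


subject = Statement(
    property_uri="http://purl.org/dc/elements/1.1/subject",
    vocab_encoding_scheme_uri="http://purl.org/dc/terms/LCSH",
    value_strings=[ValueString("Information technology")],
)
print(subject)
```

In this layout there is simply no slot on ValueString for a vocabulary
encoding scheme URI, which is why the nested form in the first DC-Text
example does not correspond to the model.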

> Anyway, thanks for the good work!

Thanks! ;-)

Pete