JISCMail - DC-GENERAL Archives

On Thu, 20 Feb 1997, Martin J. Duerst wrote:
> [Misha's stuff about using &lt;, &amp; and &gt; deleted]
>
> It does not have to be so explicit. It is enough to say that if
> the text is put in HTML into attribute values, the necessary
> conventions have to be observed for RCDATA.

I think it _does_ have to be that explicit as we want to encourage people
to use Dublin Core and not say, "huh?" when they read the I-D/RFC/whatever
and find it starts talking about SGML stuff like RCDATA.  Lots of people
know about the character entities because they're used to using them in
HTML documents.  I would guess that not so many have even heard of RCDATA,
let alone know what it is.
 
> I would strongly advise to fully think over character encoding and
> international issues for DC metadata at this stage. Nudging can turn
> out lucky or not. In the three representations discussed
> (sorry, I forgot their names; in the following I call them
> Abstract, Intermediate, and HTML), the middle one is particularly
> unclear.

It is for me as well, mainly because I can only see two stages for DC; the
abstract concepts and the concrete representations.
 
> For the abstract, logical representation, which does not
> exist inside computers, it's okay to assume any characters can be
> used.
 
Yep, agreed.

> For HTML, the issue is also very clear (at least to me :-).

Ditto for me; I'd be more than happy with ISO10646/Unicode 2.0 (though
will a certain Japanese gentleman who's currently engaging in a very
entertaining flamefest on the URI lists want this to be in ASCII or some
other ISO charset?? :-) :-) :-) ).

> The big question is how to treat these issues on the Intermediate
> level.

Right, this is where I lose you, Martin; I don't know what this
"Intermediate level" is.  To my mind we have an abstract form for a set of DC
metadata (the "in my head" form) and lots of concrete representations (the
"on my paper, in my computer or on my wire" form) of which embedding in
HTML is just one. Each concrete representation should go for maximum
encoding power within the constraints of its particular syntax.  Writing
DC on paper _is_ a concrete representation in my mind. 

> What are the other places (except HTML) where it would go?

SGML documents in other DTDs (with some designed for carrying DC), XML,
PostScript, LaTeX, DVI, MS-Word documents (OK, maybe not MS-Word documents
- they'll use BillyCore :-) ), graphics files, IAFA templates, WHOIS++
templates, X.500 directories, dead trees, etc, etc.  You name it.  Formats
without end, let 1000 flowers bloom and all that.  But getting DC into
HTML documents is the most important in the near term for "saving the
world" (IMHO anyway).

> Would it exist all on and by its own? 

Sometimes yes, sometimes no.  I think each new concrete representation
should have its own specification document.  Some data formats just won't
be able to handle embedding any metadata within them (legacy and
proprietary formats like MPX files for example).  The reason that the
"embedding DC in HTML 2.0/3.2 concrete representation" is getting so much
attention and often being lumped with the specification of the semantics
of the abstract DC elements is that it is the concrete representation that
we need badly _now_.  It just happens to be around at the same time that
we're doing the abstract bits of DC (deciding what the element names are,
what qualifiers we need to allow more precise interpretation of element
value semantics, etc). It's also quite a good example of how to squeeze DC
metadata inside an existing data format and so it makes sense to have it
in the initial (set of) document(s).

> Would there be some software parts (libraries,...) that
> would have to work on this level?

Yep; I'm already working on a Perl module that will suck the DC metadata
out of HTML documents and put it in a nice data structure.  More on that
soon.  I assume others are doing lots of similar stuff.  I might well do
DC in LaTeX, DVI and/or PostScript at some point as there's a lot of that
lying about on the web and I use it.
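
To give a flavour of what such an extractor does, here is a minimal sketch
in Python rather than Perl.  It is not the module mentioned above.  It
assumes the metadata sits in META tags whose NAME starts with "DC."; that
naming convention, the class name DublinCoreExtractor and the function
extract_dc are all assumptions made for illustration only.

```python
from collections import defaultdict
from html.parser import HTMLParser


class DublinCoreExtractor(HTMLParser):
    """Collect META tags whose NAME starts with "DC." (assumed convention)."""

    def __init__(self):
        super().__init__()
        # Element name -> list of values, since an element may be repeated.
        self.elements = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name") or ""
        content = attr_map.get("content")
        if name.lower().startswith("dc.") and content is not None:
            # Strip the "DC." prefix and normalise the element name.
            self.elements[name[3:].lower()].append(content)


def extract_dc(html_text):
    """Return a plain dict mapping DC element names to lists of values."""
    extractor = DublinCoreExtractor()
    extractor.feed(html_text)
    extractor.close()
    return dict(extractor.elements)


sample = """<HTML><HEAD>
<TITLE>Example</TITLE>
<META NAME="DC.title" CONTENT="An Example Report">
<META NAME="DC.creator" CONTENT="A. N. Author">
<META NAME="DC.creator" CONTENT="Another Author">
<META NAME="keywords" CONTENT="not Dublin Core">
</HEAD><BODY></BODY></HTML>"""

metadata = extract_dc(sample)
print(metadata)
```

Keeping a list per element is a deliberate choice in this sketch, so that
repeated elements (two creators, say) are not silently overwritten.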

> > (and say a default language of International English).
> 
> I know a lot of people that would prefer this to be left unspecified.
> IE is a de-facto default, it doesn't need any more encouraging.

Well, seeing as we agree that it is already the de facto default, it can't
really hurt to document the fact to avoid any possible future confusion. 
Knowing what language something is in could really help out in the future
with automated translation, speech synthesis, etc.  The alternative is to
have the user agent decide what the default language is for metadata that
doesn't have an explicit language tag using the locale, user preferences
or context analysis (as suggested in p8 of RFC2070 for HTML I18N).

If the majority of the unmarked pages are going to be in International
English anyway (because it is the de facto standard), then each of the
alternatives has a problem:

a) The metadata will be spoken/translated/etc incorrectly if the locale
   is not one where English is the default,
or
b) The user will have to intervene continually (probably with most of
   them leaving it on International English anyway to get the majority
   of the metadata handled correctly),
or
c) We will need some damn hot context analysis software (which would mean
   that the LANGUAGE qualifier in DC and the LANG attribute in I18N-HTML
   would be superfluous as it could already distinguish what language any
   particular bit of text was written in).

Having a known default of International English just seems to be the
logical, sensible course to me.
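
In software terms a documented default is about as simple as it gets.  The
sketch below (Python, purely illustrative) assumes the qualifiers have
already been pulled out into a dictionary and that International English
is written as the language tag "en"; the names DEFAULT_LANGUAGE and
resolve_language are hypothetical.

```python
# Assumed tag for the documented default (International English).
DEFAULT_LANGUAGE = "en"


def resolve_language(qualifiers, default=DEFAULT_LANGUAGE):
    """Return the explicit LANGUAGE qualifier if present, else the default."""
    for name, value in qualifiers.items():
        if name.upper() == "LANGUAGE" and value.strip():
            return value.strip()
    # No explicit tag: fall back to the documented default rather than
    # guessing from the locale, user preferences or context analysis.
    return default


print(resolve_language({"LANGUAGE": "ja"}))   # explicit tag wins
print(resolve_language({"SCHEME": "ISBN"}))   # no tag, so the default
```

The point of the sketch is that the result does not depend on where the
user agent happens to be running.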
 
> > I think we originally said that we'd leave a space between any bracketed
> > qualifier pairs and the real value, though Dave also suggested the %
> > escaping as well (see 
> > <URL:http://www.roads.lut.ac.uk/lists/meta2/0132.html> and the ensuing
> > discussion).  Looking back now, I think that ISO & escapes are far better,
> > partly because lots of people are used to them from their HTML editing and
> > also because it opens up the wacky world of extended, non-ASCII
> > characters.
> 
> I don't fully understand what you mean, but if you mean that &#40; should
> be used to escape the "(", this would not be a good idea, because in a
> nicely structured HTML/SGML implementation, the parser will replace that
> with "(" before you will have a possibility to have a look at it.

What I meant was that we need some way to differentiate between a bracket
that surrounds a qualifier name-value pair and a bracket that is really
part of the element data.  I thought that using the ISO character entity
for the latter would get round this nicely and fit in with the desire
for non-ASCII characters but if you're saying that SGML parsers will
change the &#40; to a bracket before the processing software gets a chance
to extract the metadata then that's that idea blown.  That leaves URL
style % escaping _or_ adding a leading space in front of the actual
element value.  I like the latter as I think it's easier to read but I'm
easy.  This is just a gunky syntax decision; vital but not worth arguing
over too much.
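
The difference between the options can be seen with a small Python sketch.
It uses the standard html.parser module to stand in for "an SGML parser",
and it assumes one particular reading of the leading-space convention:
"(name=value)" qualifier groups come first with nothing in front of them,
and the real value follows after a single space.  That exact syntax, the
META NAME used and the helper names are assumptions for illustration only.

```python
import re
from html.parser import HTMLParser
from urllib.parse import unquote


class MetaCollector(HTMLParser):
    """Collect the CONTENT attribute of every META tag."""

    def __init__(self):
        super().__init__()
        self.contents = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "meta" and attr_map.get("content") is not None:
            self.contents.append(attr_map["content"])


def parse_contents(html_text):
    collector = MetaCollector()
    collector.feed(html_text)
    collector.close()
    return collector.contents


# The parser decodes &#40; and &#41; itself, so the extractor only ever
# sees plain brackets and cannot tell they were escaped.
entity_form = parse_contents('<META NAME="DC.title" CONTENT="&#40;draft&#41; Report">')

# Percent escapes mean nothing to the HTML parser, so they survive and
# the extractor can decode them after it has dealt with qualifiers.
percent_form = parse_contents('<META NAME="DC.title" CONTENT="%28draft%29 Report">')

# One "(name=value)" qualifier group; the exact syntax is an assumption.
QUALIFIER = re.compile(r"\(([^=()\s]+)=([^()]*)\)")


def split_qualifiers(content):
    """Split leading qualifier groups from the value (leading-space rule)."""
    qualifiers = {}
    pos = 0
    while True:
        match = QUALIFIER.match(content, pos)
        if match is None:
            break
        qualifiers[match.group(1)] = match.group(2)
        pos = match.end()
    value = content[pos:]
    # Drop the single separating space in front of the real value.
    if value.startswith(" "):
        value = value[1:]
    return qualifiers, value


print(entity_form)                    # ['(draft) Report']
print(unquote(percent_form[0]))       # (draft) Report
print(split_qualifiers("(LANGUAGE=en) (draft) Report"))
print(split_qualifiers(" (a=b) is literal text"))
```

With the leading-space rule, a value that really does begin with a
bracketed "name=value" lookalike just needs a space in front of it, which
is what the last call shows.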

Tatty bye,

Jim'll

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Jon "Jim'll" Knight, Researcher, Sysop and General Dogsbody, Dept. Computer
Studies, Loughborough University of Technology, Leics., ENGLAND.  LE11 3TU.
* I've found I now dream in Perl.  More worryingly, I enjoy those dreams. *