On Thu, 20 Feb 1997, Martin J. Duerst wrote:
> [Misha's stuff about using <, & and > deleted]
>
> It does not have to be so explicit. It is enough to say that if
> the text is put in HTML into attribute values, the necessary
> conventions have to be observed for RCDATA.

I think it _does_ have to be that explicit, as we want to encourage
people to use Dublin Core and not say "huh?" when they read the
I-D/RFC/whatever and find it starts talking about SGML stuff like
RCDATA. Lots of people know about the character entities because
they're used to using them in HTML documents. I would guess that not
so many have even heard of RCDATA, let alone know what it is.

> I would strongly advise to fully think over character encoding and
> international issues for DC metadata at this stage. Nudging can turn
> out lucky or not. In the three representations discussed
> (sorry, I forgot their names, I in the following call them
> Abstract, Intermediate, and HTML), the middle one is particularly
> unclear.

It is for me as well, mainly because I can only see two stages for DC:
the abstract concepts and the concrete representations.

> For the abstract, logical representation, which does not
> exist inside computers, it's okay to assume any characters can be
> used.

Yep, agreed.

> For HTML, the issue is also very clear (at least to me :-).

Ditto for me; I'd be more than happy with ISO 10646/Unicode 2.0 (though
will a certain Japanese gentleman who's currently engaging in a very
entertaining flamefest on the URI lists want this to be in ASCII or
some other ISO charset?? :-) :-) :-) ).

> The big question is how to treat these issues on the Intermediate
> level.

Right, this is where I lose you Martin; I don't know what this
"Intermediate level" is. To my mind we have an abstract form for a set
of DC metadata (the "in my head" form) and lots of concrete
representations (the "on my paper, in my computer or on my wire" form),
of which embedding in HTML is just one.
Each concrete representation should go for maximum encoding power
within the constraints of its particular syntax. Writing DC on paper
_is_ a concrete representation in my mind.

> What are the other places (except HTML) where it would go?

SGML documents in other DTDs (with some designed for carrying DC), XML,
PostScript, LaTeX, DVI, MS-Word documents (OK, maybe not MS-Word
documents - they'll use BillyCore :-) ), graphics files, IAFA
templates, WHOIS++ templates, X.500 directories, dead trees, etc, etc.
You name it. Formats without end, let 1000 flowers bloom and all that.
But getting DC into HTML documents is the most important in the near
term for "saving the world" (IMHO anyway).

> Would it exist all on and by its own?

Sometimes yes, sometimes no. I think each new concrete representation
should have its own specification document. Some data formats just
won't be able to handle embedding any metadata within them (legacy and
proprietary formats like MPX files for example).

The reason that the "embedding DC in HTML 2.0/3.2" concrete
representation is getting so much attention, and is often lumped in
with the specification of the semantics of the abstract DC elements, is
that it is the concrete representation that we need badly _now_. It
just happens to be around at the same time that we're doing the
abstract bits of DC (deciding what the element names are, what
qualifiers we need to allow more precise interpretation of element
value semantics, etc). It's also quite a good example of how to squeeze
DC metadata inside an existing data format, and so it makes sense to
have it in the initial (set of) document(s).

> Would there be some software parts (libraries,...) that
> would have to work on this level?

Yep; I'm already working on a Perl module that will suck the DC
metadata out of HTML documents and put it in a nice data structure.
More on that soon. I assume others are doing lots of similar stuff.
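
To give a feel for what such an extractor involves, here is a rough
sketch. It is in Python, not the Perl module mentioned above, and it
assumes the metadata is carried in <META NAME="DC.xxx" CONTENT="...">
tags; that naming convention, the class name and the function name are
all assumptions made for illustration.

```python
from html.parser import HTMLParser


class DCMetaExtractor(HTMLParser):
    """Collect META elements whose NAME starts with "DC." (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        # element name -> list of values, since an element may be repeated
        self.elements = {}

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag and attribute names for us.
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        content = attr.get("content")
        # Assumed convention: DC elements are named "DC.<element>".
        if name.lower().startswith("dc.") and content is not None:
            key = name[3:].lower()
            self.elements.setdefault(key, []).append(content)


def extract_dc(html_text):
    """Return a dict mapping DC element names to lists of values."""
    parser = DCMetaExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.elements


sample = """<HTML><HEAD>
<TITLE>Example</TITLE>
<META NAME="DC.title" CONTENT="Fish &amp; Chips &lt;revised&gt;">
<META NAME="DC.author" CONTENT="A. N. Other">
<META NAME="DC.author" CONTENT="Someone &#40;Else&#41;">
<META NAME="keywords" CONTENT="not DC">
</HEAD><BODY></BODY></HTML>"""

metadata = extract_dc(sample)
print(metadata)
```

Note that the parser hands over attribute values with the character
references already decoded, so the extractor sees "&" and "(" and never
the "&amp;" or "&#40;" that was typed into the document.
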
I might well do DC in LaTeX, DVI and/or PostScript at some point, as
there's a lot of that lying about on the web and I use it.

> > (and say a default language of International English).
>
> I know a lot of people that would prefer this to be left unspecified.
> IE is a de-facto default, it doesn't need any more encouraging.

Well, seeing as we agree that it is already the de facto default, it
can't really hurt to document the fact to avoid any possible future
confusion. Knowing what language something is in could really help out
in the future with automated translation, speech synthesis, etc.

The alternative is to have the user agent decide what the default
language is for metadata that doesn't have an explicit language tag,
using the locale, user preferences or context analysis (as suggested on
p8 of RFC 2070 for HTML I18N). If the majority of the unmarked pages
are going to be International English anyway (because it is the de
facto standard), then going down that route means the metadata will:

 a) be spoken/translated/etc incorrectly if the locale is not one
    where English is the default, or

 b) require the user to intervene continually (probably with most of
    them leaving it on International English anyway to get the
    majority of the metadata handled correctly), or

 c) need some damn hot context analysis software (which would mean
    that the LANGUAGE qualifier in DC and the LANG attribute in
    I18N-HTML would be superfluous, as the software could already
    distinguish what language any particular bit of text was written
    in).

Having a known default of International English just seems to be the
logical, sensible course to me.

> > I think we originally said that we'd leave a space between any bracketed
> > qualifier pairs and the real value, though Dave also suggested the %
> > escaping as well (see
> > <URL:http://www.roads.lut.ac.uk/lists/meta2/0132.html> and the ensuing
> > discussion).
> > Looking back now, I think that ISO & escapes are far better,
> > partly because lots of people are used to them from their HTML editing and
> > also because it opens up the wacky world of extended, non-ASCII
> > characters.
>
> I don't fully understand what you mean, but if you mean that &#40; should
> be used to escape the "(", this would not be a good idea, because in a
> nicely structured HTML/SGML implementation, the parser will replace that
> with "(" before you will have a possibility to have a look at it.

What I meant was that we need some way to differentiate between a
bracket that surrounds a qualifier name-value pair and a bracket that
is really part of the element data. I thought that using the ISO
character entity for the latter would get round this nicely and fit in
with the desire for non-ASCII characters, but if you're saying that
SGML parsers will change the &#40; to a bracket before the processing
software gets a chance to extract the metadata, then that's that idea
blown.

That leaves URL-style % escaping _or_ adding a leading space in front
of the actual element value. I like the latter as I think it's easier
to read, but I'm easy. This is just a gunky syntax decision; vital but
not worth arguing over too much.

Tatty bye,

Jim'll

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Jon "Jim'll" Knight, Researcher, Sysop and General Dogsbody,
Dept. Computer Studies, Loughborough University of Technology,
Leics., ENGLAND. LE11 3TU.
* I've found I now dream in Perl. More worryingly, I enjoy those dreams. *
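
P.S. The two options left standing above (a leading space before the
real value, or URL-style % escaping) could be sketched roughly as
follows. This is Python for illustration only; the (NAME=VALUE)
qualifier pattern, the function names and the choice to escape just
"%", "(" and ")" are all assumptions of the sketch, not anything that
has been agreed.

```python
import re
from urllib.parse import unquote

# Assumed qualifier shape: (NAME=VALUE) with no nested brackets.
_QUALIFIER = re.compile(r"\(([A-Za-z][A-Za-z0-9._-]*)=([^()]*)\)")


def _split_qualifiers(content):
    """Peel (NAME=VALUE) groups off the front; return (qualifiers, rest)."""
    qualifiers = {}
    pos = 0
    while True:
        match = _QUALIFIER.match(content, pos)
        if match is None:
            break
        qualifiers[match.group(1).lower()] = match.group(2)
        pos = match.end()
    return qualifiers, content[pos:]


def parse_leading_space(content):
    """Option 1: one space separates the qualifiers from the real value.

    Anything after that space is data, even if it starts with a bracket.
    """
    qualifiers, rest = _split_qualifiers(content)
    if rest.startswith(" "):
        rest = rest[1:]
    return qualifiers, rest


def escape_value(value):
    """Option 2, writer side: %-escape the characters that would confuse a reader."""
    return value.replace("%", "%25").replace("(", "%28").replace(")", "%29")


def parse_percent(content):
    """Option 2, reader side: brackets in the data arrive as %28 and %29."""
    qualifiers, rest = _split_qualifiers(content)
    return qualifiers, unquote(rest)


print(parse_leading_space("(SCHEME=ISBN) 0-123-45678-9"))
print(parse_leading_space(" (Untitled) draft"))
print(parse_percent("(SCHEME=ISBN)" + escape_value("100% (approx)")))
```

Either way the brackets survive an HTML/SGML parser untouched, which
is the thing the character entity approach could not guarantee.
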