On Wed, 19 Feb 1997, Misha Wolf wrote:
> At the very least, the Appendix needs to say something along the following
> lines (this is written quickly and may need refining):
>
> The three characters:
>
> " and & and >
>
> MUST NOT be included in an element value unless they are escaped using
> HTML entity names or numeric character references. These are:
>
> Character   Entity name   Numeric character reference
>
> "           &quot;        &#34;
> &           &amp;         &#38;
> >           &gt;          &#62;
That sounds groovy to me. I was talking to Dave Beckett the other day and
he suggested that I18N of DC metadata would be nudged in the right
direction if we assume that the default charset is ISO-8859-1 (or better
yet the full Unicode 2.0, though not much implements that at the moment)
with an encoding of UTF-8 (and say a default language of International
English). That would let us all use lots of funky character escapes like
the above in DC metadata and would fit in nicely with the development of
HTML (which is where lots of DC metadata is likely to appear after all).
For a basic but useful list of such character escapes, see
<URL:http://www.natural-innovations.com/boo/doc-charset.html>. More info
about Unicode is available from
<URL:http://www.cm.spyglass.com/unicode.html>
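
As a rough illustration of the escaping rule quoted above, here is a minimal Python sketch. The function name escape_value and the numeric flag are made up for this example and are not part of any DC tool or specification; the entity names and numeric references are the standard HTML ones for the three characters.

```python
# Hypothetical helper (not from any DC specification): escape the three
# reserved characters in an element value.
NAMED = {'"': "&quot;", "&": "&amp;", ">": "&gt;"}
NUMERIC = {'"': "&#34;", "&": "&#38;", ">": "&#62;"}


def escape_value(value, numeric=False):
    """Return value with ", & and > replaced by HTML escapes.

    Working one character at a time means the "&" that starts an
    escape we have just produced is never escaped a second time.
    """
    table = NUMERIC if numeric else NAMED
    return "".join(table.get(ch, ch) for ch in value)


if __name__ == "__main__":
    print(escape_value('Smith & Jones > "Metadata"'))
    print(escape_value('Smith & Jones > "Metadata"', numeric=True))
```

Either form should be acceptable to an HTML 2.0/3.2 parser; which one a tool emits is a matter of taste.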
> Of course, an alternative wording, which would work just as well but would
> inconvenience more people would be:
>
> The three characters:
>
> " and & and >
>
> MUST NOT be included in an element value.
I don't think that's a good idea as I can imagine (nay, I've _seen_)
examples where people want to include just those characters.
> My previous mail also raised a problem related to the (subsequent) inclusion
> of a syntax for qualifiers. If this ends up being, say:
>
> Element : Relation
> Value : (Scheme=URN)(Type=ParentOf)http://www.oclc.org/
>
> then we have to deal with the problem of an element value commencing with a
> "(". The currently preferred approach is to escape this character using the
> URL mechanism of "%hh" where "hh" are the two hex digits of the octet
> representing the escaped character in ASCII. So, "(" would be encoded as
> "%28". Any leading "%" character would itself need escaping, as "%25".
I think we originally said that we'd leave a space between any bracketed
qualifier pairs and the real value, though Dave also suggested %
escaping (see
<URL:http://www.roads.lut.ac.uk/lists/meta2/0132.html> and the ensuing
discussion). Looking back now, I think that ISO & escapes are far better,
partly because lots of people are used to them from their HTML editing and
also because it opens up the wacky world of extended, non-ASCII
characters.
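
For comparison, the "%hh" approach quoted above could be sketched like this in Python. This is only an illustration of the rule as Misha words it (escape a leading "(" as "%28" and a leading "%" as "%25"); the function names are invented for the example, and the unescape step is my assumption about how a reader of the metadata would reverse it.

```python
# Hypothetical helpers illustrating the leading-character "%hh" rule.

def escape_leading(value):
    """Escape a leading "(" or "%" so the value cannot be mistaken
    for the start of a (Name=Value) qualifier."""
    if value[:1] in ("(", "%"):
        return "%{:02X}".format(ord(value[0])) + value[1:]
    return value


def unescape_leading(value):
    """Reverse escape_leading (assumed decoding step)."""
    if value[:3] in ("%28", "%25"):
        return chr(int(value[1:3], 16)) + value[3:]
    return value


if __name__ == "__main__":
    for raw in ["(draft) title", "%off", "plain"]:
        escaped = escape_leading(raw)
        print(raw, "->", escaped, "->", unescape_leading(escaped))
```

Only the first character is touched, so a "(" or "%" later in the value is left alone.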
As long as it's valid HTML 2.0/3.2 and fairly reasonable I'm happy (broken
record eh folks?).
Tatty bye,
Jim'll
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Jon "Jim'll" Knight, Researcher, Sysop and General Dogsbody, Dept. Computer
Studies, Loughborough University of Technology, Leics., ENGLAND. LE11 3TU.
* I've found I now dream in Perl. More worryingly, I enjoy those dreams. *