On Wed, 19 Feb 1997, Jon Knight wrote:
> On Wed, 19 Feb 1997, Misha Wolf wrote:
> > At the very least, the Appendix needs to say something along the following
> > lines (this is written quickly and may need refining):
> >
> > The three characters:
> >
> > " and & and >
> >
> > MUST NOT be included in an element value unless they are escaped using
> > HTML entity names or numeric character references. These are:
> >
> > Character   Entity name   Numeric character reference
> >
> > "           &quot;        &#34;
> > &           &amp;         &#38;
> > >           &gt;          &#62;
It does not have to be so explicit. It is enough to say that if
the text is placed in HTML attribute values, the necessary
conventions for RCDATA have to be observed.
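As a rough illustration of what observing those conventions amounts to,
here is a minimal Python sketch. The function name meta_tag and the sample
value are made up for this example; html.escape is the standard library
call, and it escapes a few more characters than the three listed above.

```python
import html

def meta_tag(name: str, value: str) -> str:
    """Build a META element with the value escaped for use in an attribute.

    Hypothetical helper for illustration only; it is not part of any
    Dublin Core proposal.
    """
    # quote=True escapes the double quote as well as &, < and >.
    safe_name = html.escape(name, quote=True)
    safe_value = html.escape(value, quote=True)
    return '<meta name="{}" content="{}">'.format(safe_name, safe_value)

print(meta_tag("DC.title", 'Q&A: "x" > y'))
```

The point is only that the escaping is a mechanical step applied when the
value is written into HTML, so the specification need not spell it out.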
> That sounds groovy to me. I was talking to Dave Beckett the other day and
> he suggested that I18N of DC metadata would be nudged in the right
> direction if we assume that the default charset is ISO-8859-1 (or better
> yet the full Unicode 2.0, though not much implements that at the moment)
> with an encoding of UTF-8
I would strongly advise thinking character encoding and international
issues for DC metadata through fully at this stage. Nudging can turn
out lucky or not. Of the three representations discussed
(sorry, I forgot their names; in the following I call them
Abstract, Intermediate, and HTML), the middle one is particularly
unclear.
For the abstract, logical representation, which does not
exist inside computers, it's okay to assume any characters can be
used.
For HTML, the issue is also very clear (at least to me :-).
Except for the %-escaping of the "(" and the "%" itself, due
to the syntax chosen, and the escaping of ", &, and
(for some legacy browsers) >, which is taken care of by SGML/HTML
mechanisms, the native encoding of the HTML document should be used.
Each HTML document, on the wire or stored in a file, has its
character encoding (denoted with a MIME "charset" parameter).
Also, due to the specification of ISO 10646 as the "document
character set" in the SGML sense, characters that cannot
be encoded directly in the native encoding can be denoted
by &#nnnn; (a numeric character reference; the nnnn is decimal!),
which is interpreted (as envisioned in HTML 2.0 (RFC 1866) and
specified in HTML i18n (RFC 2070), being integrated in the next
release of W3C HTML, codenamed Cougar) in terms of ISO 10646
(aka Unicode). Thus any HTML document, in whatever native encoding,
can contain the full set of Unicode characters. What is more, by
relying mainly on the native encoding, the document can be read and
edited with the usual plain-text editors and is stable under transcoding.
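To make the mechanism concrete, here is a small Python sketch. The helper
names to_native and from_native are invented for this example; the
"xmlcharrefreplace" error handler is a standard Python codec feature that
emits decimal &#nnnn; references for characters the target encoding cannot
represent.

```python
import html

def to_native(text: str, charset: str) -> bytes:
    """Encode text in a document's native charset (illustrative helper)."""
    # Escape markup-significant characters first, so that a literal "&"
    # in the value cannot be mistaken for the start of a reference.
    escaped = html.escape(text, quote=True)
    # Anything the charset cannot represent becomes &#nnnn; (decimal).
    return escaped.encode(charset, errors="xmlcharrefreplace")

def from_native(data: bytes, charset: str) -> str:
    """Decode the bytes and resolve character references again."""
    return html.unescape(data.decode(charset))

name = "Dürst Ω"
latin1 = to_native(name, "iso-8859-1")   # b'D\xfcrst &#937;'
ascii_ = to_native(name, "us-ascii")     # b'D&#252;rst &#937;'

# Transcoding the Latin-1 bytes to UTF-8 leaves the reference untouched.
utf8 = latin1.decode("iso-8859-1").encode("utf-8")
print(latin1, ascii_, from_native(utf8, "utf-8"))
```

In Latin-1 only the Omega needs a reference, while in US-ASCII the u-umlaut
needs one too; either way the same text comes back after decoding.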
The big question is how to treat these issues on the Intermediate
level. In terms of character encoding(s), is the intermediate level
to be defined in abstract terms (usable on paper) or with a concrete
encoding in terms of bits and bytes? This should be made clear. In
the latter case, the encoding should be clearly defined. To find a
solution, the following questions may help: What are the other places
(except HTML) where it would go? Would it exist entirely on
its own? Would there be some software parts (libraries, ...) that
would have to work on this level?
At this level, specifying a default of 8859-1 is definitely a
bad idea, as it uses up all 8 bits and does not leave space for
extensions.
> (and say a default language of International English).
I know a lot of people that would prefer this to be left unspecified.
IE is a de-facto default, it doesn't need any more encouraging.
> That would let us all use lots of funky character escapes like
> the above in DC metadata and would fit in nicely with the development of
> HTML (which is where lots of DC metadata is likely to appear after all).
> For a basic but useful list of such character escapes, see
> <URL:http://www.natural-innovations.com/boo/doc-charset.html>. More info
> about Unicode is available from
> <URL:http://www.cm.spyglass.com/unicode.html>
Please use http://www.unicode.org instead; that address is stable.
> I think we originally said that we'd leave a space between any bracketed
> qualifier pairs and the real value, though Dave also suggest the %
> escaping as well (see
> <URL:http://www.roads.lut.ac.uk/lists/meta2/0132.html> and the ensuing
> discussion). Looking back now, I think that ISO & escapes are far better,
> partly because lots of people are used to them from their HTML editing and
> also because it opens up the wacky world of extended, non-ASCII
> characters.
I don't fully understand what you mean, but if you mean that &#40; should
be used to escape the "(", this would not be a good idea, because in a
properly structured HTML/SGML implementation, the parser will replace it
with "(" before you get a chance to look at it.
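The effect is easy to demonstrate with any HTML parser. The following
Python sketch uses the standard html.parser module; the class name
MetaCollector, the helper parsed_contents, and the sample META values are
made up for this illustration.

```python
from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    """Collect the content attribute of every META element."""

    def __init__(self):
        super().__init__()
        self.contents = []

    def handle_starttag(self, tag, attrs):
        # Attribute values arrive with character references already resolved.
        if tag == "meta":
            attributes = dict(attrs)
            if "content" in attributes:
                self.contents.append(attributes["content"])

def parsed_contents(document: str) -> list:
    parser = MetaCollector()
    parser.feed(document)
    parser.close()
    return parser.contents

doc = (
    '<meta name="DC.creator" content="(scheme=x) a &#40; b">'
    '<meta name="DC.creator" content="(scheme=x) a %28 b">'
)
print(parsed_contents(doc))
```

After parsing, the first value contains a plain "(" where the reference
stood, so the application cannot tell it apart from the "(" that opens the
qualifier. The %28 in the second value passes through the parser unchanged
and is still visible to the application.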
Regards, Martin.
----
Dr.sc. Martin J. Du"rst ' , . p y f g c R l / =
Institut fu"r Informatik a o e U i D h T n S -
der Universita"t Zu"rich ; q j k x b m w v z
Winterthurerstrasse 190 (the Dvorak keyboard)
CH-8057 Zu"rich-Irchel Tel: +41 1 257 43 16
S w i t z e r l a n d Fax: +41 1 363 00 35 Email: [log in to unmask]
----