I forwarded Jon's mail of 1997-02-23, regarding charset and language
defaults, to François Yergeau of Alis Technologies Inc., who is one of the
co-authors of RFC 2070 - "Internationalization of the Hypertext Markup
Language". François's reply follows.
Misha
---
At 19:15 23-02-97 +0000, Misha Wolf wrote:
>Any thoughts on how to respond to Jon?
Point him to the current situation with HTTP/HTML.
In the early days, 8859-1 was the only charset around. Fine.
As soon as other languages appeared on the Web, other charsets were used
and chaos began. The answer was to declare 8859-1 as the default and force
the others to use a charset parameter in HTTP headers. Fine in principle,
as it allowed then existing software to continue to work, but in practice
server implementers (mostly Western) were happy with 8859-1 and never made
their servers produce the required charset parameter.
Today 8859-1 is still officially the default, but you get "octets from any
old charset in your [meta]data" and you have no clue as to what this
charset is.
Lesson: don't specify a default charset; just force everyone to label data
correctly. Don't let part of the community get away with not labelling; it
will not work, especially if that part is those who produce the software.
Having defaults does not increase interoperability; it just encourages
laziness.
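
A minimal sketch of the "no default, always label" rule, in Python. It
assumes the label arrives as an HTTP-style Content-Type value; the names
decode_labelled and MissingCharsetError are hypothetical, chosen only for
illustration. The point is that a missing charset parameter is treated as
an error instead of being silently read as 8859-1.

```python
from email.message import Message


class MissingCharsetError(ValueError):
    """Raised when text arrives without an explicit charset label."""


def decode_labelled(body: bytes, content_type: str) -> str:
    """Decode body only if content_type carries a charset parameter."""
    # Let the standard library's MIME parser split out the parameters.
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    if charset is None:
        # No default: refuse to guess rather than assume ISO-8859-1.
        raise MissingCharsetError(f"no charset parameter in {content_type!r}")
    return body.decode(charset)


if __name__ == "__main__":
    labelled = "Montréal".encode("utf-8")
    print(decode_labelled(labelled, "text/html; charset=UTF-8"))
    try:
        decode_labelled(labelled, "text/html")
    except MissingCharsetError as err:
        print("rejected:", err)
```

Rejecting unlabelled data outright is the strictest reading of the lesson;
a real consumer might instead quarantine such records for manual review.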
Yet specifying 10646 (in some specific encoding) may work, since it is not
in wide use today, and does not provide a free lunch to anyone. It is also
forward-looking, encourages a universal solution and helps with embedding
in HTML.
>So if we get some DC metadata turn up with no Charset qualifier, how
>exactly should software process this? If you're going to index this data
>and then make it searchable (possibly via a multilingual front end), how
>do you interpret what is effectively an octet stream in order to do
>matching? Is it from ISO-8859-1? One of the other ISO-8859-x char sets?
>ISO-10646? Big5? Etc, etc, etc.
Exactly the current situation with HTTP, a result of having declared 8859-1
the default. Don't.
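
To make the ambiguity concrete, here is a small Python sketch showing that
one and the same octet reads as a different character under each of several
ISO-8859-x charsets. The function name candidate_readings and the choice of
octet 0xE6 are illustrative assumptions only; nothing in the octet itself
says which reading was intended.

```python
def candidate_readings(octets: bytes, charsets: list[str]) -> dict[str, str]:
    """Decode the same octets under each candidate charset."""
    return {name: octets.decode(name) for name in charsets}


if __name__ == "__main__":
    # One unlabelled octet, four equally valid interpretations.
    charsets = ["iso-8859-1", "iso-8859-2", "iso-8859-5", "iso-8859-7"]
    for name, text in candidate_readings(b"\xe6", charsets).items():
        print(f"{name}: U+{ord(text):04X} {text}")
```

An indexer that guesses wrong will store and match the wrong characters,
and no amount of inspection of a short field will tell it so.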
>> >with an encoding of UTF-8 (and say a default language of International
>> >English).
>>
>> Where's that bar of soap? No default languages.
>
>Again my mouth is more than clean enough thank you. I think I've argued
>that point before and I'm sticking to it. I want to know what the default
>interpretation of the DC elements is so that any software I write can do
>sensible things with them.
Same argument, but this time there's not even a 10646 equivalent to serve
as a universal language. The only solution is no default, otherwise you
WILL get text in any language, unlabelled, which you will assume to be
English but won't be.
Regards,
--
François Yergeau
Alis Technologies Inc., Montréal
Tel.: +1 (514) 747-2547
Fax: +1 (514) 747-2561