On Thu, 27 Feb 1997, Martin J. Duerst wrote:
> Would you be happy if the above were expanded as follows:
>
> "The default language is Russian (just an example), but if authors
> are really keen on using something else, such as French, English,
> Chinese, German, Arabic, and so on, they should tag their documents
> appropriately."
Deliriously happy. As long as I know what the default is, I'm chuffed.
I'd be even more chuffed if the default was International English purely
from a practical perspective but if that is really going to ruffle too
many feathers I can live with any major language.
> > My point is that not having defaults will make the situation worse, not
> > better.
>
> Francois and I have tried to explain to you that in the past,
> in many examples, the following happened:
>
> - Default is established or implicitly assumed
> - Default is fine with a large community
> - Software not handling anything but the default is built
> - The rest of the world wants to get in quickly
> - The rest of the world prefers wrongly untagged information
> to correctly tagged, but not yet working stuff
> - This works locally because of common assumptions
> - The data producers think everything is fine because they
> don't see the problem
> - Hacks have to be introduced to allow the end user to guess
> the tagging
If the rest of the world is keen on using their languages and character
sets, why aren't they writing software to handle it if the standards allow
it? Why is this? I think we can safely assume that people outside of the
US and Western Europe can write software and have machines to write it on
(Japan has quite a few I believe). Could it be that, shock horror, they
don't really care too much about using their local charsets and languages
and just want to make their information available to a much wider
audience? Our Japanese chum on other mailing lists and some European
researchers I've talked to tend to strengthen that view for me.
> If you can show examples where the establishment of a language
> and/or charset default has not led to the above problems, I
> would really be grateful. A general statement like "not having
> defaults will make the situation worse, not better" is not
> worth much without real examples of how it actually works.
My email user agent. By default it assumes that email is US-ASCII but if
it notices another charset in use it either displays it (in the case of
some of the ISO-8859-x charsets) or lets me know that the message is in
some other charset and that it might not display correctly. Without a
default charset, what character set should my MUA have used to display
messages with no MIME charset?
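
A minimal sketch of the behaviour described above, in Python, assuming a
hypothetical set of charsets the terminal can render (DISPLAYABLE) and a
hypothetical helper name (display_plan). It is not the code of any real MUA;
it only illustrates "fall back to US-ASCII when no MIME charset is given,
display what we can, warn about the rest".

```python
from email import message_from_string

# Assumption: the default applied when a message declares no charset.
DEFAULT_CHARSET = "us-ascii"

# Assumption: charsets this hypothetical MUA can render directly.
DISPLAYABLE = {"us-ascii", "iso-8859-1", "iso-8859-2", "iso-8859-15"}


def display_plan(raw_message):
    """Return (charset, action) for a raw RFC 822 style message.

    action is "display" when the charset can be shown as-is and
    "warn" when the user should be told it might not display correctly.
    """
    msg = message_from_string(raw_message)
    # get_content_charset() returns None when no charset parameter is present.
    declared = msg.get_content_charset()
    charset = (declared or DEFAULT_CHARSET).lower()
    if charset in DISPLAYABLE:
        return charset, "display"
    return charset, "warn"
```

With no Content-Type header at all, the function falls back to the default
rather than refusing to show the message, which is the point being made.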
> "The default character encoding of the Dublin Core is XXXXX."
>
> Can you please tell me what this should apply to, in your eyes?
> Please be as specific as possible.
I'm losing you again here; the default character encoding of Dublin Core
is going to depend upon the concrete representation in use. In the case
of HTML (which is what we've been discussing), I'd say UTF-8 to fit in
with the localization of HTML (I'm really going off the term
Internationalization in a big way now; we seem to be talking more and more
about letting systems be localized to be more useful/friendly to a local
audience than internationalized so that an international audience can
use them).
> In my eyes, what the document should say is the following:
>
> - Wherever DC metadata is embedded in other formats, the
> character encoding of the DC metadata is the character
> encoding of the enclosing format.
Yep.
> - When designing a DC metadata embedding or a format or interface
> for DC metadata only, care should be taken that the
> Universal Character Set of ISO 10646 can be fully represented
> in some way. [we might want to specify a standard way to
> encode ISO 10646 characters with some escape mechanism
> if the native format doesn't allow the representation
> of all characters]
Yep (so ISO10646 looks like a default charset here. Yippee at last
someone agrees with me!)
> - When designing a format or interface for DC metadata only,
> a single character encoding, preferably the UTF-8 or
> UTF-16 form of ISO 10646, should be chosen.
Yep.
> To have the character encoding unknown is clearly unacceptable.
At last; agreement. So that's the default charset of "unknown" discarded
and ISO-10646 as the default. Good; some progress.
> But for language, I think the only real solution is to say that
> "no tag" means "don't know".
So processing that relies on knowing the language can't use that metadata
and just has to junk it? Isn't that going to restrict the spread of such
software seeing as I think it's highly likely that most people won't add
language qualifiers most of the time? So if the software doesn't have
much data to work on, few people will commercially develop that software
and so internationalisation is made more difficult.
> Most of the existing metadata
> in library databases and so on is not language tagged now.
> If we define a default of English or Klingon or whatever,
> these are the possible consequences:
>
> - The data will be put in DC form, and will be wrongly tagged,
> by a wrong English or whatever "default"
It wouldn't be wrongly tagged if people who are writing metadata in French
or German or whatever tag it appropriately. Which one would assume
that they will seeing as they're interested in providing localized data
to an internationalized system. Why is this statement such a problem when
the statement:
The data will be put in DC form, and will be wrongly tagged, by
a wrong ISO-10646 or whatever "default"
is not a problem (it isn't a problem because you've just said "To have the
character encoding unknown is clearly unacceptable")? If the resource was
actually in (say) ISO-8859-5 but the user was too lazy to tag it as such
or the software didn't allow it, then the output would also be stuffed up.
If the resource was in French but the user was too lazy to tag it as such
or the software didn't allow it, then the output would also be stuffed up.
> - The data can only be converted to correct DC metadata with
> enormous efforts, having a look at every item and
> deciding its language. Even US/UK libraries will
> have to do the work, because they also have foreign
> language works and English translations of
> foreign authors.
Er, I don't understand this one at all. If we have a default of
International English, the vast bulk of people providing metadata won't
have to decide on its language. Only those wishing to use a local
language will need to think about tagging it as such. Without a default
language, everyone will always have to tag languages and metadata with no
language tag will have to be discarded.
> - This valuable data will never make it into DC metadata,
> because nobody wants to do the work, and nobody
> wants to be incorrect.
So it's too much work for people using a non-default language to tag it as
such if we have a default defined, but it's not too much work to force
everyone to specify the language in use? That doesn't sound right to me
at all! What you're saying is that forcing everyone to do more work is
easier than forcing a smaller community to do more work.
> If you can seriously explain to me why a default for language
> would be a good thing, and how the above problems could be
> avoided, I'm looking forward to your answer. But please
> don't just restate "unknown default languages just isn't
> acceptable".
A default language is a good thing because it means that software that
relies on knowing the input language will actually be able to
automatically process the metadata successfully. If the language is
unknown, it can't without human intervention and even then it might not be
successful (because I for one can't recognise what language most
non-English documents other than German are written in). Things like
automatic translation and speech synthesis are examples of applications
that might well make use of this.
> As you can see above, there are very serious
> reasons for having an explicit default of "unknown" for
> languages (which is not exactly the same as an unknown
> default).
OK, I've got a question for you now: you're writing an indexing engine
that's working in a multilingual environment. You want to be able to let
people search in their local language so you want to be able to do some
simple translations of keywords to a variety of different natural
languages. You get some DC without a language qualifier out of a web
document (you can't force people to have a language qualifier and there's
already a lot of DC out there with no language qualifier). What would _you_
do with it if there is no default language defined? The only option I can
see would be to discard it. Which is Not A Good Thing(tm) to my mind.
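
To make the two outcomes concrete, here is a minimal Python sketch of the
indexing step, assuming each DC record is a plain dict with an optional
"lang" key (a made-up representation, not any real DC syntax) and a
hypothetical function name bucket_by_language. With a default language the
untagged records get indexed; with no default they end up discarded.

```python
def bucket_by_language(records, default_lang=None):
    """Group records by language for per-language keyword translation.

    records: iterable of dicts; "lang" is optional (assumed layout).
    default_lang: language assumed for untagged records, or None to
    model "no tag means don't know".
    Returns (indexed, discarded).
    """
    indexed = {}
    discarded = []
    for record in records:
        lang = record.get("lang") or default_lang
        if lang is None:
            # No tag and no default: nothing language-dependent can be done.
            discarded.append(record)
        else:
            indexed.setdefault(lang, []).append(record)
    return indexed, discarded


# Hypothetical sample data for illustration only.
sample = [
    {"title": "Network metadata", "lang": "en"},
    {"title": "Metadonnees", "lang": "fr"},
    {"title": "Untitled report"},
]
```

Calling bucket_by_language(sample) discards the untagged record, while
bucket_by_language(sample, default_lang="en") indexes it as English, rightly
or wrongly, which is exactly the trade-off under discussion.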
Tatty bye,
Jim'll
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Jon "Jim'll" Knight, Researcher, Sysop and General Dogsbody, Dept. Computer
Studies, Loughborough University of Technology, Leics., ENGLAND. LE11 3TU.
* I've found I now dream in Perl. More worryingly, I enjoy those dreams. *