JISCMail - DC-GENERAL Archives

On Thu, 27 Feb 1997, Jon Knight wrote:

> On Thu, 27 Feb 1997, Martin J. Duerst wrote:

> If the rest of the world is keen on using their languages and character
> sets, why aren't they writing software to handle it if the standards allow
> it?

They are. But what we want is not Japanese software to handle
Japanese documents, Korean software to handle Korean documents,...
What we want is general software to handle all kinds of documents.

> Why is this?  I think we can safely assume that people outside of the
> US and Western Europe can write software and have machines to write it on
> (Japan has quite a few I believe).  Could it be that, shock horror, they
> don't really care too much about using their local charsets and languages
> and just want to make their information available to a much wider
> audience?  Our Japanese chum on other mailing lists and some European 
> researchers I've talked to tend to strengthen that view for me.

If you don't speak Japanese, how do you know? If you only spoke
Japanese, would you participate in an English mailing list?


> > If you can show examples where the establishment of a language
> > and/or charset default has not led to the above problems, I
> > would really be grateful. A general statement like "not having
> > defaults will make the situation worse, not better" is not
> > worth much without real examples of how it actually works.
> 
> My email user agent.  By default it assumes that email is US-ASCII but if
> it notices another charset in use it either displays it (in the case of
> some of the ISO-8859-x charsets) or lets me know that the message is in
> some other charset and that it might not display correctly.  Without a
> default charset, what character set should my MUA have used to display
> messages with no MIME charset?

Mine uses iso-2022-jp, because that includes ASCII, and because
I get many messages in Japanese that aren't MIME-tagged. The
practice of sending messages with all kinds of character
encodings significantly predates the definition and introduction
of the MIME standard.
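
A minimal sketch of such a fallback, assuming a hypothetical helper
decode_body() and Python's built-in "iso2022_jp" codec. It is not how
any particular mail user agent is implemented; it only shows that a
fallback which includes ASCII leaves plain ASCII mail untouched.

```python
from typing import Optional

# Assumption: the charset this (hypothetical) user agent applies to
# untagged mail. Plain ASCII text decodes unchanged under ISO-2022-JP.
FALLBACK_CHARSET = "iso2022_jp"


def decode_body(raw: bytes, declared_charset: Optional[str] = None) -> str:
    """Decode a message body, preferring the declared MIME charset."""
    charset = declared_charset or FALLBACK_CHARSET
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name or undecodable bytes: show what we can
        # and mark the rest with replacement characters.
        return raw.decode("ascii", errors="replace")


ascii_mail = b"Hello, world"
japanese_mail = "日本語のメール".encode("iso2022_jp")
tagged_mail = "café".encode("iso-8859-1")

print(decode_body(ascii_mail))                 # untagged, plain ASCII
print(decode_body(japanese_mail))              # untagged, Japanese
print(decode_body(tagged_mail, "iso-8859-1"))  # MIME-tagged
```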


> > "The default character encoding of the Dublin Core is XXXXX."
> > 
> > Can you please tell me what this should apply to, in your eyes?
> > Please be as specific as possible.
> 
> I'm losing you again here; the default character encoding of Dublin Core
> is going to depend upon the concrete representation in use.

Very correct. So please stop asking for "the default" for charsets
(character encoding).

> In the case
> of HTML (which is what we've been discussing), I'd say UTF-8 to fit in
> with the localization of HTML

How or where did you get the idea that UTF-8 is a default for HTML?
I have been (and still am) strongly involved in HTML i18n (see
RFC 2070), and have never heard of it :-).


> (I'm really going off the term
> Internationalization in a big way now; we seem to be talking more and more
> about letting systems be localized to be more useful/friendly to a local 
> audience than internationalized so that an international audience can
> use them).

Internationalization is about making systems so that people in all
nations and across nations can use the systems the way they want,
in particular with their scripts and languages.


> > But for language, I think the only real solution is to say that
> > "no tag" means "don't know".
> 
> So processing that relies on knowing the language can't use that metadata
> and just has to junk it.  Isn't that going to restrict the spread of such
> software seeing as I think it's highly likely that most people won't add
> language qualifiers most of the time?

If that's highly likely to you, what do you prefer:

- That with a default of Klingon, more than 99% of the items are incorrectly tagged.
- That with a default of English, something between 60% and 80% of the items
	are incorrectly tagged.
- That with a default of "Unknown", something below 1% of the items are
	incorrectly tagged.
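
The reasoning behind these three options can be sketched as a small
calculation. The collection below is invented purely for illustration
and its proportions are not measurements; mistagged_fraction() is a
hypothetical helper. The sketch also simplifies by treating "unknown"
as never wrong, since it makes no claim about the language.

```python
from typing import List, Optional


def mistagged_fraction(true_langs: List[str], default: Optional[str]) -> float:
    """Fraction of untagged items that end up with a wrong language tag.

    `default` is the language assumed for every untagged item. None
    means the item is recorded as "unknown", which asserts nothing and
    so cannot be incorrect.
    """
    if default is None or not true_langs:
        return 0.0
    wrong = sum(1 for lang in true_langs if lang != default)
    return wrong / len(true_langs)


# Hypothetical true languages of ten untagged items.
collection = ["en", "en", "fr", "de", "ja", "en", "ko", "fr", "en", "en"]

print(mistagged_fraction(collection, "tlh"))  # default of Klingon
print(mistagged_fraction(collection, "en"))   # default of English
print(mistagged_fraction(collection, None))   # default of "unknown"
```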


> If the resource was
> actually in (say) ISO-8859-5 but the user was too lazy to tag it as such
> or the software didn't allow it, then the output would also be stuffed up.
> If the resource was in French but the user was  too lazy to tag it as such
> or the software didn't allow it, then the output would also be stuffed up.

There is a big difference between character encoding information and
language information. Please answer the following questions:

(1) What language do you think the following is in:
	1997-02-28:12:18:55.657
(2) What language do you think the following is in:
	Martin T. Heller
(3) Would you be able to answer the above questions if your system
	didn't know whether the mail I sent you was in ASCII or
	in EBCDIC?
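
Question (3) can be tried out directly. This is a small sketch that
assumes Python's "cp037" codec as a stand-in for EBCDIC (it is one of
several EBCDIC code pages) and uses "latin-1" only as a decoder that
accepts every byte, to show what a wrong guess produces.

```python
text = "1997-02-28:12:18:55.657"

ascii_bytes = text.encode("ascii")
ebcdic_bytes = text.encode("cp037")  # assumption: cp037 as the EBCDIC variant

# The same characters are represented by different bytes.
print(ascii_bytes != ebcdic_bytes)

# Reading the EBCDIC bytes with the wrong table yields unrelated
# characters, so the content cannot even be looked at, whatever
# language it may be in.
misread = ebcdic_bytes.decode("latin-1")
print(repr(misread))

# With the right encoding the text comes back intact, and it still
# has no language.
recovered = ebcdic_bytes.decode("cp037")
print(recovered == text)
```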




> > - The data can only be converted to correct DC metadata with
> > 	enormous efforts, having a look at every item and
> > 	deciding its language. Even US/UK libraries will
> > 	have to do the work, because they also have foreign
> > 	language works and English translations of
> > 	foreign authors.
> 
> Er, I don't understand this one at all.  If we have a default of
> International English, the vast bulk of people providing metadata won't
> have to decide on its language.

Would you please care to go to your local community public library
and ask them whether they have books e.g. in French or German?
And whether their computer system knows which books are in
foreign languages? And whether their computer system knows
which foreign books actually have English titles, and which
English books have foreign (e.g. French) titles? And whether
their system knows which English books have foreign authors,...
And please then decide whether you would be able to write a
little program to convert their data to 100% correctly
language tagged metadata.


> Only those wishing to use a local
> language will need to think about tagging it as such.  Without a default
> language, everyone will always have to tag languages and metadata with no
> language tag will have to be discarded.

Metadata without language tag will not be discarded. It will be
searched, displayed, printed, read, and so on. It will be a tad
less useful than language-tagged metadata, but the difference
won't be much.


> > - This valuable data will never make it into DC metadata,
> > 	because nobody wants to do the work, and nobody
> > 	wants to be incorrect.
> 
> So it's too much work for people using a non-default language to tag it as
> such if we have a default defined, but it's not too much work to force
> everyone to specify the language in use?  That doesn't sound right to me
> at all!  What you're saying is that forcing everyone to do more work is
> easier than forcing a smaller community to do more work.

No. If those in the English community are really sure the only thing
they deal with is English, then it's peanuts to do the tagging
automatically. But if they aren't 100% sure, they had better tag it
"unknown" than "English" to stay on the safe side.


> > If you can seriously explain to me why a default for language
> > would be a good thing, and how the above problems could be
> > avoided, I'm looking forward to your answer. But please
> > don't just restate "unknown default languages just isn't
> > acceptable".
> 
> A default language is a good thing because it means that software that
> relies on knowing the input language will actually be able to
> automatically process the metadata successfully.  If the language is
> unknown, it can't without human intervention and even then it might not be
> successful (because I for one can't recognise what language most
> non-English documents other than German are written in).

Well, for long documents in a single language, it's fairly easy to
write programs with highly accurate heuristics. But as we are
dealing with metadata (titles, author's names,...), that's
not so easy. In some cases, that's even difficult for experts
knowing all the languages in question!
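
To illustrate the difference, here is a deliberately naive sketch of a
stopword-counting heuristic. The word lists, the threshold, and the
function name guess_language() are all assumptions made up for this
example; a real detector would use far more data. It finds enough
evidence in a full sentence and gives up on a name or a short title.

```python
from typing import Optional

# Tiny hypothetical stopword lists, for illustration only.
STOPWORDS = {
    "en": {"the", "and", "of", "is", "to", "in"},
    "fr": {"le", "la", "et", "les", "des", "est"},
    "de": {"der", "die", "und", "das", "ist", "nicht"},
}


def guess_language(text: str, min_hits: int = 2) -> Optional[str]:
    """Return the best-matching language code, or None if evidence is thin."""
    words = text.lower().split()
    scores = {
        lang: sum(1 for w in words if w in stops)
        for lang, stops in STOPWORDS.items()
    }
    top = max(scores.values())
    winners = [lang for lang, score in scores.items() if score == top]
    # Require a minimum number of stopword hits and a unique winner;
    # otherwise report "unknown" instead of guessing.
    if top < min_hits or len(winners) != 1:
        return None
    return winners[0]


print(guess_language("The cat is in the garden and the dog is not"))
print(guess_language("Martin T. Heller"))
print(guess_language("Les Misérables"))
```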


> > As you can see above, there are very serious
> > reasons for having an explicit default of "unknown" for
> > languages (which is not exactly the same as an unknown
> > default).
> 
> OK, I've got a question for you now: you're writing an indexing engine
> that's working in a multilingual environment.  You want to be able to let
> people search in their local language so you want to be able to do some
> simple translations of keywords to a variety of different natural
> languages.

You mean a query like "give me the documents in languages A, B, and C
that contain something about what is called X in English"?


> You get some DC without a language qualifier out of a web
> document (you can't force people to have a language qualifier and there's
> already a lot of DC out there with no language qualifier). What would _you_
> do with it if there is no default language defined?  The only option I can
> see would be to discard it.  Which is Not A Good Thing(tm) to my mind. 

I would do the following: Translate X to languages A, B, and C
(assuming that's possible), then look for these in document
titles, keyword lists,... of documents known to be in A, B, or
C, in titles, keyword lists,... known to be in A, B, and C,
and in titles, keyword lists,... of unknown language or of
documents of unknown language. I would get some wrong hits,
but would find all right hits. And because even for a very
wide range of languages written with the same script, the
overlap of words is not extremely high, the number of wrong
hits wouldn't be too high.
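
A rough sketch of that strategy, under stated assumptions: the Record
type, the search() function, and the sample data are hypothetical,
only titles are searched, and the translations of X are taken as
given. Records tagged with a language outside A, B, and C are skipped;
untagged records are kept and searched.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Record:
    title: str
    lang: Optional[str] = None  # None means "unknown", not a default


def search(records: List[Record], translations: Dict[str, str]) -> List[Record]:
    """Return records whose title contains any translation of the query.

    `translations` maps each wanted language code to the query term in
    that language.
    """
    wanted = set(translations)
    terms = [term.lower() for term in translations.values()]
    hits = []
    for rec in records:
        # Skip only records known to be in some other language.
        if rec.lang is not None and rec.lang not in wanted:
            continue
        title = rec.title.lower()
        # Plain substring matching: may give some wrong hits on
        # untagged records, but does not miss right ones.
        if any(term in title for term in terms):
            hits.append(rec)
    return hits


records = [
    Record("Bibliothèque nationale", "fr"),
    Record("Die Bibliothek von Babel", "de"),
    Record("Library science", "en"),
    Record("Bibliothek und Archiv"),   # no language tag
    Record("Cooking for beginners"),   # no language tag
]
translations = {"fr": "bibliothèque", "de": "bibliothek"}

for rec in search(records, translations):
    print(rec.title, rec.lang)
```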

Regards,	Martin.