On Thu, 27 Feb 1997, Jon Knight wrote:
> On Thu, 27 Feb 1997, Martin J. Duerst wrote:
> > Jon - How many times have you had a look at Japanese or Korean documents,
> > or anything else outside ISO-8859-1? If you had, you would know very well
> > that if a document isn't tagged, you *DON'T KNOW* that it's iso-8859-1!
>
> So it should be tagged to tell us that it is in Big-5 or whatever.
Jon - The recent agreement on the HTTP list has been that everything,
including iso-8859-1, should be tagged. That opinion didn't come
from some i18n enthusiast such as myself, but from a long-standing
HTTP specialist who is usually rather critical of i18n enthusiasm.
If you want, I can forward that contribution to you.
> > > I think that the lack of non-ISO-8859-1/English documents is just due to
> > > either lack of demand, difficulty creating them or tools to handle them.
> >
> > There is absolutely no lack of such documents. The only lack that we
> > have is that most of these documents aren't correctly tagged, i.e.
> > aren't tagged at all.
>
> Well I certainly don't come across them very often. Maybe I just use the
> wrong search terms or something.
If you don't know any of these languages, how would you use the correct
search terms? Cross-lingual information retrieval unfortunately is still
in its infancy.
> > There are many such "worldwide interest" sites. But there are also many
> > sites that are more local. I don't have any estimates, but I think you
> > are very seriously underestimating their importance.
>
> I don't think I am, but I do think that you're underestimating the
> importance of International English as the de facto default for
> international trade and scholarly communication.
I definitely don't underestimate the importance of International
English! But the web can be used, and actually IS used, for a
lot of regional and local communication. Just because you can
communicate with the whole world on the Internet doesn't mean
that you have to :-).
> > You don't know how to process metadata for which you don't know
> > the character encoding. But as I explained, the character encoding
> > is usually given by the container that carries the metadata.
>
> But you do know the character encoding, charset and language if you've got
> defaults. That's the whole point of my argument.
You only know these things if you have defaults AND the
data is correctly labeled.
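A minimal sketch of why a default alone is not enough, written in Python.
The helper name `decode_body` and the sample string are assumptions made
up for illustration and do not come from any HTTP library. Untagged
Shift_JIS bytes decode under an iso-8859-1 default without any error, so
the receiver cannot tell that the result is wrong:

```python
from typing import Optional


def decode_body(raw: bytes, charset: Optional[str] = None,
                default: str = "iso-8859-1") -> str:
    """Decode using the tagged charset, falling back to the default.

    Hypothetical helper: it only illustrates the argument above.
    """
    return raw.decode(charset or default)


# A Japanese string as a server might send it, encoded in Shift_JIS.
original = "日本語"
raw = original.encode("shift_jis")

# Untagged: the iso-8859-1 default is applied. Every byte value is valid
# in iso-8859-1, so no error is raised, but the text is not the original.
untagged = decode_body(raw)

# Tagged: the label tells the receiver which decoder to use.
tagged = decode_body(raw, charset="shift_jis")

print(untagged == original)  # False
print(tagged == original)    # True
```

The default only gives the right answer for data that really is in the
default encoding, which is exactly what the receiver cannot verify
without a label.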
> > Of course, without knowing the language of the name "Knight", you
> > may have difficulties for operations such as translating the name
> > to another language, but I seriously doubt whether this is such an
> > important operation.
>
> Maybe not at the moment but I'd rather not constrain future developments
> with bad choices now. And automatic translation is a development that
> would _really_ help make the WWW much more friendly for everyone (you can
> write something in your native tongue and I can read it in mine; that's real
> I18N to my mind). As you've agreed that not having a known default makes
> that one operation much more difficult, hopefully we can now agree that we
> do need a default (in fact, didn't you suggest wording to just that effect
> in a previous email? Violent agreement again.... :-) ).
I am definitely not against language tagging, or against making operations
such as the above possible. But let's take a serious look at currently
existing metadata and the metadata that will be produced in the near
future. What are your estimates of the percentages for the following:
- Metadata is known to be in English with 100% certainty.
- Metadata is known to be in French with 100% certainty.
- and so on for other languages ...
- Metadata is in a language-independent form, so that language doesn't apply.
- Metadata is in a form where language applies, but more than one
  language may apply.
- Metadata is in a certain language, but this language cannot be
  decided with 100% certainty without manual intervention.
My estimate is that the last three points cover at least 99%.
This is why I propose the default language "unknown/not applicable".
Regards, Martin.