On Fri, 28 Feb 1997, Martin J. Duerst wrote:
> > Why is this? I think we can safely assume that people outside of the
> > US and Western Europe can write software and have machines to write it on
> > (Japan has quite a few I believe). Could it be that, shock horror, they
> > don't really care too much about using their local charsets and languages
> > and just want to make their information available to a much wider
> > audience? Our Japanese chum on other mailing lists and some European
> > researchers I've talked to tend to strengthen that view for me.
>
> If you don't speak Japanese, how do you know?
Because people occasionally spam mailing lists I'm on with Romanized
versions of other (non-English, non-Latin) languages. Such as the Turkish
one that appeared on the IETF list recently. And I do speak to European
researchers (as part of our EC projects) all of whom are keen on providing
information in English and not so keen on doing it in their native
languages. This was a surprise for me at first as I thought non-native
English speakers would want resources in their native character sets but
that doesn't seem to be the case.
> If you only spoke
> Japanese, would you participate in an English mailing list?
No. And I only speak English so I don't participate in Japanese, German,
Celtic or any other language's mailing list, even though I know that
they're out there and occasionally come into accidental contact with
them (forwarded emails or the recent Turkish mailing list spam of the
main IETF mailing list for example). Your point being?
> > I'm losing you again here; the default character encoding of Dublin Core
> > is going to depend upon the concrete representation in use.
>
> Very correct. So please stop asking for "the default" for charsets
> (character encoding).
Encoding != charset. Or so I've been told by people more into I18N than I
am. Are you telling me that they're wrong?
> > In the case
> > of HTML (which is what we've been discussing), I'd say UTF-8 to fit in
> > with the localization of HTML
>
> How or where did you get to the idea that UTF-8 is a default for HTML?
> I happen to have been (and still am) strongly involved in HTML
> i18n (see RFC 2070), and have never heard about it :-).
Well actually I got the impression from RFC 2070. Check out RFC
2070, p 19. It says:
"UTF-7 [RFC1642]
and UTF-8 [UTF-8] have favorable properties (no byte-ordering
problem, different flavours of ASCII compatibility) that make them
worthy of consideration, especially for transmission of multilingual
text. Another encoding scheme, MNEM [RFC1345], also has interesting
properties and the capability to transmit the full UCS."
Plus when Dave Beckett and I spoke on the phone the other week, he said
that the Unicode chaps would be happy with a default encoding of DC in
HTML of UTF-8. So what is the default encoding of non-ISO-8859-1
characters in the future I18N version of HTML and why does RFC 2070 seem
to imply that UTF-8 is the way to go?
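
The two properties the RFC 2070 passage mentions (ASCII compatibility and no byte-ordering problem) can be seen in a few lines. This is only a small illustration in Python; the sample strings are arbitrary and nothing here says what the HTML default is or should be.

```python
# Illustration of the two UTF-8 properties quoted from RFC 2070.
# The sample strings are arbitrary choices for this sketch.
ascii_text = "Deutsch 2000"
mixed_text = "Gr\u00fc\u00dfe"  # "Gruesse" written with u-umlaut and sharp s

# ASCII compatibility: pure ASCII text has identical bytes in UTF-8.
ascii_same = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# Non-ASCII characters become multi-byte sequences; ASCII letters are untouched.
utf8_bytes = mixed_text.encode("utf-8")

# No byte-ordering problem: UTF-8 is a single byte stream, whereas a
# 16-bit encoding has big-endian and little-endian forms that differ.
utf16_be = mixed_text.encode("utf-16-be")
utf16_le = mixed_text.encode("utf-16-le")

print(ascii_same)
print(utf8_bytes)
print(utf16_be == utf16_le)
```
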
> Internationalization is about making systems so that people in all
> nations and across nations can use the systems the way they want,
> in particular with their scripts and languages.
Right, so the URI stuff isn't Internationalization. It's localization
where people in specific countries/regions are given more friendly
interfaces to their systems, sometimes at the expense of use by people in
other nations (so that, for example, a Japanese chap can create a
localised URL for a resource of only Japanese interest which his
countrymen can transcribe easily but which I would have great difficulty
using, but wouldn't be interested in most of the time anyway). The
metadata stuff, on the other hand, is I18N: we want to be able to convert
metadata between languages and scripts on the fly for people.
Is that a reasonable summary?
> - That with a default of Klingon, more than 99% of the items are incorrectly tagged.
> - That with a default of English, something between 60% and 80% of the items
> are incorrectly tagged.
> - That with a default of "Unknown", something below 1% of the items are
> incorrectly tagged.
I wouldn't choose Klingon as the default as I'm not aware of any large
community of Klingon metadata users; I want a default that is used by the
majority of metadata users. And that to my mind is Number 2. So I'll
have number 2 please. I definitely don't want option 3 where I will have
to drop everything on the floor if it doesn't have a tag.
> There is a big difference between character encoding information and
> language information. Please answer the following questions:
>
> (1) What language do you think the following is in:
> 1997-02-28:12:18:55.657
International English (no tag to say otherwise).
> (2) What language do you think the following is in:
> Martin T. Heller
International English (no tag to say otherwise).
> (3) Would you be able to answer the above questions if your system
> didn't know whether the mail I sent you was in ASCII or
> in EBCDIC?
Yes, because my default language is International English. Point proven.
Cheers.
> Would you please care to go to your local community public library
> and ask them whether they have books e.g. in French or German?
> And whether their computer system knows which books are in
> foreign languages? And whether their computer system knows
> which foreign books actually have English titles, and which
> English books have foreign (e.g. French) titles? And whether
> their system knows which English books have foreign authors,...
> And please then decide whether you would be able to write a
> little program to convert their data to 100% correctly
> language tagged metadata.
Luckily I know the answer to this already as I help run our library OPAC.
The answer is that we do have books in French and German (and many other
languages). And yes, our computer system does know which
books are in foreign languages. The main language of the work is held in
the 008 field of the MARC record and inside the OPAC's SQL database it
appears in the MONO_008 table. Here's an example that I picked at random
after searching for the word "Deutsch" (this work is "Deutsch
2000: Grammatik der modernen deutschen Umgangssprache"):
--- 8< snip snip 8< ----
1> select * from MONO_008 where WORK_ID = 310908
2> go
WORK_ID PUBLICATION_DATE COUNTRY_OF_PUBLICATION ILLUSTRATION_CODES
INTELLECTUAL_LEVEL PHYSICAL_MEDIUM FORM_OF_PUBLICATION
GOVT_PUBLICATION CONFERENCE_PROCEEDINGS LITERARY_TEXT
BIOGRAPHY_CODE
MAIN_LANGUAGE DATE_ON_FILE
----------- ---------------- ---------------------- ------------------
------------------ --------------- -------------------
----------------
---------------------- ------------- -------------- -------------
--------------------------
310908 s1976 gw NULL
NULL NULL NULL
NULL NULL NULL NULL
ger Feb 23 1994 4:52PM
--- 8< snip snip 8< ----
Note the "ger" in the last row; for an English work this would be "eng".
Using MARC tag 246 the system can tell you if variants of the
title have been catalogued in other languages (in this
case, we only have the German title - I'd guess that's the norm here as I
don't think any of our cat'n'class staff are translators). I don't think
the OPAC records country of origin of authors so I can't do the last one
and the names are all Romanised forms.
I reckon I could (if I had the time and inclination) write a Perl script
that would extract a work from the OPAC's database and write out a DC
encoded file with the appropriate language element and language
qualifiers.
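
A rough sketch of what such a script might do, written in Python rather than Perl so it can be checked easily. Everything beyond the MAIN_LANGUAGE value is an assumption made for illustration: the dc_meta function, the MARC_TO_TAG mapping and the META element layout are hypothetical and are not taken from the OPAC or from any agreed DC-in-HTML syntax.

```python
from html import escape

# Hypothetical mapping from MARC three-letter language codes to
# two-letter tags; only a few entries are shown.
MARC_TO_TAG = {"ger": "de", "eng": "en", "fre": "fr"}


def dc_meta(title, marc_language):
    """Return illustrative DC META lines for one work.

    marc_language is the MAIN_LANGUAGE value from the MONO_008 row,
    or None when the database holds NULL.
    """
    lines = ['<META NAME="DC.title" CONTENT="%s">' % escape(title, quote=True)]
    if marc_language:
        # Unknown codes are passed through unchanged rather than guessed at.
        tag = MARC_TO_TAG.get(marc_language, marc_language)
        lines.append('<META NAME="DC.language" CONTENT="%s">' % tag)
    return lines


# The row from the query above, reduced to the columns used here.
row = {"WORK_ID": 310908, "MAIN_LANGUAGE": "ger"}
for line in dc_meta("Deutsch 2000", row["MAIN_LANGUAGE"]):
    print(line)
```

A work whose MAIN_LANGUAGE is NULL simply gets no language line, which is exactly the untagged case being argued about in this thread.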
> Metadata without language tag will not be discarded. It will be
> searched, displayed, printed, read, and so on. It will be a tad
> less useful than language-tagged metadata, but the difference
> won't be much.
Depends what you want to do with it as we've seen.
> > So it's too much work for people using a non-default language to tag it as
> > such if we have a default defined, but it's not too much work to force
> > everyone to specify the language in use? That doesn't sound right to me
> > at all! What you're saying is that forcing everyone to do more work is
> > easier than forcing a smaller community to do more work.
>
> No. If those in the English community are really sure the only thing
> they deal with is English, then it's peanuts to do the tagging
> automatically. But if they aren't 100% sure, they better tag it
> "unknown" than "English" to stay on the safe side.
But with DC we can't rely on automatic anything. Remember, all metadata
is optional. Everything. The whole lot. If you want to change that,
then be my guest but you might encounter some opposition (not from me
though; I'm getting fed up with this whole circular argument). And if they
tag untagged metadata as "unknown" then they can't make any assumption
about the language and so they can't use it for applications that want to
know what language the metadata is in. I'd rather get some good stuff and
some bad stuff than nothing at all.
> > OK, I've got a question for you now: you're writing an indexing engine
> > that's working in a multilingual environment. You want to be able to let
> > people search in their local language so you want to be able to do some
> > simple translations of keywords to a variety of different natural
> > languages.
>
> You mean a query like "give me the documents in languages A, B, and C
> that contain something about what is called X in English"?
I was thinking more of searching an index for all matches (not just those
in a particular language) where my search term is in language P. The
actual metadata in the resource might not have been in P but I would like
to have the indexing engine translate it into P if it can, so that the end
user can get the maximum number of hits (high recall, low relevance,
standard robot indexing style). If the untagged metadata is tagged as
"unknown" I can't translate it into language P. If we assume a default of
English I can (modulo size/quality of translation dictionaries).
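
The difference between the two policies can be sketched with a toy indexer. The names here (DICTIONARIES, index_terms, default_lang) and the one-word dictionaries are made up for illustration; a real engine would need far larger dictionaries and proper language tags.

```python
# Toy translation dictionaries keyed by (source, target) language.
# Contents are illustrative only.
DICTIONARIES = {
    ("en", "de"): {"library": "Bibliothek"},
    ("de", "en"): {"Bibliothek": "library"},
}


def index_terms(keyword, lang, target, default_lang=None):
    """Return the set of terms to index for one metadata keyword.

    lang is the language tag on the metadata, or None if untagged.
    default_lang is the assumed language for untagged metadata;
    leaving it as None models the "unknown" policy.
    """
    terms = {keyword}
    source = lang or default_lang
    if source is None or source == target:
        # Nothing to translate from, or already in the target language.
        return terms
    translated = DICTIONARIES.get((source, target), {}).get(keyword)
    if translated:
        terms.add(translated)
    return terms


# Untagged keyword under the two policies, searching in German.
print(index_terms("library", None, "de"))
print(index_terms("library", None, "de", default_lang="en"))
```

With the "unknown" policy the untagged keyword is indexed only as written. With a default of English it also gets a German entry, and an untagged keyword that was not really English just fails the dictionary lookup and is indexed as written.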
Tatty bye,
Jim'll
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Jon "Jim'll" Knight, Researcher, Sysop and General Dogsbody, Dept. Computer
Studies, Loughborough University of Technology, Leics., ENGLAND. LE11 3TU.
* I've found I now dream in Perl. More worryingly, I enjoy those dreams. *