I agree with MM on two of these.
Alpha acute is a different character from alpha grave because it has
a different semantic meaning (the tone is different, not simply the
position, as is the case with e.g. terminal sigma); it is not merely a
different glyph. The only significant glyph differences in Greek that
I can see enshrined in Unicode are the various sigmas (terminal sigma
is a whole issue that Nick has written on well, though I do think that
at least the case of abbreviations simply reveals an ambiguity in our
use of the point rather than a semantic distinction between terminal
sigma and medial sigma. But that's a probably deservedly
idiosyncratic position, and Nick's guidance should be followed there.
I think the lunate sigma is probably a glyph variant: it is about
typography and not text, and if one really wants a lunate sigma one
should use a font that assigns lunate glyphs to both the terminal and
medial sigma code points; but there may be a distinct usage of the
lunate sigma that I'm not aware of or forgetting.)
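To make the distinction concrete, here is a minimal Python sketch using only the standard unicodedata module. It shows that alpha acute and alpha grave are separate code points, and that they (and the sigma variants) can still be folded together when all you want is a search key. The function names are my own, and folding lunate sigma into medial sigma is an assumption that follows the "glyph variant" view above, not a rule from the standard.

```python
import unicodedata

ALPHA_ACUTE = "\u1f71"  # GREEK SMALL LETTER ALPHA WITH OXIA
ALPHA_GRAVE = "\u1f70"  # GREEK SMALL LETTER ALPHA WITH VARIA


def strip_diacritics(text):
    """Decompose to NFD and drop the combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def fold_sigma(text):
    """Map final (U+03C2) and lunate (U+03F2) sigma to medial sigma (U+03C3)."""
    return text.replace("\u03c2", "\u03c3").replace("\u03f2", "\u03c3")


def search_key(text):
    """Case-, diacritic- and sigma-insensitive key, for matching only."""
    return fold_sigma(strip_diacritics(text.lower()))
```

The stored text keeps its accents and its terminal sigmas; only the key used for comparison throws them away.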
Morpheus should eventually be rewritten to handle Unicode as well as
betacode, unless it is so much harder that the kludge of a transcoder
at each end is more acceptable.
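For what it's worth, the "transcoder at each end" kludge might look roughly like the Python sketch below. It is an illustration only: it covers just the 24 basic letters, the seven common diacritic symbols, and the * capital marker, it ignores the # and % escapes entirely, and the function names are invented here and have nothing to do with Morpheus itself.

```python
import re
import unicodedata

# ASCII betacode letters and their Greek equivalents.
BETA_LETTERS = dict(zip("abgdezhqiklmncoprstufxyw", "αβγδεζηθικλμνξοπρστυφχψω"))
# Betacode diacritic symbols mapped to Unicode combining marks.
BETA_MARKS = {
    ")": "\u0313",   # smooth breathing
    "(": "\u0314",   # rough breathing
    "/": "\u0301",   # acute
    "\\": "\u0300",  # grave
    "=": "\u0342",   # circumflex (perispomeni)
    "|": "\u0345",   # iota subscript
    "+": "\u0308",   # diaeresis
}
GREEK_LETTERS = {g: b for b, g in BETA_LETTERS.items()}
GREEK_LETTERS["ς"] = "s"  # final sigma goes back to plain s
GREEK_MARKS = {m: b for b, m in BETA_MARKS.items()}


def beta_to_unicode(beta):
    """Convert a betacode string to NFC Unicode (letters and diacritics only)."""
    out, pending, upper = [], [], False
    for ch in beta.lower():
        if ch == "*":
            upper = True  # next letter is a capital
        elif ch in BETA_MARKS:
            # After '*', diacritics precede their letter, so hold them back.
            (pending if upper else out).append(BETA_MARKS[ch])
        elif ch in BETA_LETTERS:
            letter = BETA_LETTERS[ch]
            out.append(letter.upper() if upper else letter)
            out.extend(pending)
            pending, upper = [], False
        else:
            out.append(ch)  # spaces, punctuation, unhandled escapes
    # A sigma not followed by another word character is word-final.
    text = re.sub(r"σ(?!\w)", "ς", "".join(out))
    return unicodedata.normalize("NFC", text)


def unicode_to_beta(text):
    """Convert Unicode Greek back to betacode (the inverse of the above)."""
    chars = unicodedata.normalize("NFD", text)
    out, i = [], 0
    while i < len(chars):
        ch = chars[i]
        base = GREEK_LETTERS.get(ch.lower())
        if base is None:
            out.append(GREEK_MARKS.get(ch, ch))
            i += 1
            continue
        # Collect the combining marks that belong to this letter.
        j, marks = i + 1, ""
        while j < len(chars) and chars[j] in GREEK_MARKS:
            marks += GREEK_MARKS[chars[j]]
            j += 1
        out.append("*" + marks + base if ch.isupper() else base + marks)
        i = j
    return "".join(out)
```

With something like this, unicode_to_beta() would sit in front of Morpheus and beta_to_unicode() behind it. Whether that is really less work than teaching Morpheus Unicode is exactly the open question.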
As for characters that exist in beta code and not in Unicode, this is
another issue entirely. A number of people either on this list or on
the [log in to unmask] list have been working on this, but it will
take a while to get the code points assigned. Some of the
"characters" are markup and should not be assigned Unicode code
points. Some have semantic differences from existing characters which
the Unicode Consortium (rightly or wrongly) considers insignificant
(take e.g. the acrophonic numerals with the same letterforms as
letters: is it really fair to expect those to be encoded differently
when the original users may have considered them to be the same
characters?). Some are very rare characters which may be variants of
existing characters and would be best represented with the existing
character and distinguishing markup. And some are idiosyncratic
characters which perhaps should not be encoded at all.
[For example, I don't think the Phaistos Disk should be encoded
because it is a "script" with only one exemplar, and could be
anything - a board game, some kind of calendrical or logistical tool,
an exercise in creating "signs" by someone who had seen e.g.
hieroglyphs but didn't want to actually invent a script - maybe even
an ancient fraud pretending to be an Egyptian item created by someone
who only knew of hieroglyphs by report.]
I would recommend looking through http://www.tlg.uci.edu/quickbeta.pdf
and determining if any information would be lost by
converting your corpus to Unicode. If so, is it the sort of
information that really ought to be in markup? Is it something that
can be indicated by a combination of existing characters that would
be typographically indistinguishable from existing characters (even
if only with the use of markup) and which one could reasonably
explain as an abbreviation rather than as an alphabetic/syllabic
character, diacritic, logogram, or ideogram? Is it only one or two
occurrences that you could reasonably indicate with a markup element
that should be replaced with a graphic (I'm thinking of some of those
idiosyncratic coronis examples)? In all of these cases, you're going
to have a hard time displaying characters that are in betacode but
not in Unicode anyway, so if you're primarily interested in
displaying texts for reading and not in using your corpus for
analysis, you'll have to deal with this problem anyway.
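One rough way to start that audit is to tally the escape codes in your corpus and see which ones you have no Unicode mapping for yet. The sketch below assumes the escapes of interest are a # or % followed by optional digits, and it leaves the set of already-mapped codes to the caller; which codes belong in that set is something to work out from quickbeta.pdf, not something this snippet knows. The name unmapped_escapes is hypothetical.

```python
import re
from collections import Counter

# '#' or '%' followed by zero or more digits (assumed escape syntax).
ESCAPE = re.compile(r"[#%]\d*")


def unmapped_escapes(beta_text, mapped_codes):
    """Count escape codes in beta_text that are missing from mapped_codes."""
    counts = Counter(ESCAPE.findall(beta_text))
    return {code: n for code, n in counts.items() if code not in mapped_codes}
```

A code that turns up once or twice is a candidate for markup plus a graphic; a code that turns up hundreds of times is one you really do have to resolve before converting.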
So the final answer I'd propose is this: if you have a corpus that
requires the distinction of betacode characters that cannot be
resolved into an unambiguous Unicode character sequence, use
betacode for storage and Unicode for display purposes and web
services (and make sure you come up with SOME way of normalizing the
betacode into something a non-betacode savvy user can understand),
and be prepared to migrate to Unicode as your backend when it becomes
feasible. If it doesn't become feasible, either a.) you are thinking
about your characters in the wrong way and Unicode is right not to
support the characters you want, b.) you have noticed a real gap in
Unicode's coverage (missing precomposed characters that can be
represented with a spacing character and a combining diacritical
don't count) and should talk to someone like Deborah Anderson, Nick
Nicholas, or Michael Everson about it, or c.) it's in Unicode, but
you need to find someone to champion it with the OS vendors or the
typographers.
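On the point that missing precomposed characters don't count as a gap: alpha with both a macron and an acute has no single code point, yet it is perfectly representable as a sequence. A small Python sketch (the helper names are mine; the results depend on the Unicode version bundled with your Python):

```python
import unicodedata


def compose(text):
    """Normalize to NFC, composing wherever a precomposed form exists."""
    return unicodedata.normalize("NFC", text)


def unnamed_chars(text):
    """Return characters with no name in Python's Unicode database:
    private-use, unassigned, and control characters."""
    return [ch for ch in text if unicodedata.name(ch, None) is None]


# alpha + combining macron + combining acute
alpha_macron_acute = compose("\u03b1\u0304\u0301")
```

Here alpha_macron_acute comes out as two code points, U+1FB1 (alpha with macron) followed by U+0301, and unnamed_chars() returns an empty list for it, so nothing is missing from Unicode. If it renders badly, that is a font or rendering problem (case c), not an encoding gap (case b).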
If you have a corpus that uses only characters that are already
encoded in Unicode and well-supported by the three major environments
(Windows XP+, OS X+, Unix/Linux distributions newer than 2002) - well
supported meaning that the standard distribution comes with at least
one font that can represent each character legibly, if not
beautifully - there's really no reason not to use Unicode.
Patrick Rourke
On Aug 27, 2005, at 7:00 PM, DIGITALCLASSICIST automatic digest
system wrote:
> #1 isn't really quite accurate. It is true that alpha+acute and
> alpha+grave are separate Unicode characters. But it is also the case that
> the different alphas are or can be bundled for search procedures.
> Thus on the Macintosh character palette, an alpha with any accent
> will bring up all the other alphas. And Java lets you search on a
> case/diacritic insensitive basis.
>
> #2 is true. What would it take to rewrite morpheus to accept Unicode
> or write a preprocessing routine that converts Unicode to betacode
> when you want to feed morpheus?
>
> #3 is also the case. But it is theoretically and practically possible
> to generate appropriate Unicode sequences.