Hi Ryan:
We have been using TreeTagger to pretty good effect for Greek and
Latin and should try RFTagger sometime soon.
You do need a good-sized disambiguated corpus, a little larger than
the usually quoted 40,000 words. A good-sized decision tree starts to
form at about twice that size. See perseus.uchicago.edu/about.html for
more information and links to abstracts and such.
We have done quite a bit of manual disambiguation ourselves, but we
have also borrowed disambiguated corpora from elsewhere to train the
tagger.
On Jun 1, 2010, at 1:47 PM, Ryan Baumann wrote:
> A while back I looked for resources for lemmatizing Latin and Ancient
> Greek, and came up with very little beyond the Morpheus lemma
> dictionaries. Note that you can get the XML dictionaries from the
> Perseus Hopper source code downloads at
> http://sourceforge.net/projects/perseus-hopper/files/ (once
> extracted, they are at e.g. sgml/reading/greek.morph.xml). For the
> Carolingian Canon Law search engine
> (http://www.stoa.org:8080/cclxtf/search) I used an XSL transform that
> would convert the Morpheus XML to XTF's lemma map text format (source
> available at http://halsted.vis.uky.edu/gitweb?p=cclxtf.git).
>
> There is also the Schinke Latin stemmer
> (http://snowball.tartarus.org/otherapps/schinke/intro.html). Instead
> of using a dictionary for lemmatization, this is a fixed algorithmic
> stemming approach based on assumptions about the language itself. An
> introduction to some of the differences between the various
> approaches can be found in the thesis "Development of a Stemmer for
> the Greek Language"
> (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.99.623), or
> by searching for "stemming and lemmatization". One suggestion for
> stemming that I came across was to train Egothor, a trainable
> stemmer, on your source language, though I'm not sure how much work
> this would require to get good results:
> http://www.egothor.org/book/bk01ch01s06.html. There is also
> MorphAdorner (http://morphadorner.northwestern.edu/) but I think as-is
> it can only recognize Latin (not lemmatize it). Again for Latin there
> is also Collatinus (http://www.collatinus.org/) aimed at lemmatizing
> into a dictionary for definitions. You could use its dictionary to do
> Latin-only lemmatization, but I don't think it's as extensive as the
> Morpheus dictionaries.
>
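A toy Python illustration of the dictionary-versus-stemmer difference described above. The form table and the suffix list are made-up assumptions for the sketch; in particular, the suffix rules are NOT the Schinke rule set, and a real table would be derived from something like the Morpheus dictionaries.

```python
# Toy form-to-lemma table (illustrative entries only; a real one would
# be generated from a morphological dictionary).
FORM_TO_LEMMA = {
    "amat": "amo",
    "amant": "amo",
    "tulit": "fero",
    "latum": "fero",
}

# Hypothetical suffix list, checked in order. This is NOT the Schinke
# rule set, just enough to show the idea of rule-based stripping.
SUFFIXES = ["ant", "um", "at", "it"]


def lemmatize(form):
    """Dictionary lookup: returns None for forms the table lacks."""
    return FORM_TO_LEMMA.get(form.lower())


def stem(form):
    """Strip the first matching suffix, keeping at least two letters."""
    word = form.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word
```

With this toy data, the dictionary maps both "tulit" and "latum" to "fero", while the stemmer produces the unrelated stems "tul" and "lat". On the other hand, the stemmer still returns something ("amab") for "amabat", which the dictionary does not contain.
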
> Another important thing to keep in mind is that many of these
> approaches are context-free (i.e. some forms are ambiguous and,
> depending on context, should be lemmatized to one lemma or
> another). One solution (which sounds, to me, maddening) is to
> run a first pass of automated lemmatization and then manually
> disambiguate any ambiguous results and store that as metadata. Perhaps
> one could use some of the Perseus Treebank data
> (http://nlp.perseus.tufts.edu/syntax/treebank/) to improve
> lemmatization based on context automatically?
>
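The two-pass workflow mentioned above (automated first pass, then manual disambiguation of whatever is left) could be sketched roughly like this in Python. The candidate table and the function name are hypothetical, invented for the illustration; a real table would list every lemma a form can belong to.

```python
# Hypothetical candidate table: each form maps to every lemma it could
# belong to. "canis" is ambiguous: the noun canis, or a form of cano.
CANDIDATES = {
    "puella": ["puella"],
    "canis": ["canis", "cano"],
    "latrat": ["latro"],
}


def first_pass(tokens):
    """Resolve unambiguous tokens; queue everything else for review."""
    resolved = []  # (position, token, lemma)
    review = []    # (position, token, candidate lemmas)
    for position, token in enumerate(tokens):
        candidates = CANDIDATES.get(token.lower(), [])
        if len(candidates) == 1:
            resolved.append((position, token, candidates[0]))
        else:
            # Ambiguous (several candidates) or unknown (none).
            review.append((position, token, candidates))
    return resolved, review
```

Only the entries in the review list need a human decision (or a context-aware tagger), and the decisions can be stored alongside the text as metadata keyed by token position.
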
> Really I've found it somewhat shocking that, given the number of
> ancient texts we have (and digital searchable corpora), there is so
> little easily available information on techniques for doing lemmatized
> searching of them. Though I would love to have missed something
> obvious!
>
> Cheers,
> -Ryan
>
> On Tue, Jun 1, 2010 at 12:36 PM, Hugh Cayless wrote:
>> Friends,
>>
>> I'm deep into rebuilding papyri.info from the ground up, and I've
>> gotten to the part where I need lemmatized search (i.e., you put in
>> a lemma and the search engine finds documents with all the forms of
>> that word in the corpus). I have a database constructed by doing
>> lookups on the Archimedes Project Morphology Service using words
>> from the papyri.info search index, but I find myself in the
>> position of wondering again whether something better has become
>> available in the year since I last did this.
>>
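For concreteness, lemmatized search in the sense described above (a lemma goes in, documents containing any of its forms come out) amounts to query expansion over an ordinary inverted index. A minimal Python sketch, assuming a lemma-to-forms table is already available from some morphology source; the function names and the whitespace tokenization are assumptions made for the example, not how papyri.info or any particular search engine does it.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased, whitespace-separated token to document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search_lemma(lemma, lemma_to_forms, index):
    """Expand a lemma to its known forms and collect matching documents."""
    hits = set()
    # Fall back to the lemma itself if no forms are recorded for it.
    for form in lemma_to_forms.get(lemma, [lemma]):
        hits |= index.get(form, set())
    return sorted(hits)
```

Usage would look like `search_lemma("fero", {"fero": ["fero", "tulit", "latum"]}, build_index(docs))`, which returns the ids of documents containing any of the three forms.
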
>> I know the PhiloLogic folks have been enhancing its morphological
>> search capabilities for the Perseus texts, for example. What's out
>> there that might help me?
>>
>> Thanks for any help!
>>
>> Hugh
>>