Are the manually disambiguated New Testament or other Greek/Latin
corpora you used (or created) for training TreeTagger available for
download somewhere? The Perseus Treebanks give 192k Greek (though 71%
Homer which you note doesn't extend well to classical Greek) and 53k
Latin words that could presumably be used in NLP software such as
TreeTagger, but the more the merrier (esp. if it takes ~80k to start
getting good results).
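
Incidentally, if one did reuse the treebank files for this, taggers like
TreeTagger want their training data as one token per line with the tag (and
lexicon entries with a lemma) in tab-separated columns -- the exact file
formats are in TreeTagger's README. A rough sketch of the conversion using
Python's standard XML parser; the element and attribute names
(sentence/word, form/postag/lemma) follow the published treebank releases,
and the two-word sample is invented:

```python
# Sketch: flatten Perseus/AGDT-style treebank XML into tab-separated
# token/tag/lemma lines, which can then be split into the training
# corpus and lexicon files a tagger expects.
import io
import xml.etree.ElementTree as ET

# Invented two-word sample in the treebank's general shape.
sample = """<treebank>
  <sentence id="1">
    <word id="1" form="arma" lemma="arma" postag="n-p---na-"/>
    <word id="2" form="virumque" lemma="vir" postag="n-s---ma-"/>
  </sentence>
</treebank>"""

def treebank_to_training_lines(xml_source):
    """Return one 'form<TAB>postag<TAB>lemma' line per <word> element."""
    rows = []
    for word in ET.parse(xml_source).iter("word"):
        form = word.get("form")
        tag = word.get("postag")
        lemma = word.get("lemma") or form  # fall back to the surface form
        if form and tag:
            rows.append(f"{form}\t{tag}\t{lemma}")
    return rows

lines = treebank_to_training_lines(io.StringIO(sample))
print("\n".join(lines))
```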

Thanks,
-Ryan

On Tue, Jun 1, 2010 at 3:05 PM, Helma Dik <[log in to unmask]> wrote:
> Hi Ryan:
>
> We have been using TreeTagger to pretty good effect for Greek and Latin and
> should try RFTagger sometime soon.
> You do need a good-sized disambiguated corpus, a little larger than the
> usually quoted 40,000 words. A good-sized decision tree starts to form at about
> twice that size. See perseus.uchicago.edu/about.html for more information
> and links to abstracts and such.
> We have done quite a bit of manual disambiguation ourselves, but we have
> also borrowed disambiguated corpora from elsewhere to train the tagger.
>
>
>
> On Jun 1, 2010, at 1:47 PM, Ryan Baumann wrote:
>
>> A while back I looked for resources for lemmatizing Latin and Ancient
>> Greek, and came up with very little beyond the Morpheus lemma
>> dictionaries. It's important to note that you can get the XML
>> dictionaries from the Perseus Hopper source code downloads at
>> http://sourceforge.net/projects/perseus-hopper/files/ (once extracted,
>> they will be at, e.g., sgml/reading/greek.morph.xml). For the
>> Carolingian Canon Law search engine
>> (http://www.stoa.org:8080/cclxtf/search) I used an XSL transform that
>> would convert the Morpheus XML to XTF's lemma map text format (source
>> available at http://halsted.vis.uky.edu/gitweb?p=cclxtf.git).
>>
>> There is also the Schinke Latin stemmer
>> (http://snowball.tartarus.org/otherapps/schinke/intro.html). Instead
>> of using a dictionary for lemmatization, this is a fixed algorithmic
>> stemming approach based on assumptions about the language itself. An
>> introduction to some of the differences between the various approaches
>> can be found in the thesis "Development of a Stemmer for the Greek
>> Language"
>> (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.99.623), or
>> by simply searching for "stemming and lemmatization". One suggestion
>> for stemming that I came across was to train the trainable stemmer
>> Egothor on your source language, though I'm not sure how much work
>> this would require to get good results:
>> http://www.egothor.org/book/bk01ch01s06.html. There is also
>> MorphAdorner (http://morphadorner.northwestern.edu/) but I think as-is
>> it can only recognize Latin (not lemmatize it). Again for Latin, there
>> is also Collatinus (http://www.collatinus.org/), which is aimed at
>> lemmatizing words for dictionary-definition lookup. You could use its
>> dictionary to do
>> Latin-only lemmatization, but I don't think it's as extensive as the
>> Morpheus dictionaries.
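
[To make the fixed-algorithm idea concrete, here is a much-simplified
suffix-stripping sketch in the spirit of the Schinke stemmer. The suffix
list is abbreviated, the length check stands in for the real exception
lists, and the actual algorithm runs separate noun and verb passes -- so
treat this purely as illustration:]

```python
# Much-simplified Latin suffix stripping in the spirit of the Schinke
# stemmer: drop enclitic -que, then the longest matching ending.
# Abbreviated suffix list; the length guard is a crude stand-in for the
# real algorithm's exception lists.
SUFFIXES = ["ibus", "ius", "ae", "am", "as", "em", "es", "is",
            "os", "um", "us", "a", "e", "i", "o", "u"]

def stem(word):
    if word.endswith("que") and len(word) > 5:
        word = word[:-3]
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        # Keep at least a two-character stem, as the real stemmer does.
        if word.endswith(suf) and len(word) - len(suf) >= 2:
            return word[: -len(suf)]
    return word
```

A dictionary-based lemmatizer would map "virumque" to the lemma "vir";
this stemmer only reduces it to a shared stem, which is often enough for
search but never yields a citation form.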
>>
>> Another important thing to keep in mind is that many of these
>> approaches are context-free (i.e., some forms are ambiguous, and only
>> context determines which of several candidate lemmas is correct). One
>> solution (which sounds, to me, maddening) is to
>> run a first pass of automated lemmatization and then manually
>> disambiguate any ambiguous results and store that as metadata. Perhaps
>> one could use some of the Perseus Treebank data
>> (http://nlp.perseus.tufts.edu/syntax/treebank/) to improve
>> lemmatization based on context automatically?
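
[A crude first step in that direction would be to use the treebank's
(form, lemma) frequencies to break ties -- not context-sensitive at all,
but a baseline to beat. The counts below are invented for illustration;
real numbers would come from parsing the treebank files, and the numbered
lemma labels merely imitate the Morpheus style of distinguishing
homographs:]

```python
# Sketch: majority-vote lemma disambiguation from treebank counts.
# The (form, lemma) pairs are invented; the cum1/cum2 labels imitate
# numbered homograph lemmas without claiming the real numbering.
from collections import Counter

observed = [
    ("legi", "lego"), ("legi", "lego"), ("legi", "lex"),
    ("cum", "cum2"), ("cum", "cum2"), ("cum", "cum1"),
]
counts = Counter(observed)

def most_frequent_lemma(form):
    """Return the lemma seen most often with this form, or None."""
    candidates = {lemma: n for (f, lemma), n in counts.items() if f == form}
    return max(candidates, key=candidates.get) if candidates else None
```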
>>
>> Really I've found it somewhat shocking that, given the number of
>> ancient texts we have (and digital searchable corpora), there is so
>> little easily available information on techniques for doing lemmatized
>> searching of them. Though I would love to have missed something
>> obvious!
>>
>> Cheers,
>> -Ryan
>>
>> On Tue, Jun 1, 2010 at 12:36 PM, Hugh Cayless <[log in to unmask]>
>> wrote:
>>>
>>> Friends,
>>>
>>> I'm deep into rebuilding papyri.info from the ground up, and I've gotten
>>> to the part where I need lemmatized search (i.e., you put in a lemma and the
>>> search engine finds documents with all the forms of that word in the
>>> corpus).  I have a database constructed by doing lookups on the Archimedes
>>> Project Morphology Service using words from the papyri.info search index,
>>> but I find myself in the position of wondering again whether something
>>> better has become available in the year since I last did this.
>>>
>>> I know the PhiloLogic folks have been enhancing its morphological search
>>> capabilities for the Perseus texts, for example.  What's out there that
>>> might help me?
>>>
>>> Thanks for any help!
>>>
>>> Hugh
>>>
>