Hi again!

The morphological database lives a standoff existence :-) (see the
poster at http://cybergreek.uchicago.edu/implementingposter.pdf).
However, for Gabby Bodard's upcoming classical THATcamp we aim to
make XML files with word tags available.
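Something along these lines, say (illustrative markup only; the
final element and attribute names may well differ):

  <s n="1">
    <w n="1" lemma="μῆνις" pos="n-s---fa-">Μῆνιν</w>
    <w n="2" lemma="ἀείδω" pos="v2spma---">ἄειδε</w>
  </s>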

We currently have the Iliad, the Odyssey, Hesiod (all based on Martin
Mueller's work), and 50K+ words of classical-period Greek. We will
push on integrating the treebanked Aeschylus soon, but even when
texts have the same proximate digital 'parent', the challenge of
aligning them and making them consistent with our existing data is
substantial. Still, it's probably less work than doing the whole
Aeschylus by hand, again :-) Adding Aeschylus will get us over 300K
words, but more importantly, having a big chunk of tragedy in the
training set will make the tagger perform better on the rest of the
drama collection.

We no longer use the Greek Bible for our training data or try to
incorporate it in our perseus.uchicago.edu collection. The
disambiguated text was not the Perseus text, and it was going to be
too much work to get it up to the same consistency as the epic
sample or our own disambiguated text. Besides, there are apparently
legal issues with morphological NTs in cyberspace. So, no trace of
it in our corpus any longer, which mostly means that we are
currently weak on biblical names.

The Latin treebank sample, which has not grown since last year, is
precisely what we used for our training data. I have a cleaned-up,
TreeTagger-formatted version of it. Somewhere. :-) As with the Greek
Bible, we did not align the Latin, so even for chunks that were part
of the training data you will simply see a probability score when
you look at the text on our system.
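For those who have not seen it, TreeTagger's training input is just
a one-token-per-line, tab-separated file of word and tag, with a
separate lexicon of word/tag/lemma entries. A hypothetical snippet
(the tags here are invented for illustration, not our actual tagset):

  arma    NOUN.NOM.PL
  virum   NOUN.ACC.SG
  que     CONJ
  cano    VERB.1SG.PRES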

All best,
Helma

On Jun 2, 2010, at 4:10 PM, Ryan Baumann wrote:

> Are the manually disambiguated New Testament or other Greek/Latin
> corpora you used (or created) for training TreeTagger available for
> download somewhere? The Perseus Treebanks give 192k Greek words
> (though 71% is Homer, which you note doesn't extend well to
> classical Greek) and 53k Latin words that could presumably be used
> in NLP software such as TreeTagger, but the more the merrier (esp.
> if it takes ~80k words to start getting good results).
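> 
> For the conversion, something along these lines ought to do,
> assuming the treebank's <word> elements carry form and postag
> attributes (my reading of the XML; file names are made up):
> 
>   import xml.etree.ElementTree as ET
> 
>   # Convert a Perseus treebank file to TreeTagger's
>   # one-token-per-line training format (word TAB tag).
>   tree = ET.parse("treebank.xml")
>   with open("train.txt", "w", encoding="utf-8") as out:
>       for w in tree.iter("word"):
>           form, tag = w.get("form"), w.get("postag")
>           if form and tag:
>               out.write(f"{form}\t{tag}\n")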
>
> Thanks,
> -Ryan
>
> On Tue, Jun 1, 2010 at 3:05 PM, Helma Dik <[log in to unmask]>  
> wrote:
>> Hi Ryan:
>>
>> We have been using TreeTagger to pretty good effect for Greek and
>> Latin and should try RFTagger sometime soon.
>> You do need a good-sized disambiguated corpus, a little larger than
>> the usually quoted 40,000 words. A good-sized decision tree starts
>> to form at about twice that size. See perseus.uchicago.edu/about.html
>> for more information and links to abstracts and such.
>> We have done quite a bit of manual disambiguation ourselves, but we
>> have also borrowed disambiguated corpora from elsewhere to train
>> the tagger.
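>>
>> If you want to try it yourself, training is a single call, roughly:
>>
>>   train-tree-tagger lexicon.txt open-class-tags.txt train.txt greek.par
>>
>> where lexicon.txt lists word/tag/lemma entries, open-class-tags.txt
>> the tags unknown words may receive, train.txt the tagged corpus,
>> and greek.par the resulting parameter file (file names invented;
>> see the TreeTagger documentation for the details).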
>>
>>
>>
>> On Jun 1, 2010, at 1:47 PM, Ryan Baumann wrote:
>>
>>> A while back I looked for resources for lemmatizing Latin and
>>> Ancient Greek, and came up with very little beyond the Morpheus
>>> lemma dictionaries. It's important to note that you can get the
>>> XML dictionaries from the Perseus Hopper source code downloads at
>>> http://sourceforge.net/projects/perseus-hopper/files/ (when you
>>> extract them they will be at e.g. sgml/reading/greek.morph.xml).
>>> For the Carolingian Canon Law search engine
>>> (http://www.stoa.org:8080/cclxtf/search) I used an XSL transform
>>> that would convert the Morpheus XML to XTF's lemma map text format
>>> (source available at http://halsted.vis.uky.edu/gitweb?p=cclxtf.git).
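>>>
>>> The transform itself is simple; in Python it would be something
>>> like the following (element names are from memory, so check them
>>> against the actual file, and the output here is a generic
>>> form-to-lemma map rather than XTF's exact syntax):
>>>
>>>   import xml.etree.ElementTree as ET
>>>   from collections import defaultdict
>>>
>>>   # Build a form -> lemmas map from the Morpheus analyses,
>>>   # streaming so the large file stays out of memory.
>>>   lemmas = defaultdict(set)
>>>   for _, el in ET.iterparse("greek.morph.xml"):
>>>       if el.tag == "analysis":
>>>           form, lemma = el.findtext("form"), el.findtext("lemma")
>>>           if form and lemma:
>>>               lemmas[form].add(lemma)
>>>           el.clear()
>>>
>>>   with open("lemma-map.txt", "w", encoding="utf-8") as out:
>>>       for form, ls in sorted(lemmas.items()):
>>>           out.write(form + "\t" + "\t".join(sorted(ls)) + "\n")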
>>>
>>> There is also the Schinke Latin stemmer
>>> (http://snowball.tartarus.org/otherapps/schinke/intro.html). Instead
>>> of using a dictionary for lemmatization, this is a fixed algorithmic
>>> stemming approach based on assumptions about the language itself.
>>> An introduction to some of the differences between the various
>>> approaches can be found in the thesis "Development of a Stemmer
>>> for the Greek Language"
>>> (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.99.623), or
>>> just search for "stemming and lemmatization". One suggestion for
>>> stemming that I came across was to train the trainable stemmer
>>> Egothor on your source language, though I'm not sure how much work
>>> this would require to get good results:
>>> http://www.egothor.org/book/bk01ch01s06.html. There is also
>>> MorphAdorner (http://morphadorner.northwestern.edu/), but I think
>>> as-is it can only recognize Latin (not lemmatize it). Again for
>>> Latin there is also Collatinus (http://www.collatinus.org/), aimed
>>> at lemmatizing into a dictionary for definitions. You could use its
>>> dictionary to do Latin-only lemmatization, but I don't think it's
>>> as extensive as the Morpheus dictionaries.
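>>>
>>> To give a flavor of the algorithmic approach, here is a heavily
>>> simplified, noun-only sketch of the Schinke idea (the suffix and
>>> -que exception lists are abbreviated, and the real algorithm also
>>> produces verb stems; see the Snowball page for the full version):
>>>
>>>   def latin_noun_stem(word):
>>>       # Strip enclitic -que unless the word is a fixed form, then
>>>       # strip the longest matching noun ending (longest first).
>>>       if word.endswith("que") and word not in {"atque", "quoque",
>>>                                                "neque", "itaque"}:
>>>           word = word[:-3]
>>>       for suffix in ("ibus", "ius", "ae", "am", "as", "em", "es",
>>>                      "ia", "is", "nt", "os", "ud", "um", "us",
>>>                      "a", "e", "i", "o", "u"):
>>>           if word.endswith(suffix) and len(word) - len(suffix) >= 2:
>>>               return word[:-len(suffix)]
>>>       return word
>>>
>>>   print(latin_noun_stem("virumque"))  # -> vir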
>>>
>>> Another important thing to keep in mind is that many of these
>>> approaches are context-free (i.e., some words are ambiguous and,
>>> depending on context, could more accurately be lemmatized to one
>>> lemma or another). One solution (which sounds, to me, maddening)
>>> is to run a first pass of automated lemmatization and then manually
>>> disambiguate any ambiguous results and store that as metadata.
>>> Perhaps one could use some of the Perseus Treebank data
>>> (http://nlp.perseus.tufts.edu/syntax/treebank/) to improve
>>> lemmatization based on context automatically?
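>>>
>>> Concretely, the first pass could be as simple as this (the lookup
>>> table is a made-up stand-in for a real Morpheus-derived dictionary):
>>>
>>>   # Flag tokens whose lemma cannot be decided without context.
>>>   FORMS = {
>>>       "est": ["sum", "edo"],   # ambiguous: 'to be' vs 'to eat'
>>>       "arma": ["arma"],
>>>   }
>>>
>>>   def lemmatize(token):
>>>       candidates = FORMS.get(token, [])
>>>       if len(candidates) == 1:
>>>           return candidates[0], False   # unambiguous
>>>       return candidates, True           # needs manual review
>>>
>>>   print(lemmatize("est"))   # -> (['sum', 'edo'], True)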
>>>
>>> Really, I've found it somewhat shocking that, given the number of
>>> ancient texts we have (and digital, searchable corpora), there is
>>> so little easily available information on techniques for doing
>>> lemmatized searching of them. Though I would love to have missed
>>> something obvious!
>>>
>>> Cheers,
>>> -Ryan
>>>
>>> On Tue, Jun 1, 2010 at 12:36 PM, Hugh Cayless  
>>> <[log in to unmask]>
>>> wrote:
>>>>
>>>> Friends,
>>>>
>>>> I'm deep into rebuilding papyri.info from the ground up, and I've
>>>> gotten to the part where I need lemmatized search (i.e., you put
>>>> in a lemma and the search engine finds documents with all the
>>>> forms of that word in the corpus).  I have a database constructed
>>>> by doing lookups on the Archimedes Project Morphology Service
>>>> using words from the papyri.info search index, but I find myself
>>>> wondering again whether something better has become available in
>>>> the year since I last did this.
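>>>>
>>>> Schematically, what I have in mind: invert the form-to-lemma
>>>> database and expand a lemma query to every attested form, which
>>>> the search engine then ORs together (names and data made up):
>>>>
>>>>   from collections import defaultdict
>>>>
>>>>   # form -> lemmas, e.g. from the Archimedes lookups
>>>>   form_to_lemmas = {"amo": ["amo"], "amat": ["amo"],
>>>>                     "amavit": ["amo"]}
>>>>
>>>>   lemma_to_forms = defaultdict(set)
>>>>   for form, lemmas in form_to_lemmas.items():
>>>>       for lemma in lemmas:
>>>>           lemma_to_forms[lemma].add(form)
>>>>
>>>>   def expand_query(lemma):
>>>>       return sorted(lemma_to_forms.get(lemma, {lemma}))
>>>>
>>>>   print(expand_query("amo"))  # -> ['amat', 'amavit', 'amo']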
>>>>
>>>> I know the PhiloLogic folks have been enhancing its morphological
>>>> search capabilities for the Perseus texts, for example.  What's
>>>> out there that might help me?
>>>>
>>>> Thanks for any help!
>>>>
>>>> Hugh
>>>>
>>