On 12/01/2016 15:14, Gabriel BODARD wrote:
> I'd be interested to hear what you end up doing, Paolo (and for that
> matter--what some of the other possibilities on offer are...).
>
> This could well be a FAQ at the Digiclass wiki, couldn't it? Would you
> and Marco be interesting it outlining that, Eleonora?
Following up on Gabby's request, I thought it might be useful to
someone if I shared what I've found so far (and asked one more question
on the list).
I'm currently looking into the following options:
1. sending queries from Python to Perseus Morpheus for Latin through the
Archimedes Project Morphology Service (XML-RPC Interface,
http://archimedes.mpiwg-berlin.mpg.de/arch/doc/xml-rpc.html). This
option requires no installation, and returns a list of possible
lemmas/morphological analyses (POS tags) for each form. It's fairly
easy, and I have already implemented it.
2. using the Python Classical Language ToolKit (CLTK, see
http://docs.cltk.org/en/latest/latin.html#lemmatization). This is based
on the NLTK and seems to use the NLTK part-of-speech tagger rather than
Morpheus (please correct me if I'm wrong). This requires installation
(not a big deal), and chooses the best lemma/morph. analysis for you. I
also wrote my own Python code for this: fairly easy too.
3. sending queries to Perseus Morpheus for Latin through the JavaScript
code that is available on GitHub at
https://github.com/balmas/tei-digital-age/blob/master/src/js/tei-lod.js#L134-L195
and
https://github.com/alpheios-project/arethusa/blob/master/app/js/arethusa.morph/bsp_morph_retriever.js#L67-L115
(the Arethusa code).
4. LatLem looks like a great piece of software, and we'll hear more
about it soon, but I'll leave it to Eleonora Litta Modigliani and Marco
Passarotti, who are working on this.
5. Eleonora and Marco suggested that I download/install a lemmatizer/POS
tagger like TreeTagger and
5a. use it with one of the two 'parameter files' for Latin that are
available on the TreeTagger website,
5b. or better, re-train it on one of the Latin treebanks found on the
Universal Dependencies website
(https://universaldependencies.github.io/docs/),
and then lemmatize/tag my text with TreeTagger.
5c. Or let TreeTagger choose among the lemmas/POS tags output by
Morpheus.
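For anyone who wants to try #1, the general shape of such a query can be sketched with Python's standard xmlrpc.client module. This is only a rough illustration and not the code in the repository linked at the end of this message: the endpoint URL, the remote method name and the (language, form) argument order are all placeholders that have to be checked against the service documentation.

```python
import xmlrpc.client


def connect(endpoint_url):
    # Building the proxy does not contact the server; requests are only
    # sent when a remote method is called.
    return xmlrpc.client.ServerProxy(endpoint_url)


def lookup_forms(proxy, method_name, language, forms):
    """Collect the candidate analyses that the service returns per form.

    ASSUMPTION: `method_name`, `language` and the (language, form)
    argument order are placeholders, not the documented interface.
    """
    remote_call = getattr(proxy, method_name)
    candidates = {}
    for form in forms:
        # One request per form; the reply is stored exactly as returned.
        candidates[form] = remote_call(language, form)
    return candidates
```

The result is a dictionary mapping each form to whatever list of analyses the service sends back, which is the "list to choose from" that #1 leaves you with.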
First off, a question to Bridget: do #1 and #3 ultimately use the same
lemmatizer/parser (Morpheus), so it's just a matter of interface?
For me, the choice between #1 and #2 really depends on whether you want
to get a list of possible lemmas/morphological analyses to choose from
(#1) or you prefer the software to choose the best one for you (#2).
To sum up: I already wrote the code to implement options #1 and #2, but
am not yet satisfied, because
#1 requires me to choose from a list of possible lemmas/POS tags for
each form, which would take a long time.
#2 seems to choose one best lemma/POS tag, but only based on how often
that lemma/POS tag is correct in general (i.e.: it does not make the
contextual evaluation that TreeTagger does) -- Please correct me if I'm
wrong.
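To make the difference concrete, here is a minimal sketch of a purely frequency-based choice among candidate analyses, which is what #2 would amount to if the guess above is right. It is not CLTK's actual implementation; the function names and the toy counts are made up for illustration.

```python
from collections import Counter


def train_frequency_table(tagged_tokens):
    """Count how often each (lemma, tag) analysis occurs for each form."""
    table = {}
    for form, lemma, tag in tagged_tokens:
        table.setdefault(form, Counter())[(lemma, tag)] += 1
    return table


def pick_most_frequent(form, candidates, table):
    """Return the candidate seen most often for this form, ignoring context."""
    counts = table.get(form, Counter())
    # max() keeps the first candidate on ties, including unseen forms.
    return max(candidates, key=lambda analysis: counts[analysis])


# Toy, made-up training data: (form, lemma, tag).
training = [
    ("canis", "canis", "NOUN"),
    ("canis", "canis", "NOUN"),
    ("canis", "cano", "VERB"),
]
table = train_frequency_table(training)

# Candidate analyses of the kind a morphological analyser might return.
candidates = [("cano", "VERB"), ("canis", "NOUN")]
choice = pick_most_frequent("canis", candidates, table)
print(choice)
```

With these counts "canis" always comes out as the noun, even in a sentence where it is the verb form of "cano"; a tagger that looks at the neighbouring tags can in principle get such cases right.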
I'm currently following the path outlined in #5a and #5b. I downloaded
and installed TreeTagger as well as the treebanks and am currently
learning how to use it. It's a long shot, but this is a very fascinating
field and I want to see it through.
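For #5b, the treebank first has to be converted into what the TreeTagger training program reads. The sketch below assumes that the treebank is in CoNLL-U format (ten tab-separated columns), that the training data is one "token TAB tag" pair per line, and that the lexicon has one line per form followed by tab-separated "tag lemma" pairs. Please check these formats against the TreeTagger README; the function names and the two sample rows are made up for illustration.

```python
def read_conllu(lines):
    """Yield (form, lemma, upos) for each ordinary token in CoNLL-U input."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # blank lines separate sentences; '#' lines are comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges such as 2-3 and malformed rows
        yield cols[1], cols[2], cols[3]


def to_treetagger(tokens):
    """Build tagged training lines and a full-form lexicon (assumed formats)."""
    train_lines = []
    lexicon = {}
    for form, lemma, tag in tokens:
        train_lines.append(f"{form}\t{tag}")
        lexicon.setdefault(form, set()).add((tag, lemma))
    lexicon_lines = [
        form + "\t" + "\t".join(f"{tag} {lemma}" for tag, lemma in sorted(pairs))
        for form, pairs in sorted(lexicon.items())
    ]
    return train_lines, lexicon_lines


# Made-up two-token sample in CoNLL-U layout.
sample = [
    "# text = arma cano",
    "1\tarma\tarma\tNOUN\t_\t_\t_\t_\t_\t_",
    "2\tcano\tcano\tVERB\t_\t_\t_\t_\t_\t_",
    "",
]
train_lines, lexicon_lines = to_treetagger(read_conllu(sample))
```

The two lists can then be written to files and passed to the training program, together with the other input it expects (for example the open-class tag file).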
I shared the code I wrote for #1 and #2 at
https://github.com/paolomonella/ursus/tree/master/lemma
Best,
Paolo