A little bit of background might be helpful. The following is an
oversimplified explanation of lemmatized and morphological search in
PhiloLogic - I hope I didn't make errors that are too blatant:-)
Richard Whaling's blogpost at
http://artfl.blogspot.com/2010/05/vector-processing-for-ohco.html
links to a presentation that explains how PhiloLogic stores
documents, and explains some of the differences with XML and
relational databases -- thereby explaining why your mileage might vary
when trying to do similar things using those types of databases.
For the morphology and lemma search we effectively add the lemmas and
morphological codes (and combinations thereof) to the index.
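As a rough illustration of what "adding lemmas and morphological codes
(and combinations thereof) to the index" can mean, here is a toy sketch
in Python. It is not PhiloLogic's code: the token records, the
"lemma:" / "morph:" key prefixes and the morphological codes are all
invented for the example.

```python
from collections import defaultdict

# Hypothetical token records: (byte_offset, surface_form, lemma, morph_code).
# The morph codes are made up for this sketch, not Perseus/PhiloLogic codes.
TOKENS = [
    (0, "arma", "arma", "N.acc.pl"),
    (5, "virum", "vir", "N.acc.sg"),
    (11, "cano", "cano", "V.1sg.pres"),
    (120, "viri", "vir", "N.gen.sg"),
]


def build_index(tokens):
    """Map every search key to the sorted byte offsets where it occurs."""
    index = defaultdict(list)
    for offset, form, lemma, morph in tokens:
        keys = (
            form,                            # plain word search
            f"lemma:{lemma}",                # every form of a lemma
            f"morph:{morph}",                # every word with this analysis
            f"lemma:{lemma}|morph:{morph}",  # lemma and analysis combined
        )
        for key in keys:
            index[key].append(offset)
    return {key: sorted(offsets) for key, offsets in index.items()}


index = build_index(TOKENS)
print(index["lemma:vir"])  # offsets of all forms of "vir"
```

A lemma search then becomes an ordinary index lookup on the
"lemma:..." key, at the cost of a larger index.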
Mark Olsen did the same for the English data of the MONK project;
there is a blogpost about it here:
http://artfl.blogspot.com/2010/01/monk-data-under-philologic-1.html
(search form linked from blogpost).
Mark used a different method here: he essentially reassigned the
'space' that the system allocates to different input methods (full
diacritics, no diacritics, transliteration) to part-of-speech and
lemma. Think of these as normalized spellings, where byte offsets in
the database are now associated with string, parse and lemma.
Note: This implementation for MONK is slightly slower than Richard's
implementation for the classical data, which is not exclusively due to
the size of the corpus. When doing complex searches on the MONK data,
be sure to adjust or disable the browser timeout.
For the morphology, what we can currently handle (imperfectly at
that!) is limited to the words that occur in the Perseus texts, since
we don't have Perseus's Morpheus or some equivalent running as part
of this. We tag
new texts with TreeTagger on the basis of the existing database and
use the disambiguated sections of it to assign probabilities. So this
would not be ideal for documentary texts.
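The idea of using disambiguated text to assign probabilities can be
sketched with simple relative frequencies. This is only a toy model in
Python, not what TreeTagger does internally; the sample data and the
function name are invented for the example.

```python
from collections import Counter, defaultdict

# Hypothetical hand-disambiguated tokens: (surface_form, chosen_parse).
DISAMBIGUATED = [
    ("amor", "noun"),
    ("amor", "noun"),
    ("amor", "verb"),
    ("cano", "verb"),
]


def parse_probabilities(tagged):
    """Estimate P(parse | form) as a relative frequency in the tagged data."""
    counts = defaultdict(Counter)
    for form, parse in tagged:
        counts[form][parse] += 1
    probs = {}
    for form, parse_counts in counts.items():
        total = sum(parse_counts.values())
        probs[form] = {p: n / total for p, n in parse_counts.items()}
    return probs


probs = parse_probabilities(DISAMBIGUATED)
# Most likely parse for an ambiguous form seen in the tagged data.
best = max(probs["amor"], key=probs["amor"].get)
# A form that never occurs in the tagged data gets no estimate at all.
unseen = probs.get("papyrus")
```

A form that is absent from the existing database has nothing to draw
on in such a scheme, which is the limitation that matters for
documentary texts.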
PhiloLogic and its siblings PhiloMine (mining) and PhiloLine (sequence
alignment) are open-source and can be found on code.google.com. If
your interest is primarily in the morphological database, do get in
touch about that as well.
All best,
Helma
On Jun 1, 2010, at 11:36 AM, Hugh Cayless wrote:
> Friends,
>
> I'm deep into rebuilding papyri.info from the ground up, and I've
> gotten to the part where I need lemmatized search (i.e., you put in
> a lemma and the search engine finds documents with all the forms of
> that word in the corpus). I have a database constructed by doing
> lookups on the Archimedes Project Morphology Service using words
> from the papyri.info search index, but I find myself in the position
> of wondering again whether something better has become available in
> the year since I last did this.
>
> I know the PhiloLogic folks have been enhancing its morphological
> search capabilities for the Perseus texts, for example. What's out
> there that might help me?
>
> Thanks for any help!
>
> Hugh