Dear Dot and list,
handling variant spellings -- turned on or off ut libet, of course -- is
vital for searches in collections of Latin texts (u = v, ae = e, i = j =
ij etc.; perhaps variant spellings are less important for Greek?).
Knowledge of and experience with how search engines have handled these
variants would be very valuable. So, if anybody would be willing to share
existing solutions (perhaps even in the form of a howto on the
digitalclassicist wiki) -- I, for one, would warmly welcome it. It seems
that here (too) Helma Dik and Perseus under PhiloLogic have gone
furthest...
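
To make the equivalences above concrete, here is a minimal sketch in Python of folding a word to a search key, with each rule switchable on or off. It is not taken from PhiloLogic or any other existing engine; the function name `fold`, the rule labels, and the reading of "i = j = ij" as "collapse ij/ii to a single i" are all my own assumptions for illustration.

```python
import re

# Hypothetical rule labels; every rule can be switched off by omitting it.
ALL_RULES = frozenset({"u=v", "ae=e", "i=j", "i=ij"})

def fold(word, rules=ALL_RULES):
    """Return a normalised search key for a Latin word form.

    Assumption: the equivalences are applied as plain string rewrites
    on the lower-cased form, without any dictionary lookup.
    """
    key = word.lower()
    if "u=v" in rules:
        key = key.replace("v", "u")
    if "ae=e" in rules:
        key = key.replace("ae", "e")
    if "i=j" in rules:
        key = key.replace("j", "i")
    if "i=ij" in rules:
        # Collapse "ij" (or "ii") to a single "i".
        key = re.sub(r"i[ij]", "i", key)
    return key

if __name__ == "__main__":
    for a, b in [("Ivlij", "Iulii"), ("caelum", "celum"), ("ejus", "eius")]:
        print(a, b, fold(a) == fold(b))
```

Two forms count as a match when their keys are equal; passing a smaller set of rules (or an empty one) gives the stricter search.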
For our collection of early modern Croatian Latin texts -- including also
legal writings and documents -- we have to implement some kind of
"orthography tables" for similarity searching; as we have been planning to
use the XTF architecture (http://sourceforge.net/projects/xtf/; wiki at
http://xtf.wiki.sourceforge.net/), I will see there how their "spelling
correction" works (http://xtf.wiki.sourceforge.net/underHood_Spelling).
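
I do not yet know how XTF does this internally, so the following is only a rough illustration of table-free similarity searching, using nothing but the Python standard library (`difflib`). The function name `similar_forms`, the sample word list, and the cutoff of 0.8 are assumptions chosen for the example, not tested values.

```python
import difflib

def similar_forms(query, vocabulary, cutoff=0.8, limit=10):
    """Return attested word forms that resemble the query.

    Assumption: `vocabulary` is the list of distinct word forms
    extracted from the corpus; similarity is difflib's ratio, not
    a linguistically informed measure.
    """
    forms = sorted({w.lower() for w in vocabulary})
    return difflib.get_close_matches(query.lower(), forms,
                                     n=limit, cutoff=cutoff)

if __name__ == "__main__":
    # Hypothetical sample of forms from a legal text.
    vocab = ["testamentum", "testamentvm", "testamento", "heres", "haeres"]
    print(similar_forms("testamentum", vocab))
    print(similar_forms("heres", vocab))
```

The user types one form and sees the candidate variants only afterwards, in the result list; a lower cutoff gives more (and noisier) candidates.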
Yours,
Neven
Zagreb, Hrvatska / Croatia
>
> I'm looking for a search engine to handle what I guess is termed
> "fuzzy searching" across a corpus of Latin legal texts.
>
> Essentially, what we will have are TEI tagged transcriptions, but we
> will not have detailed parts of speech encoding (and I don't believe
> it's realistic to add such encoding), so the search could not rely on
> tags. Variant spellings are a huge issue, so we
> would like a search that is "smart" in the sense that it will have
> some kind of algorithmic approach to finding potential variant
> spellings (as opposed to relying on a list of known variant
> spellings). We do not want to rely on any kind of Boolean searching
> (commas, curly brackets, etc.). We want a search where the user will
> discover the variants *after the fact* (once the search is done),
> rather than having to make a determination about what those variants
> might be ahead of time. Finally, the search will need to work both
> within single transcriptions, and across multiple transcriptions
> (potentially across the entire corpus).
>
> Is anyone on-list familiar with any existing search engines or frameworks
> that suit our needs, or that might be modified to suit them?
>
> Thanks,
> Dot
>