I usually find that a well-designed Lucene implementation meets all such
requirements.
Lucene is perhaps the most widely used and best-known free, open-source
search engine, and has implementations in both Java and .NET: just google
for LUCENE. I also recommend the book Lucene in Action by O. Gospodnetic
and E. Hatcher.
You can build your own tokenizers and analyzers for variants and custom
algorithms, use synonyms in searches, and have a complete query syntax out
of the box with fuzzy queries, boolean clauses, boost factors, range
queries, metadata queries, and the like. Personally, I use a .NET
implementation for a couple of my projects (one of them, Cadmus, can be
found at www.fusisoft.it).
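To give an idea of what a fuzzy query does: Lucene's fuzzy matching is based
on edit distance, and in its query syntax a trailing tilde (e.g. gratia~)
asks for terms close to the one typed. The Python sketch below is not Lucene
code, only a minimal illustration of the principle; the function names and
the tiny word list are made up for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_matches(query, vocabulary, max_edits=2):
    """Return the vocabulary words within max_edits of query,
    closest first (ties broken alphabetically)."""
    q = query.lower()
    hits = [(edit_distance(q, w.lower()), w) for w in vocabulary]
    return [w for d, w in sorted(hits) if d <= max_edits]


# Hypothetical index terms, including some spelling variants.
vocabulary = ["gratia", "gracia", "grattia", "pretium", "precium", "lex"]

print(fuzzy_matches("gratia", vocabulary, max_edits=1))
# -> ['gratia', 'gracia', 'grattia']
```

The user types one form and discovers the variants in the results, without
listing them in advance. A real engine does this against its term index
rather than by scanning every word as this sketch does.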
Good luck!
Daniele Fusi
-----Original Message-----
From: The Digital Classicist List [mailto:[log in to unmask]]
On Behalf Of Dot Porter
Sent: Monday, 1 September 2008 14:02
To: [log in to unmask]
Subject: [DIGITALCLASSICIST] Searching for a search engine
[apologies for the cross-posting, and for the slightly redundant
subject line. It's not even very funny.]
I'm looking for a search engine to handle what I guess is termed
"fuzzy searching" across a corpus of Latin legal texts.
Essentially, what we will have are TEI tagged transcriptions, but we
will not have detailed parts of speech encoding (and I don't believe
it's realistic to add such encoding), so the search could not rely on
tags. Variant spellings are a huge issue, so we
would like a search that is "smart" in the sense that it will have
some kind of algorithmic approach to finding potential variant
spellings (as opposed to relying on a list of known variant
spellings). We do not want to rely on any kind of Boolean searching
(commas, curly brackets, etc.). We want a search where the user will
discover the variants *after the fact* (once the search is done),
rather than having to make a determination about what those variants
might be ahead of time. Finally, the search will need to work both
within single transcriptions, and across multiple transcriptions
(potentially across the entire corpus).
Is anyone on-list familiar with any existing search engines or frameworks
that suit our needs, or that might be modified to suit them?
Thanks,
Dot