JISCMail - DIGITALCLASSICIST Archives

PhiloLogic will do much of this as well, out of the box. Do a  
similarity search, and check the boxes for the words you're interested  
in.

Check out
http://www.lib.uchicago.edu/efts/PERSEUS/latin.html
for an example.

Try misspelling a Latin or English word likely to appear in the  
Perseus texts, and see what happens after you check the Similarity  
Search box.
You can search within all documents, or within a subset, or just one,  
by filling in further search fields (author, title, etc.).

From the PhiloLogic manual:
Similarity searches allow you to check for similar or alternative  
spellings for your search query that might exist within a collection  
of texts. To execute a similarity search, click the box immediately  
following the main search box labelled Similar Word Search. No  
numbers, textual punctuation, or wildcards are allowed when performing  
similarity searches. After entering your search term and submitting  
your search, if your search string sufficiently resembles (as defined  
by AGREP) strings that exist in the indices, you will be returned a
list of potential search terms and checkboxes. The resulting search is  
an OR search incorporating all of your selected search terms.
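
For anyone who wants to see the general idea in code, here is a minimal sketch of that workflow: match the query approximately against the indexed word list, let the user pick from the candidates, then OR the picked terms together. This is not PhiloLogic's implementation. It swaps in a plain Levenshtein edit distance where PhiloLogic uses AGREP, and the toy index, the function names, and the max_errors threshold are all assumptions made up for illustration.

```python
# Illustrative sketch only: a toy inverted index (word -> document ids).
# The words and document ids below are invented for the example.
INDEX = {
    "iustitia": {"doc1", "doc3"},
    "justitia": {"doc2"},
    "iusticia": {"doc4"},
    "iudex": {"doc1"},
    "lex": {"doc2", "doc3"},
}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similar_terms(query: str, index: dict, max_errors: int = 1) -> list:
    """Return indexed words within max_errors edits of the query.

    max_errors is a hypothetical tuning knob, not a PhiloLogic setting.
    """
    q = query.lower()
    return sorted(t for t in index if edit_distance(q, t) <= max_errors)


def or_search(selected_terms, index: dict) -> list:
    """Union of the documents containing any of the selected terms."""
    hits = set()
    for term in selected_terms:
        hits.update(index.get(term, ()))
    return sorted(hits)


# The user types one spelling and discovers the variants afterwards.
candidates = similar_terms("iustitia", INDEX)
# Pretend the user ticked every checkbox in the candidate list.
documents = or_search(candidates, INDEX)
```

Scanning the whole word list like this is fine for a toy example. A real index would need something faster, and the threshold would need tuning so that short words do not match everything.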



On Sep 1, 2008, at 7:16 PM, Hugh Cayless wrote:

> Search engine query might have been better :-).
>
> I'd second Daniele's Lucene recommendation, though I'm not sure it  
> will do precisely what you need out of the box.  If you want  
> something tuned to your texts, you're probably still looking at some  
> programming time.
>
> Some of this depends on your environment, obviously.  Java and/or  
> Windows may not be acceptable in some IT organizations.
>
> I found what looks at a cursory glance like a pretty good paper  
> characterizing and comparing F/OSS search engines here: http://wrg.upf.edu/WRG/dctos/Middleton-Baeza.pdf
>
> HTH,
> Hugh
>
> On Sep 1, 2008, at 8:02 AM, Dot Porter wrote:
>
>> [apologies for the cross-posting, and for the slightly redundant
>> subject line. It's not even very funny.]
>>
>> I'm looking for a search engine to handle what I guess is termed
>> "fuzzy searching" across a corpus of Latin legal texts.
>>
>> Essentially, what we will have are TEI tagged transcriptions, but we
>> will not have detailed parts of speech encoding (and I don't believe
>> it's realistic to add such encoding), so the search could not rely on
>> tags. Variant spellings are a huge issue, so we
>> would like a search that is "smart" in the sense that it will have
>> some kind of algorithmic approach to finding potential variant
>> spellings (as opposed to relying on a list of known variant
>> spellings). We do not want to rely on any kind of Boolean searching
>> (commas, curly brackets, etc.). We want a search where the user will
>> discover the variants *after the fact* (once the search is done),
>> rather than having to make a determination about what those variants
>> might be ahead of time. Finally, the search will need to work both
>> within single transcriptions, and across multiple transcriptions
>> (potentially across the entire corpus).
>>
>> Is anyone on-list familiar with any existing search engines or
>> frameworks that suit our needs, or that might be modified to suit
>> them?
>>
>> Thanks,
>> Dot