Just a thought, but you want to find the terms in your set of texts
that aren't good discriminators. If you go ahead and build a Lucene
index, you can use Luke (http://www.getopt.org/luke/) to show you the
top-ranking terms. The top N terms in that list (and of course you
have to figure out what N is) are probably what you want to build your
bigrams from.
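The same idea can be sketched without Lucene or Luke: rank terms by
how many texts they occur in, treat the top N as poor discriminators,
and keep only the bigrams anchored on them. A minimal sketch (the tiny
corpus and the choice of N = 3 are illustrative assumptions, not part
of any real index):

```python
from collections import Counter

# Hypothetical stand-in for the set of indexed texts.
texts = [
    "arma virumque cano troiae qui primus ab oris",
    "qui primus ab oris italiam fato profugus",
]

# Document frequency: a term present in most texts discriminates poorly.
df = Counter()
for text in texts:
    df.update(set(text.split()))

N = 3  # N has to be tuned for the corpus, as noted above
common = {term for term, _ in df.most_common(N)}

# Keep the bigrams that involve at least one high-frequency term.
bigrams = [
    (a, b)
    for text in texts
    for a, b in zip(text.split(), text.split()[1:])
    if a in common or b in common
]
```

With a real index you would read the document frequencies out of the
index itself rather than recount them, but the ranking-and-threshold
step is the same.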
H
On Aug 25, 2008, at 6:34 PM, Neven Jovanović wrote:
> Benjamin is on the right track: I am also thinking about publishing
> some texts in an XTF system, and how to prepare filters for more
> efficient searches.
>
> Perseus under PhiloLogic has the option of including or excluding
> "Filtered Words"; unfortunately, at the moment one cannot see which
> these words are
> (cf. http://www.lib.uchicago.edu/efts/PERSEUS/latin.html#, under
> "Refined Search Results").
>
> Also, the Intratext project (http://www.intratext.com/LATINA/) has
> some interesting statistics, but by individual texts only (cf. e.g.
> the statistics for Apuleius' Apologia,
> http://www.intratext.com/IXT/LAT0533/_STAT.HTM). By the way, Marion,
> the Res gestae divi Augusti are also there
> (http://www.intratext.com/y/LAT0278.HTM), and have their own
> statistics (http://www.intratext.com/IXT/LAT0278/_STAT.HTM).
>
> One should check whether something can be got out of the Teubner
> disks, perhaps...
>
> But thanks to everybody for the status quaestionis.
>
> I agree with Gabriel: to be useful, the list will eventually have to
> exist in many different formats. Also, some documentation on how the
> list was compiled, which data were used, what was excluded, etc.
> would be welcome; ditto a bibliography, as started by Tom Elliott.
>
> Neven
>
> Zagreb, Hrvatska / Croatia