I've done quite a bit of work recently on getting the Tesseract OCR
engine to cope well with Ancient Greek, as part of the ERC
project Living Poets [1]. The 'training' file
resulting from that is now available from their website. I wrote an
article on how I went about it and some of the issues involved,
which is available at [2], with associated code and whathaveyou at
[3].
This should certainly be pointed to on the Digital Classicist wiki
page. I'm happy to add something, but won't get a chance to until
next week - if anyone else wants to do it for me, go right ahead!
As to the question of when OCR is appropriate as opposed to hand
keying, I'd say that the quality of OCR output is now good enough
that in general OCRing and then correcting the result is going to be
the best option.
There is certainly scope for more tools to make such hand correction
faster and easier, for example configuring Tesseract to highlight
words / characters it is least certain about, but that would require
a little programming.
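
As a rough sketch of what that "little programming" might look like: this
assumes a recent Tesseract release that can write TSV output (one row per
word, with a "conf" column holding a 0-100 confidence, or -1 for rows that
are not words). The function name and the threshold of 60 are made up for
illustration, not anything Tesseract itself provides.

```python
import csv
import io


def low_confidence_words(tsv_text, threshold=60.0):
    """Return (word, confidence) pairs for words Tesseract was unsure about.

    tsv_text is assumed to be the contents of a Tesseract TSV output file,
    with a header row containing at least the 'conf' and 'text' columns.
    threshold is an arbitrary cut-off chosen for illustration.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    flagged = []
    for row in reader:
        text = (row.get("text") or "").strip()
        try:
            conf = float(row.get("conf", "-1"))
        except (TypeError, ValueError):
            # Malformed or truncated row: skip it rather than guess.
            continue
        # Rows for pages, blocks, paragraphs and lines carry conf == -1
        # and no text, so they are ignored here.
        if conf < 0 or not text:
            continue
        if conf < threshold:
            flagged.append((text, conf))
    return flagged
```

The TSV file itself could be produced with something like
"tesseract page.png page -l grc tsv" (page.png being a placeholder name),
and the flagged words then shown to whoever is doing the correcting so
they know where to look first.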
Hope this is useful,
Nick White
1. http://www.dur.ac.uk/classics/livingpoetsproject/ - there will be
a proper website very soon!
2. http://www.eutypon.gr/eutypon/pdf/e2012-29/e29-a01.pdf
3. http://www.dur.ac.uk/nick.white/grctraining/