Dear Nick (and all),

In case you haven't seen this already, Federico Boschetti has also made the training sets he used for Tesseract available on his website <http://www.himeros.eu/> (see the section "Ancient Greek OCR Trainings"). I think it's great that a body of distributed, openly available training datasets for OCR engines is starting to emerge.

If I have the chance next week, I will try to add this to the wiki page, perhaps together with the links and publications provided by Lisa (thanks!).

Cheers,
Matteo

On Fri, Mar 15, 2013 at 6:26 PM, Nick White <[log in to unmask]> wrote:
On Fri, Mar 15, 2013 at 09:37:41AM +0100, Notis Toufexis • Νότης Τουφεξής wrote:
> This is good news. I am just wondering if this is too technical for some users.

I wouldn't say Tesseract was too technical for many users, really.
There are several different GUIs which wrap around the engine and
make it easy to use; there is a list at:
http://code.google.com/p/tesseract-ocr/wiki/3rdParty

Admittedly it isn't as geared to desktop users as something like
ABBYY, but it shouldn't be too much work to figure out.

> I remember hearing Greg Crane talking about OCR at an event in London. It was
> about OCR with commercial products, stripping accents and putting them back
> again with the use of scripts -- I might have some notes somewhere.

Sounds like an interesting (if rather unpleasant ;)) idea. I suspect
just providing a good training file, and tweaking OCR engine
parameters to ensure things like good line segmentation would be
preferable, though.
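
For anyone curious what the strip-and-restore approach might look like, here is a minimal sketch in Python. It is only an illustration under assumptions: nothing in this thread describes how the scripts mentioned above actually worked, and the lexicon lookup used to put accents back (the hypothetical `build_lexicon` and `restore_accents` helpers) is just one plausible way to do it. Words whose stripped form is ambiguous or unknown are left unaccented.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks (accents, breathings, iota subscript)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def build_lexicon(accented_words):
    """Hypothetical helper: map each stripped form to its accented forms."""
    lexicon = {}
    for word in accented_words:
        lexicon.setdefault(strip_accents(word), set()).add(word)
    return lexicon


def restore_accents(word, lexicon):
    """Hypothetical helper: restore accents only when exactly one candidate exists."""
    candidates = lexicon.get(word, set())
    if len(candidates) == 1:
        return next(iter(candidates))
    return word  # unknown or ambiguous: leave the OCR output untouched


# Assumed word list; a real one would come from a corrected corpus.
lexicon = build_lexicon(["λόγος", "ἄνθρωπος", "ἤ", "ἦ"])
print(restore_accents("λογος", lexicon))
print(restore_accents("η", lexicon))
```

The ambiguous case (several accented forms collapsing to the same stripped form) is where this gets unpleasant, which is one reason a good training file is likely the better route.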

Nick White