Thanks Matteo, I had seen Federico's Tesseract training sets. I
started off using them as the basis for my own training, but ended up
finding that I got better results generating one from scratch.
Also, a lot of the work to get reasonable OCR results comes from
configuration and replacement rules, as well as the base "this
character looks like this" sort of stuff (see my article for more on
that, if you're curious).
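
To give a rough idea of what "replacement rules" can mean in practice, here is
a minimal Python sketch of post-OCR cleanup for Greek text. The rules, names
(GREEK, LOOKALIKES, RULES, apply_rules) and character choices are all
hypothetical illustrations, not the rules actually used in this work and not
part of Tesseract itself.

```python
import re

# Rough character-class body covering the Greek and Greek Extended blocks.
GREEK = r"\u0370-\u03ff\u1f00-\u1fff"

# Hypothetical look-alike table: Latin letters an OCR engine might emit
# in place of visually similar Greek ones.
LOOKALIKES = {
    "o": "\N{GREEK SMALL LETTER OMICRON}",
    "v": "\N{GREEK SMALL LETTER NU}",
}
LATIN = "".join(LOOKALIKES)

# A Latin look-alike counts as an error only when it touches a Greek letter,
# so ordinary Latin-script words are left alone.
LOOKALIKE_RE = re.compile(
    rf"(?<=[{GREEK}])[{LATIN}]|[{LATIN}](?=[{GREEK}])"
)

# Hypothetical context rules as (compiled pattern, replacement) pairs.
RULES = [
    # Medial sigma at the end of a word becomes final sigma.
    (re.compile("\N{GREEK SMALL LETTER SIGMA}\\b"),
     "\N{GREEK SMALL LETTER FINAL SIGMA}"),
    # Final sigma followed by another word character becomes medial sigma.
    (re.compile("\N{GREEK SMALL LETTER FINAL SIGMA}(?=\\w)"),
     "\N{GREEK SMALL LETTER SIGMA}"),
]


def apply_rules(text):
    """Apply the look-alike table, then each context rule, in order."""
    text = LOOKALIKE_RE.sub(lambda m: LOOKALIKES[m.group(0)], text)
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    # Greek "logos" misread with Latin "o" twice and a medial sigma at the end.
    print(apply_rules("\u03bbo\u03b3o\u03c3"))
```

Keeping the rules as data (a table plus a list of patterns) makes it easy to
add or drop a rule after checking its effect on a sample page. Note that this
single pass only fixes a Latin letter that directly touches a Greek one.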
Nick
On Fri, Mar 15, 2013 at 08:31:21PM +0100, Matteo Romanello wrote:
> Dear Nick (and all),
>
> Just in case you haven't seen this already, Federico Boschetti has also made
> available on his website <http://www.himeros.eu/> the training sets he used for
> Tesseract (see section "Ancient Greek OCR Trainings"). I think it's great that
> a body of distributed, openly available training datasets for OCR engines
> starts to emerge.
>
> If I have the chance next week I will try to add this perhaps together with the
> links and publications provided by Lisa (thanks!) to the wiki page.
>
> Cheers,
> Matteo
>
> On Fri, Mar 15, 2013 at 6:26 PM, Nick White wrote:
>
> On Fri, Mar 15, 2013 at 09:37:41AM +0100, Notis Toufexis • Νότης Τουφεξής
> wrote:
> > This is good news. I am just wondering if this is too technical for some
> > users.
>
> I wouldn't say Tesseract was too technical for many users, really.
> There are several different GUIs which wrap around the engine and
> make it easy to use; there is a list at:
> http://code.google.com/p/tesseract-ocr/wiki/3rdParty
>
> Admittedly it isn't as geared to desktop users as something like
> ABBYY, but it shouldn't be too much work to figure out.
>
> > I remember hearing Greg Crane talking about OCR in an event in London, it
> > was about OCR with commercial products, stripping accents and putting them
> > back again with the use of scripts -- I might have some notes somewhere.
>
> Sounds like an interesting (if rather unpleasant ;)) idea. I suspect
> just providing a good training file, and tweaking OCR engine
> parameters to ensure things like good line segmentation would be
> preferable, though.
>
> Nick White
>
>