Hi everyone,
I'd like to learn from anyone who has undertaken a scanning project on well-formatted text (in which they hold copyright, or where the material is public domain), defining zones on the page and generating structured data (eg XML) from the zone-defined OCR.
In other words, has anyone worked with a third-party supplier of an OCR application, or developed an application in-house, to achieve something like this:
1. my source (eg a reference book) has predictable formatting and layout for identifiable properties (eg title, date, summary, image) so
2. in the scanning / OCR application, I generate a template or model based on those layout zones (eg top row is title; second row is date and country of publication, separated by a comma or hyphen; rows 3 - ? are summary, with horizontal line separating from next entity or record)
3. I scan the pages and run an OCR engine against the template, to capture data from each zone, and
4. generate a structured record for each scanned entity, writing XML - eg
<record number="12345" page="56" book="We Own the Rights">
<title>Tennis is quite boring</title>
<date>2013-07-05</date>
<summary>Tennis may even be more boring than football.</summary>
</record>
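To make steps 1 to 4 concrete, here is a minimal Python sketch of the last stage only: turning already-OCR'd lines into records like the one above. It is an illustration under assumptions, not a description of any product. It assumes the OCR engine returns plain text lines, that the horizontal line between records comes through as a run of dashes or underscores, and that dates are printed like "5 July 2013". The function names `split_records` and `build_record` are hypothetical, and the `<country>` element is my own addition to the example record.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime

# Assumption: the printed horizontal rule is OCR'd as 5+ dashes, underscores or equals signs.
RULE = re.compile(r"^\s*[-_=]{5,}\s*$")
# Date and country are separated by a comma or a spaced hyphen.
DATE_COUNTRY_SEP = re.compile(r"\s*,\s*|\s+-\s+")


def split_records(ocr_lines):
    """Group OCR'd lines into records, using the rule line as the separator."""
    record = []
    for line in ocr_lines:
        if RULE.match(line):
            if record:
                yield record
            record = []
        elif line.strip():
            record.append(line.strip())
    if record:
        yield record


def build_record(lines, number, page, book):
    """Apply the layout template: row 1 title, row 2 date/country, rest summary."""
    if len(lines) < 2:
        raise ValueError("a record needs at least a title row and a date row")
    title, date_line, *summary = lines
    parts = DATE_COUNTRY_SEP.split(date_line, maxsplit=1)
    # Assumption: dates are printed as day, full English month name, year.
    date = datetime.strptime(parts[0], "%d %B %Y").date()

    elem = ET.Element("record", number=str(number), page=str(page), book=book)
    ET.SubElement(elem, "title").text = title
    ET.SubElement(elem, "date").text = date.isoformat()
    if len(parts) > 1:
        ET.SubElement(elem, "country").text = parts[1]  # not in the example above
    ET.SubElement(elem, "summary").text = " ".join(summary)
    return elem


page_lines = [
    "Tennis is quite boring",
    "5 July 2013, UK",
    "Tennis may even be more boring",
    "than football.",
    "----------",
    "Football is quite boring too",
    "6 July 2013 - UK",
    "No further comment.",
]

for offset, lines in enumerate(split_records(page_lines)):
    record = build_record(lines, 12345 + offset, 56, "We Own the Rights")
    print(ET.tostring(record, encoding="unicode"))
```

A real project would also need the zoning step itself (cropping each page image to the template's regions before OCR) and some handling for records that do not fit the template, which is exactly the part I would like to hear about.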
The BBC's Genome project, to digitise the Radio Times, is the highest profile example I can think of, in terms of this kind of zoned OCR to generate XML from formatted text, but I'm interested in hearing about and learning from any analogous, smaller scale projects.
If anyone has experience of an off-the-shelf product that does this, or has worked with an application or service supplier to achieve something similar, I would love to hear about that experience, along with any specific recommendations.
Stephen McConnachie,
BFI
****************************************************************
website: http://museumscomputergroup.org.uk/
Twitter: http://www.twitter.com/ukmcg
Facebook: http://www.facebook.com/museumscomputergroup
[un]subscribe: http://museumscomputergroup.org.uk/email-list/
****************************************************************