At 18:47 02/12/97 -0500, [log in to unmask] wrote:
>
>Has anyone any good tips for optimising OCR-ing? We're using a HP
>Scanjet 5p and Adobe Capture. This combination works brilliantly
>for PDF but it's not so good for OCR-ing documents to convert to HTML.
>It works - but it's quite slow and not very accurate.
>
>Has anyone else had similar problems? I wondered if it's worth getting
>Adobe PageMill - if anyone's used this, I'd welcome their opinion.
>This or any other advice on how to get our present system to work more
>efficiently, or any recommendations as to alternative software would be
>welcomed! Thanks!
>
>All the best
>Chris
Chris,
Interesting question. I assume that the process you are describing above is
one where you use Capture to drive the scanner and create the output files,
but rather than PDF the output is in a Word or ASCII type of format, which
you then convert to HTML? The actual output text should be no more or less
accurate than when Capture does the conversion to PDF. The main difference
is that PDF has the option of displaying bitmaps of words where there is
lower confidence in their accuracy. In text output this is not possible, so
more of the inaccuracies become apparent.
I would recommend not using any element of Capture, either as a driver for
your scanner or for the OCR (creating non-PDF output is not Adobe's primary
purpose and thus the software is not optimal for it). If you use a software
product like OmniPage or TextBridge, which are designed to do nothing other
than OCR, you may well get a much better level of accuracy. You also have
the opportunity to run the same page through the OCR engine a number of
times, as the OCR software is adaptive and can improve with more attempts,
or you could even rank the results of the separate passes and keep the best.
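As a rough illustration of the ranking idea only: the sketch below assumes
you have already saved the text from several OCR passes of the same page,
and it scores each pass by the fraction of its words found in a word list.
The function names, the tiny word list and the sample strings are all
hypothetical, and this is not a feature of OmniPage, TextBridge or Capture.

```python
import re

def lexicon_score(text, lexicon):
    """Fraction of alphabetic tokens found in the lexicon (0.0 if none)."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in lexicon)
    return hits / len(words)

def rank_ocr_results(candidates, lexicon):
    """Return the candidate texts sorted best-first by lexicon score."""
    return sorted(candidates,
                  key=lambda t: lexicon_score(t, lexicon),
                  reverse=True)

# Hypothetical word list and three made-up passes over the same line.
lexicon = {"the", "quick", "brown", "fox"}
passes = [
    "tbe qu1ck brown fox",
    "the quick brown fox",
    "the quick brovvn fox",
]
best = rank_ocr_results(passes, lexicon)[0]
print(best)
```

In practice the word list would need to be a proper dictionary, and proper
names or technical terms will score as errors, so treat the score as a
guide to which pass to proof-read first rather than a measure of accuracy.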
Please note, though, that OCR accuracy is heavily dependent on the condition
of the original material scanned. The smaller the characters, the dirtier
the page, the more complex the page layout, and the more variation across
the page in font, ink depth, fading etc., the more the OCR accuracy will be
compromised. There is obviously a trade-off to be made between the amount
of editing and manual correction that you are prepared to do and the amount
of time spent optimising the OCR stage.
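To put rough numbers on that trade-off, here is a minimal sketch. The page
size, accuracy figures and seconds-per-correction are invented for
illustration and are not measurements from any scanner or OCR package; the
function names are hypothetical too.

```python
def expected_errors(chars_per_page, accuracy):
    """Expected wrong characters on a page at a given character accuracy."""
    return chars_per_page * (1.0 - accuracy)

def correction_minutes(chars_per_page, accuracy, seconds_per_fix):
    """Rough manual correction time per page, in minutes."""
    return expected_errors(chars_per_page, accuracy) * seconds_per_fix / 60.0

# Illustrative assumptions: 2000 characters per page, 10 seconds per fix.
for accuracy in (0.95, 0.98, 0.995):
    errors = expected_errors(2000, accuracy)
    minutes = correction_minutes(2000, accuracy, 10)
    print(f"{accuracy:.1%} accurate: about {errors:.0f} errors, "
          f"{minutes:.1f} min of correction per page")
```

The point is simply that a few percentage points of accuracy can change
the editing workload several times over, so time spent on the OCR stage
may pay for itself when there are many pages.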
Hope this helps. Feel free to contact me if you need further assistance.
Regards,
Simon
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Simon Tanner Email: [log in to unmask]
Digitisation Consultant Phone: 01707 286078
Higher Education Digitisation Service Fax: 01707 286079
University of Hertfordshire Web: http://heds.herts.ac.uk