Some time ago I sent a message to this list asking for feedback on
scanning experiences. Some responses were sent to the list, but a couple
came to me privately. I have checked with the authors, and they are happy
to have them distributed. So here goes...
Date: Thu, 30 May 1996 10:04:13 +0100
From: Gerard Lowe <[log in to unmask]>
Organization: Modern Humanities Research Association
I saw your posting and thought I would reply to you rather than the list
since I am not involved in an eLib project. I have, however, been involved
in scanning back volumes of the MHRA's Annual Bibliography of English
Language & Literature and this experience may be of help.
>I was wondering if eLib projects involved in scanning and OCRing text would
>be willing to share their experiences on this list. Things like throughput
>and accuracy achieved for particular scanners and OCR software, for
>example.
The best output I have seen is from a Kurzweil scanner (we use the one
at Cambridge University Computing Centre). The source is pretty
clean and the typefaces fairly clear, and the results have been
more accurate than I had expected.
> How much of the time is spent proof-reading, and do you use the
>same people or look for different people with specialist proof-reading
>skills (they must exist)?
After OCRing we run some parsing routines and I then print out the
result for my proof reader to read on paper. Most of the errors
from the OCR are minor (problems with ligatures, spacing of words)
and many can be corrected globally.
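As a rough illustration of what "corrected globally" could look like, here
is a minimal Python sketch. The substitution table, the spacing rules and
the function name are assumptions made for illustration only; the message
does not say which tools or patterns were actually used.

```python
import re

# Hypothetical examples of systematic OCR slips (not taken from the message):
# single-character ligatures that should be expanded to plain letters.
GLOBAL_FIXES = [
    ("\ufb01", "fi"),  # "fi" ligature character
    ("\ufb02", "fl"),  # "fl" ligature character
]

def apply_global_fixes(text):
    """Apply every known substitution to the whole text in one pass."""
    for wrong, right in GLOBAL_FIXES:
        text = text.replace(wrong, right)
    # Collapse runs of spaces or tabs into a single space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    # Remove stray spaces that appear before common punctuation marks.
    text = re.sub(r" +([,.;:])", r"\1", text)
    return text

cleaned = apply_global_fixes("The \ufb01rst  \ufb02oor , please .")
```

Errors that do not follow a pattern like this would still need the
proof reader.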
The most time-consuming aspect of the job by far is the proof-reading
and mark-up and this is likely to prove the most costly part of the
project. The method one chooses (OCR, double-keying) should probably
depend upon the level of markup one wishes to introduce. I was working
to add records to our online database and the method we have seems
appropriate, but if one is aiming at a more sophisticated level
(SGML being the obvious choice) I think it may be most cost-effective
to double key and have codes inserted by the keyers.
>Does anyone have any feel for what throughput rates they _need_ to achieve
>in order to make their projects viable? Or what price they might be
>prepared to pay for digitisation, and what they would need included in that
>price?
Cambridge University Computing Centre charges about £5 an hour for
the Kurzweil operator. We scanned a 750-page A5 volume for about £100.
The real costs come in proof-reading.
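For anyone wanting a throughput figure from those numbers, the arithmetic
can be sketched as below. It assumes the whole £100 was operator time at
£5 an hour, which the message does not state explicitly, so treat the
results as approximate.

```python
def scanning_figures(total_cost, hourly_rate, pages):
    """Derive rough throughput and unit cost from the quoted figures.

    Assumes total_cost is made up entirely of operator time.
    """
    hours = total_cost / hourly_rate
    return {
        "operator_hours": hours,
        "pages_per_hour": pages / hours,
        "pence_per_page": 100 * total_cost / pages,
    }

# Figures quoted above: about 100 pounds, 5 pounds an hour, 750 pages.
figures = scanning_figures(total_cost=100, hourly_rate=5, pages=750)
```

On that assumption the job took about 20 operator hours, or roughly 37
pages an hour, at a little over 13p a page before any proof-reading.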
Hope this is useful.
Regards
-------------------------------------------------------------------
Gerard Lowe [log in to unmask]
Editor, ABELL [log in to unmask] (MIME ok)
MHRA http://www.cam.ac.uk/Libraries/MHRA/Gerard.html
University Library Voice: +44 (0) 1223 333058
West Road Fax: +44 (0) 1223 501470
Cambridge CB3 9DR
UK
--------------------------------------------------------------------
And also...
From: Paul Toyne <[log in to unmask]>
Date: Thu, 30 May 1996 12:32:09 +0100 (BST)
I was forwarded your mail regarding the use of scanning and OCR software
for eLib projects.
I am working on one such project and am using Adobe Acrobat, or more precisely
Adobe Capture, to OCR the documents.
I have found the Capture software to produce very good results: on an average
page it manages an accuracy of about 90% to 95%. After the main scanned image
has been processed it can be run through another program that comes as part of
the Capture suite, called Review. From there you can select the unrecognised
words and type what they should be, optionally adding them to the dictionary.
Once a word has been added to the dictionary, the main conversion process will
recognise future occurrences of that word.
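The review-and-learn cycle described here can be modelled with a toy
Python sketch. This is not Adobe Capture's interface; the function name,
the sample words and the idea of holding the dictionary as a simple set
are all assumptions made for illustration.

```python
def find_unrecognised(words, dictionary):
    """Return the words not in the dictionary, in order, without repeats."""
    seen = set()
    unknown = []
    for word in words:
        key = word.lower()
        if key not in dictionary and key not in seen:
            seen.add(key)
            unknown.append(word)
    return unknown

# Hypothetical starting dictionary and two made-up pages of text.
dictionary = {"the", "library", "holds", "many"}
page1 = "The library holds many incunabula".split()
page2 = "Many incunabula survive".split()

first_pass = find_unrecognised(page1, dictionary)   # reviewer is shown "incunabula"
dictionary.add("incunabula")                        # reviewer adds it to the dictionary
second_pass = find_unrecognised(page2, dictionary)  # the added word is no longer flagged
```

The point of the sketch is only that each word added during review
reduces the number of words flagged on later pages.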
I hope this information helps. If you would like more, please don't hesitate
to mail me.
---_____ _____ ___ ___ ______ _____ __ _____ _____
/ ___ \/ ___ \/ / /\/ / /___ __\/ ____\/ /__/\/ ___ \/ ____\
/ _____/ ___ / /_/ / _/___ / / / /_/ /____ / // / ___/_
/__/ /__//__/______/______/ /__/ /______/ /____/__//__/______/
[log in to unmask], [log in to unmask] http://elsa.dmu.ac.uk/~pt
--
Chris Rusbridge
Programme Director, Electronic Libraries Programme
The Library, University of Warwick, Coventry CV4 7AL, UK
Phone 01203 524979 Fax 01203 524981
Email [log in to unmask]