medieval-religion: Scholarly discussions of medieval religion and culture
From: Chris Laning <[log in to unmask]>
> Since it's impossible (as far as I know) for a search engine to
> extract sense from mere pixels, there must be some actual text around
> there somewhere.
it would seem so.
> (What's your basis, BTW, for concluding that the PDF files here are
> "just graphics"?)
just because they are not searchable.
> I have a vague recollection that one of the online historical-manuscript
> publishers (Project Gutenberg?) does something like this: what's displayed is
> indeed just graphics, but hidden behind the scenes is a parallel text version
> which is what the search engine actually works on.
yes, that would certainly work.
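to make the "parallel text" idea concrete, here is a minimal sketch in Python. everything in it is an assumption for illustration: the `Page` record, the file names and the sample text are hypothetical, and nothing here is known to be how any real site works. the point is only that the search runs on the hidden text while the reader is handed the image.

```python
from dataclasses import dataclass


@dataclass
class Page:
    image_file: str   # what the reader sees (hypothetical file name)
    hidden_text: str  # parallel transcription the reader never sees


def search(pages, query):
    """Return the image files of pages whose hidden text contains the query."""
    needle = query.casefold()
    return [p.image_file for p in pages if needle in p.hidden_text.casefold()]


# Hypothetical sample data, invented for this sketch.
pages = [
    Page("page_001.tiff", "Vita sancti Martini episcopi"),
    Page("page_002.tiff", "De miraculis sancti Benedicti"),
]

print(search(pages, "martini"))  # ['page_001.tiff']
```

the search can only be as good as the hidden text, which is why the next question matters.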
question is, how did they generate that "parallel text version" ?
OCR, presumably.
and that *might* work for modern printed books.
but i seriously, seriously doubt that the bnf.fr has actually OCRed all its
stuff on Gallica.
that task would just be next to impossible --it would take decades.
i've tried OCRing some of their stuff (downloading it in .tiff format and
importing it into FineReader) and the error rate --even on the "cleanest"
pages-- is very high, 10-20%.
it takes *hours* to correct the errors.
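for anyone who wants to put a number on "error rate", one common way (an assumption here, not necessarily how the figure above was arrived at) is to count the character edits needed to turn the OCR output into a hand-corrected transcription and divide by the length of the correct text. a rough Python sketch, with invented sample strings:

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character of a
                            curr[j - 1] + 1,      # insert a character of b
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def char_error_rate(truth, ocr):
    """Character edits per character of the corrected transcription."""
    if not truth:
        raise ValueError("the corrected transcription must not be empty")
    return edit_distance(truth, ocr) / len(truth)


# Invented example: one misread letter in a ten-character phrase.
print(char_error_rate("de sanctis", "de sanclis"))  # 0.1
```

at 10-20% that works out to roughly one wrong character in every five to ten, which is consistent with correction taking hours.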
and many of the pages on the site --esp. of older books-- are *filthy* with
stray marks (which are, of course, "read" by the OCR software).
they also have rather exotic typography, which just drives the OCR nuts.
no, they weren't OCRed --at least by any method i'm aware of.
though it's come a long, long way in the 20 years or so i've been using it
(started out on an old, state-o-the-art, $40,000 Kurzweil machine) i don't
see how OCR software will *ever* be able to handle old books.
the only other solution i can think of is that they've got folks on the
Gubbermint payroll who actually go through each document (page) and keystroke
all the keywords into an index, and it is *that* which is searched.
not quite as much time to do as a full OCRing, but still a damned slow
process, it seems to me.
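as a sketch of what such a hand-keyed index might look like (purely hypothetical: the page identifiers and keywords below are invented, and this is a guess at the general shape, not a description of any real system), each keyword would simply point at the pages it was keyed from:

```python
from collections import defaultdict


def build_index(keywords_by_page):
    """Map each keyword (case-insensitively) to the set of pages it was keyed from."""
    index = defaultdict(set)
    for page_id, keywords in keywords_by_page.items():
        for keyword in keywords:
            index[keyword.casefold()].add(page_id)
    return index


def lookup(index, term):
    """Return the sorted page ids for a keyword, or an empty list if it was never keyed."""
    return sorted(index.get(term.casefold(), set()))


# Hypothetical keywords typed in by hand for three page images.
keyed = {
    "vol1_p012": ["Cluny", "abbas", "Hugo"],
    "vol1_p013": ["Cluny", "monachi"],
    "vol2_p101": ["Hugo", "epistola"],
}

index = build_index(keyed)
print(lookup(index, "cluny"))  # ['vol1_p012', 'vol1_p013']
```

a search on such an index only ever finds the words somebody chose to key in, which is the trade-off against a full transcription.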
i'm open to other suggestions as to what's going on.
c