medieval-religion: Scholarly discussions of medieval religion and culture
From: Chris Laning
>If you merely scan a page and save it as a PDF, the resulting PDF is only an
>image -- little dots on paper. If you want the actual *words* of the text to
>exist in the file, where they can be searched, you have to create them, either
>by typing them in manually (which is a lot of work) or by using
>character-recognition software, which is still very far from perfect and
>requires editing/proofreading afterward.
at severe risk of repeating myself yet again, let me point esteemed listers to
what i take to be the "latest" fruits of PDF technology, the volumes of the
_Inventaires sommaires_ of the various _fonds_ in the Archives départementales
d'Eure-et-Loir (Chartres)
http://www.archives28.fr/ec/index2.php
those seriously interested in the issue should actually take a few moments
[assuming a reasonably fast connection] and download a volume of these.
here's the link to the volume "Série G, t. 1. Archives ecclésiastiques,
évêchés, chapitres, séminaires. [29.3 Mb]"
http://www.archinoe.net/cg28/ir_visu_instrument.php?id=195
download that and you'll have a .pdf file, clearly "only an image" but a very,
very clear one (unlike, say, most all of those crappy ones you'll get off the
books.google or http://gallica.bnf.fr sites), with nice "bookmarks" to the
table of contents for easy navigation through the tome.
in addition to the high quality of the image(s), the *real* Hit here is the
fact that the images are fully searchable (as text) and can be marked and
pasted, as text, into other programs.
for instance, the Introduction, by the great archiviste Lucien Merlet, tells
us that
Le sixième volume de l'Inventaire sommaire des Archives départementales
d'Eure-et-Loir est entièrement consacré à l'inventaire des titres des
divers Chapitres du département. Parmi ceux-ci, il en est un qui domine tous
les autres, c'est celui de l'église cathédrale de Notre-Dame de Chartres.
i can see no errors in that bit of pasted text --but an OCRed product might
well have at least one.
the product is not perfect, however.
the bold font in which the dates of the individual liasses are given seems to
be a consistent problem, and there are occasional errors:
11.4*-1398. — Sauvegarde par le roi de France Louis VII pour le village de
Bazoches-les-Hautes, Basoche, appartenant a l'eveque de Chartres. -7
Transaction entre l'eveque de Chartres et les archidiacres, a propos des
mariages, des sacrileges, des biens des intestats et des
amendes —
that "7" before "Transaction" is an inexplicable intrusion not present in the
original, and there is a superscripted footnote marker ("1") and a period after
"amendes" which are clear in the original but missing from the pasted text.
font enhancements like bold and italics are kept and are visible when the text
is pasted into something like WORD --these are lost in the email program, of
course.
not perfect, but at least as good as i would expect to get with a good OCR
program.
searches work very well --but, again, not perfectly.
a search for the beauceron village of Berchères-l'Évésque (where the
quarries for the stone of the cathedral were) yields this interesting liasse:
G. 102. (Registre.) — In-folio, papier, 288 feuillets.
1515-1553. — Marché avec Gilles Merle, maître maçon à Chartres, pour les
réparations de la maison épiscopale de Berchères-l'Évésque.
note that the orthography of the original document is kept by Merlet, which
makes searching on the place a bit more "problematical."
but, the point here is that if one were interested in finding *all* of the
instances in which this village was mentioned in this inventory, such a search
would yield them with a 98% accuracy rate (or better).
before this technology such a task would have simply been impossible without
going through the whole thing, page by page (and there are 370 folio pages, in
this volume alone).
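
for anyone who pastes the text out of such a PDF and wants to search it outside the reader, here is a minimal Python sketch of an accent- and spelling-tolerant search. it is only an illustration under stated assumptions: the function names (fold, find_mentions), the pattern, and the third sample line are all hypothetical, and it says nothing about how the PDF reader's own search works. the idea is to strip accents before matching (pasted text sometimes loses them, as in the "eveque" example above) and to allow for the older "Évesque" spelling alongside "Évêque".

```python
import re
import unicodedata


def fold(text):
    """Lower-case the text and strip accents so spelling variants compare equal."""
    # NFD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Treat the typographic apostrophe like the plain one.
    return stripped.replace("\u2019", "'").lower()


# Hypothetical pattern: matches both "eveque" and the older "evesque".
PLACE = re.compile(r"bercheres-l'eves?que")


def find_mentions(lines):
    """Return the 1-based numbers of the lines that mention the village."""
    return [n for n, line in enumerate(lines, start=1) if PLACE.search(fold(line))]


# The first two lines are taken from the pasted text above; the third is made up.
sample = [
    "réparations de la maison épiscopale de Berchères-l'Évésque.",
    "appartenant a l'eveque de Chartres.",
    "carrières de Berchères-l'Évêque",
]

print(find_mentions(sample))  # [1, 3]
```

the second sample line is skipped because it mentions the bishop but not the village; the other two match despite their different spellings.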
so, imHo, .pdf has *arrived* --the only disadvantage to this format that i can
see is the "weight" of the files (e.g., 29.3mb for this one).
but, you gets what you pay for.
and, in this case, you gets *a lot*.
>(And it only works on typescript, not hand lettering.)
actually, a good OCR program *can* be "trained" to read manuscript, providing
only that the "lettering" is *consistent* [which most medieval scribal
products are].
the training feature is there to enable the software to recognize unusual
fonts or oddities like the [consistently] broken type which one sees
occasionally in older books, but basically any consistent "little dots on
paper" can be recognized, providing the user is possessed of near-Jobean
patience.
c