August 8, 2015
Comments to [log in to unmask]
Federico Boschetti, CNR, Pisa
Gregory Crane, Leipzig/Tufts
Matt Munson, Leipzig/Tufts
Bruce Robertson, Mount Allison
Nick White, Durham (UK) (and Tufts during 2014)
A first stab at producing OCR-generated Greek and Latin for the complete Patrologia Graeca (PG) is now available on GitHub at https://github.com/OGL-PatrologiaGraecaDev. This release provides raw textual data that will be of service to developers and others with programming expertise and an interest in Ancient Greek and Latin. The Patrologia Graeca contains as much as 50 million words of Ancient Greek produced over more than 1,000 years, along with an even larger amount of scholarship and accompanying translations in Latin.
Matt Munson started a new organization for this data because it is simply too large to put into the existing OGL organization. Each volume can contain 250MB or more of .txt and .hocr files, so it is impossible to put everything in one repository or even several dozen repositories. So he decided to create a new organization where all the OCR results for each volume would be contained within its own repository. This will also allow us to add more OCR data as necessary (e.g., from Bruce Robertson of Mount Allison University, or from nidaba, our own OCR pipeline) at the volume level.
The repositories are being created and populated automatically by a Python script, so if you notice any problems or strange happenings, please let us know, either by opening an issue on the individual volume repository or by sending us an email.
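To give a concrete (purely illustrative) picture of the workflow, a repository-per-volume setup like this can be driven through the GitHub API. The sketch below is not the actual script; the repository naming scheme, the volume count, and the token are placeholders.

    # Hypothetical sketch of creating one repository per PG volume in the
    # OGL-PatrologiaGraecaDev organization via the GitHub API.
    # Repository names, the volume count, and the token are placeholders.
    import requests

    TOKEN = "..."                      # a GitHub token with repo-creation rights
    ORG = "OGL-PatrologiaGraecaDev"

    for volume in range(1, 162):       # the PG runs to 161 volumes
        name = "pg_volume_%03d" % volume
        response = requests.post(
            "https://api.github.com/orgs/%s/repos" % ORG,
            json={"name": name,
                  "description": "OCR results for PG volume %d" % volume},
            headers={"Authorization": "token " + TOKEN},
        )
        response.raise_for_status()    # fail loudly if a repository was not created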
This is our first attempt at pushing this data out. Please let us know what you think.
Available data includes:
Greek and Latin text generated by two open source OCR engines, OCRopus (https://github.com/tmbdev/ocropy) and Tesseract (https://github.com/tesseract-ocr). For work done optimizing OCRopus, see http://heml.mta.ca/lace. For work done optimizing Tesseract, see http://ancientgreekocr.org/. The output format for both engines is hOCR (https://en.wikipedia.org/wiki/HOCR), a format that contains links to the coordinates on the original page image from which the OCR was generated.
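For readers who have not worked with hOCR before, the following sketch shows one way to pull the transcribed words and their page coordinates out of an hOCR file with Python and lxml. The class name ocrx_word and the file name are assumptions (some OCRopus output marks only lines, with class ocr_line), so check the actual files.

    # Sketch: extract words and their bounding boxes from an hOCR file.
    # Assumes word-level spans with class="ocrx_word" and a title attribute
    # of the form "bbox x0 y0 x1 y1 ..."; verify against the files themselves.
    from lxml import html

    def hocr_words(path):
        tree = html.parse(path)
        for span in tree.xpath('//*[@class="ocrx_word"]'):
            bbox = None
            for field in span.get("title", "").split(";"):
                field = field.strip()
                if field.startswith("bbox "):
                    bbox = [int(v) for v in field.split()[1:5]]
            yield span.text_content().strip(), bbox

    # Hypothetical file name, for illustration only.
    for word, bbox in hocr_words("pg_volume_001_page_0001.html"):
        print(word, bbox)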
OCR results for as many scans of each volume of the Patrologia Graeca as we could find in the HathiTrust. We discovered that the same OCR engine applied to scans of different copies of the same book would generate different errors (even when the scans seemed identical to most human observers). This means that if OCR applied to copy X incorrectly analyzed a particular word, there was a good chance that the same word would be correctly analyzed when the OCR engine was applied to copy Y. A preliminary study of this phenomenon is available here: http://tinyurl.com/ppyfdfj. In most cases, the OCRopus/Lace OCR contains results for four different scanned copies, while the Tesseract/AncientGreekOCR output contains results for up to 10 different copies. All of the Patrologia Graeca volumes are old enough that HathiTrust members in Europe and North America can download the PDFs for further analysis. Anyone should be able to see the individual pages used for OCR via the public HathiTrust interface.
Initial page-level metadata for the various authors and works in the PG, derived from the core index at columns 13-114 of Cavallera’s 1912 index to the PG (which Roger Pearse cites at http://www.roger-pearse.com/weblog/patrologia-graeca-pg-pdfs/). A working TEI XML transcription, which has begun capturing the data within the print source, is available for inspection at https://www.dropbox.com/s/mldhu4okpq4i7r8/pg_index2.xml. All figures are preliminary and subject to modification (that is one motivation for posting this call for help), but we do not expect that they will change much at this point. At present, we have identified 658 authors and 4,287 works. The PG contains extensive introductions, essays, indices, etc., and we have tried to separate these out by scanning for keywords (e.g., praefatio, monitum, notitia, index). We estimate that there are 204,129 columns of source text and 21,369 columns of secondary sources, representing roughly 90% and 10% respectively. Since a column in Migne contains about 500 words and since the Greek texts (almost) always have accompanying Latin translations, the PG contains up to 50 million words of Greek text; many authors also have extensive Latin notes, and some have no Greek text at all, so there should be even more Latin. For more information, look here: http://tinyurl.com/ppyfdfj.
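To make the arithmetic behind those word-count estimates explicit (the assumption that Greek text and its Latin translation split the source columns roughly evenly is ours):

    # Back-of-the-envelope check of the estimate above.
    source_columns = 204129                           # estimated columns of source text
    words_per_column = 500                            # rough length of a Migne column
    total_words = source_columns * words_per_column   # about 102 million words
    greek_words = total_words // 2                    # Greek and Latin split roughly evenly
    print(total_words, greek_words)                   # 102,064,500 and ~51 million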
Next Steps
Developing high-recall searching by combining the results for each scanned page of the PG. This entails several steps. First, we need to align the OCR pages with each other -- page 611 in one scan of a volume may correspond to page 605 in another scan, depending upon how the front matter is treated and upon pages that one scan may have missed. Second, we need to create an index of all forms in the OCR-generated text available for each page in each PG volume. Since one of the two OCR engines applied to multiple scans of the same page is likely to produce a correct transcription, a unified index for all the text for all the scans of a page will capture a very high percentage of the words on that page.
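A minimal sketch of such a unified index, assuming the page alignment has already been done (aligned_pages() below is a placeholder for that step), might look like this:

    # Sketch: build a high-recall index mapping each word form to the set of
    # (volume, logical page, scan) locations where any OCR run produced it.
    # aligned_pages() is a placeholder for the alignment step described above;
    # it should yield (volume, page, scan, words) tuples.
    from collections import defaultdict

    index = defaultdict(set)

    for volume, page, scan, words in aligned_pages():
        for form in words:
            index[form].add((volume, page, scan))

    def lookup(form):
        # Every page on which at least one scan of one OCR run saw this form.
        return sorted(index.get(form, set()))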
Running various forms of text mining and analysis over the PG. Many text mining and analysis techniques work by counting frequently repeated features. Such techniques can be relatively insensitive to error rates in the OCR (i.e., you get essentially the same results whether your texts are 96% accurate or 99.99% accurate). Many methods for topic modelling and stylistic analysis should produce immediately useful results.
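As one illustration of the kind of analysis that tolerates residual OCR error, a topic model can be fit over per-page (or per-work) bags of words with an off-the-shelf library such as gensim; the tokenization and the ocr_pages variable below are placeholders, not part of our pipeline.

    # Sketch: fit a topic model over OCR-generated text with gensim.
    # ocr_pages stands in for whatever unit of text is chosen (page, column, work).
    from gensim import corpora, models

    documents = [page.split() for page in ocr_pages]    # naive whitespace tokenization
    dictionary = corpora.Dictionary(documents)
    corpus = [dictionary.doc2bow(doc) for doc in documents]

    lda = models.LdaModel(corpus, num_topics=20, id2word=dictionary)
    for topic in lda.print_topics(num_topics=20, num_words=10):
        print(topic)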
Using the multiple scans to identify and correct errors and to create a single optimized transcription. In most cases, bad OCR produces nonsense forms that are not legal Greek or Latin. When one OCR run has a valid Greek or Latin word and the others do not, that valid word is usually correct. Where two different scans produce valid but different Greek or Latin words (e.g., the common confusion of eum and cum), we can use the hOCR feature that allows us to include multiple possibilities. We can do quite a bit by encoding the confidence that we have in the accuracy of each transcribed word.
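A minimal sketch of that selection step, assuming a set of known valid Greek and Latin forms is available (valid_forms below stands in for such a wordlist or morphological lookup):

    # Sketch: choose a transcription for one word slot from several OCR runs.
    from collections import Counter

    def choose(candidates, valid_forms):
        """candidates: the strings the different OCR runs produced for one word."""
        valid = [c for c in candidates if c in valid_forms]
        if len(set(valid)) == 1:
            return valid[0], []                  # a single valid reading: take it
        if valid:
            best, _ = Counter(valid).most_common(1)[0]
            # Several valid readings (e.g., eum vs. cum): keep the others as
            # hOCR alternatives rather than discarding them.
            return best, sorted(set(valid) - {best})
        # No valid reading at all: record every candidate for later review.
        return None, sorted(set(candidates))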