Open Patrologia Graeca 1.0


August 8, 2015

http://tinyurl.com/nlvhy9b

Comments to [log in to unmask]


Federico Boschetti, CNR, Pisa

Gregory Crane, Leipzig/Tufts

Matt Munson, Leipzig/Tufts

Bruce Robertson, Mount Allison

Nick White, Durham (UK) (and Tufts during 2014)


A first stab at producing OCR-generated Greek and Latin for the complete Patrologia Graeca (PG) is now available on GitHub at https://github.com/OGL-PatrologiaGraecaDev. This release provides raw textual data that will be of service to those with programming expertise and to developers with an interest in Ancient Greek and Latin. The Patrologia Graeca contains as many as 50 million words of Ancient Greek produced over more than 1,000 years, along with an even larger amount of scholarship and accompanying translations in Latin.


Matt Munson started a new organization for this data because it is simply too large to put into the existing OGL organization. Each volume can contain 250MB or more of .txt and .hocr files, so it is impossible to put everything in one repository or even several dozen repositories. He therefore created a new organization in which the OCR results for each volume are contained in their own repository. This will also allow us to add more OCR data as necessary (e.g., from Bruce Robertson, of Mt. Allison University, or from nidaba, our own OCR pipeline) at the volume level.


The repositories are being created and populated automatically by a Python script, so if you notice any problems or strange happenings, please let us know either by opening an issue on the individual volume repository or by sending us an email. This is our first attempt at pushing this data out. Please let us know what you think.


Available data includes:



Next Steps


  1. Developing high-recall searching by combining the results for each scanned page of the PG. This entails several steps. First, we need to align the OCR pages with each other -- page 611 in one scan of a volume may correspond to page 605 in another, depending upon how the front matter is treated and upon pages that one scan may have missed. Second, we need to create an index of all forms in the OCR-generated text available for each page in each PG volume. Since one of the two OCR engines applied to multiple scans of the same page is likely to produce a correct transcription, a unified index for all the text for all the scans of a page will capture a very high percentage of the words on that page.

  2. Running various forms of text mining and analysis over the PG. Many text mining and analysis techniques work by counting frequently repeated features. Such techniques can be relatively insensitive to error rates in the OCR (i.e., you get essentially the same results whether your texts are 96% accurate or 99.99% accurate). Many methods for topic modelling and stylistic analysis should produce immediately useful results.

  3. Using the multiple scans to identify and correct errors and to create a single optimized transcription. In most cases, bad OCR produces nonsense forms that are not legal Greek or Latin. When one OCR run has a valid Greek or Latin word and others do not, that valid word is usually correct. Where two different scans produce valid Greek or Latin words (e.g., the common confusion of eum and cum), we can use the hOCR feature that allows us to include multiple possibilities. We can do quite a bit by encoding the confidence that we have in the accuracy of each transcribed word.

  4. Providing a public error correction interface. One error correction interface already exists and has been used to correct millions of words of OCR-generated Greek, but two issues face us. First, we need to address the fact that we cannot ourselves serve page images from HathiTrust scans. HathiTrust members could use the system that we have by downloading the scans of the relevant volumes to their own servers, but that does not provide a general solution. Second, our correction environment deals with OCR for one particular scanned copy. Ideally, the correction environment would allow readers to draw upon the various different scans from different copies and different OCR engines.
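
The unified per-page index and the lexicon-based choice among competing readings described in the steps above can be illustrated with a minimal Python sketch. It is not the project's actual code. The run names, the tiny lexicon, and the function names are hypothetical, and the sketch assumes that pages have already been aligned across scans and, for choose_reading, that the candidate readings for a given word position have already been lined up.

```python
import re
import unicodedata
from collections import defaultdict

# Simplification: a token is a run of Unicode letters. Combining marks that
# survive NFC normalization would split a token; real Greek OCR output may
# need a more careful tokenizer.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Normalize to NFC and return the lowercased word forms in text."""
    normalized = unicodedata.normalize("NFC", text)
    return [token.lower() for token in TOKEN_RE.findall(normalized)]


def build_page_index(page_transcriptions):
    """Build a unified index for one page.

    page_transcriptions maps a (hypothetical) OCR run name, such as
    "scan_a/engine_1", to that run's text for the page. The result maps
    each word form to the set of runs in which it occurs, so a search
    succeeds if any run transcribed the word correctly.
    """
    index = defaultdict(set)
    for run_name, text in page_transcriptions.items():
        for form in tokenize(text):
            index[form].add(run_name)
    return dict(index)


def choose_reading(candidates, lexicon):
    """Pick among competing readings for one aligned word position.

    Returns the candidates found in the lexicon (a set of lowercased valid
    forms). One result means the position is resolved; several results
    (e.g. eum / cum) are alternatives to keep side by side. If no candidate
    is valid, all distinct candidates are returned unresolved.
    """
    valid = sorted({c for c in candidates if c.lower() in lexicon})
    return valid if valid else sorted(set(candidates))


if __name__ == "__main__":
    page = {
        "scan_a/engine_1": "In principio erat",
        "scan_b/engine_2": "In principio crat",
    }
    print(build_page_index(page))
    print(choose_reading(["erat", "crat"], {"erat", "eum", "cum"}))
    print(choose_reading(["eum", "cum"], {"erat", "eum", "cum"}))
```

A search over the unified index finds "erat" even though one run misread it, and choose_reading keeps both eum and cum when the lexicon alone cannot decide between them. Recording a confidence value for each alternative would be a natural extension.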