1170 volumes: High-quality OCR of polytonic, or 'ancient', Greek

Posted: 24 Jun 2017 03:14 PM PDT

 [First posted in AWOL 13 December 2013, updated 24 June 2017]


Lace: Greek OCR 

Overview

This site catalogues the results of our 2012/13 campaign to produce high-quality OCR of polytonic, or 'ancient', Greek texts in an HPC environment. It comprises over 600 volumes from archive.org and from original scans. There are over 6 million pages of OCR output in total, including experimental and rejected results.
Results are presented in a hierarchical organization, beginning with the archive.org volume identifier. Each of these is associated with one or more 'runs', or attempts at OCRing that volume. A run has a date stamp and is associated with a classifier and an aggregate best b-score (roughly indicating the quality of the Greek output); a sketch of this hierarchy follows the list below. Each run produces several kinds of output:
  1. raw hOCR output: the data generated by our OCR process, usually with multiple copies for each page, rendered at a range of binarization thresholds
  2. selected hOCR output: a filtered version of the data in (1), with each page image represented by a single best output page
  3. blended hOCR output: the data in (2), with words replaced by the corresponding words from the raw output in (1) wherever the selected page does not yield a dictionary word but one of the raw pages does (sketched in code below)
  4. selected hOCR output spellchecked: the data in (3) processed through a weighted Levenshtein distance spellchecking algorithm that is meant to correct simple OCR errors (see the spellcheck sketch below)
  5. combined hOCR output: where archive.org provides OCR output for Latin script (not Greek), this final step pieces together the data in (4) with archive.org's output, preferring the latter where our output suggests that the text is Latin; if archive.org provides Greek output, this step is no different from (4)
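
To make the organization above concrete, here is a minimal sketch of the volume/run hierarchy in Python; the class and field names are illustrative assumptions, not the site's actual schema.

    from dataclasses import dataclass, field

    @dataclass
    class Run:
        # One OCR attempt at a volume.
        date_stamp: str        # when the run was made
        classifier: str        # classifier used for this run
        best_b_score: float    # aggregate best b-score (rough Greek quality)
        outputs: dict[str, str] = field(default_factory=dict)  # output kind -> hOCR path

    @dataclass
    class Volume:
        # One archive.org volume and all attempts at OCRing it.
        identifier: str        # archive.org volume identifier
        runs: list[Run] = field(default_factory=list)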
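
Step (3) is easiest to see at the word level. The following is a minimal sketch, assuming pages have already been aligned as lists of tokens; the function and argument names are hypothetical, not the project's code.

    def blend_page(selected, raw_candidates, dictionary):
        # Replace a non-dictionary word on the selected page with the
        # corresponding dictionary word from one of the raw pages, if any.
        blended = []
        for i, word in enumerate(selected):
            if word in dictionary:
                blended.append(word)
                continue
            replacement = next(
                (raw[i] for raw in raw_candidates
                 if i < len(raw) and raw[i] in dictionary),
                word,  # no raw page offers a dictionary word; keep as-is
            )
            blended.append(replacement)
        return blended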
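
Step (4) rests on a weighted Levenshtein distance: substitutions between visually confusable glyphs cost less than arbitrary ones, so likely OCR errors are corrected first. A minimal sketch, with an illustrative cost table and threshold rather than the project's tuned values:

    def weighted_levenshtein(a, b, sub_cost, default=1.0):
        # Edit distance where substitution cost depends on the character
        # pair; insertions and deletions cost 1.
        m, n = len(a), len(b)
        d = [[0.0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            d[i][0] = float(i)
        for j in range(1, n + 1):
            d[0][j] = float(j)
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                pair = (a[i - 1], b[j - 1])
                sub = 0.0 if pair[0] == pair[1] else sub_cost.get(pair, default)
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + sub)  # substitution
        return d[m][n]

    # Hypothetical costs: a dropped accent or a confusable glyph is a
    # cheaper edit than an arbitrary substitution.
    OCR_COSTS = {("α", "ά"): 0.2, ("ε", "έ"): 0.2, ("ο", "σ"): 0.5}

    def spellcheck(word, dictionary, max_dist=1.0):
        # Snap a word to its nearest dictionary entry if one is close enough.
        if word in dictionary or not dictionary:
            return word
        best = min(dictionary, key=lambda w: weighted_levenshtein(word, w, OCR_COSTS))
        return best if weighted_levenshtein(word, best, OCR_COSTS) <= max_dist else word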

Code

All code and classifiers for Rigaudon are posted in a GitHub repository. This holds the modified Gamera source code, ancillary Python scripts such as the spellcheck engine, and the bash scripts that coordinate the process in an HPC environment through Sun Grid Engine.
Details of its operation are outlined in a white paper.
Our July 2013 presentation at the London Digital Classicist seminar series is available online from the Institute of Classical Studies.

Context

This is a continuation of efforts begun through the Digging into Data Round I project Toward Dynamic Variorum Editions, in which, as the project white paper notes, we discovered both the tantalizing potential of Greek OCR and the poor results that OCR engines of the time produced when operating at scale.
To bootstrap that process, we adapted the most extensible and successful framework then available, the Gamera Greek OCR engine by Dalitz and Brandt. Using the AceNET HPC environment, we analyzed a sample of the Google Greek and Latin corpus with twenty classifiers composed by Canadian undergraduate students. From this, we produced a quantitative report on the efficacy of our modified OCR code.
On the basis of this work, we received a 2012/2013 Humanities Computing Grant from Compute Canada, making this large-scale processing possible. 

Support

This work has benefited from the support of:
  • NEH, JISC, SSHRC, through Digging into Data I
  • Compute Canada, which provided the use of a dedicated machine
  • ILC-CNR, Pisa, which facilitated meetings
  • Greg Crane, whose supportiveness is as unbounded as his enthusiasm