OCR with Tesseract, Amazon Textract, and Google Document AI: a benchmarking experiment
Journal of Computational Social Science
https://doi.org/10.1007/s42001-021-00149-1
RESEARCH ARTICLE
Thomas Hegghammer1
Received: 23 June 2021 / Accepted: 6 October 2021
© The Author(s) 2021
Abstract
Optical Character Recognition (OCR) can open up understudied historical documents to computational analysis, but the accuracy of OCR software varies. This
article reports a benchmarking experiment comparing the performance of Tesseract, Amazon Textract, and Google Document AI on images of English and Arabic text. English-language book scans (n = 322) and Arabic-language article scans
(n = 100) were replicated 43 times with different types of artificial noise for a corpus of 18,568 documents, generating 51,304 process requests. Document AI delivered the best results, and the server-based processors (Textract and Document AI)
performed substantially better than Tesseract, especially on noisy documents. Accuracy for English was considerably higher than for Arabic. Specifying the relative
performance of three leading OCR products and the differential effects of commonly
found noise types can help scholars identify better OCR solutions for their research
needs. The test materials have been preserved in the openly available “Noisy OCR
Dataset” (NOD) for reuse in future benchmarking studies.
Keywords OCR · Cloud computing · Benchmarking
I am grateful to the three anonymous reviewers and to Neil Ketchley for valuable comments. I also
thank participants in the University of Oslo Political Data Science seminar on 17 June 2021 for
inputs and suggestions, as well as Eddie Antonio Santos for helping solve technical questions related
to the ISRI/Ocreval tool. Supplementary information and replication materials are available at
https://github.com/Hegghammer/noisy-ocr-benchmark.
* Thomas Hegghammer
1 Norwegian Defence Research Establishment (FFI), Kjeller, Norway
Introduction
Few technologies hold as much promise for the social sciences and humanities as
optical character recognition (OCR). Automated text extraction from digital images
can open up large quantities of understudied historical documents to computational
analysis, potentially generating deep new insights into the human past.
But OCR is a technology still in the making, and available software provides varying levels of accuracy. The best results are usually obtained with a tailored solution involving corpus-specific pre-processing, model training, or postprocessing, but
such procedures can be labour-intensive.1 Pre-trained, general OCR processors have
a much higher potential for wide adoption in the scholarly community, and hence
their out-of-the-box performance is of scientific interest.
For a long time, general OCR processors such as Tesseract ([27, 38]) only delivered perfect results under what we may call laboratory conditions, i.e., on noise-free, single-column text in a clear printed font. This limited their utility for real-life historical documents, which often contain shading, blur, shine-through, stains, skewness,
complex layouts, and other features that produce OCR errors. Historically, general
OCR processors have also struggled with non-Western languages ([16]), rendering
them less useful for the many scholars working on documents in such languages.
In the past decade, advances in machine learning have led to substantial improvements in standalone OCR processor performance. Moreover, the past 2 years have
seen the arrival of server-based processors such as Amazon Textract and Google
Document AI, which offer document processing via an application programming interface (API) ([43]). Media and blog coverage indicates that these processors deliver
strong out-of-the-box performance2, but those tests usually involve a small number
of documents. Academic benchmarking studies exist ([37, 41]), but they predate the
server-based processors.
To fill this gap, I conducted a benchmarking experiment comparing the performance
of Tesseract, Textract, and Document AI on English and Arabic page scans. The
objective was to generate statistically meaningful measurements of the accuracy of
a selection of general OCR processors on document types commonly encountered in
social scientific and humanities research.
The exercise yielded specifications for the relative performance of three leading
OCR products as well as the differential effects of commonly found noise types. The
findings can help scholars identify better OCR solutions for their research needs.
The test materials, which have been preserved in the openly available “Noisy OCR
Dataset” (NOD), can be used in future research.
1 For pre-processing, see, e.g., [3, 7, 13, 19, 42], and [44]. For model training, see, e.g., [4, 29, 33], and [45]. For postprocessing, see, e.g., [17, 35], and [39].
2 See, for example, Ted Han and Amanda Hickman, “Our Search for the Best OCR Tool, and What We Found,” OpenNews, February 19, 2019 (https://source.opennews.org/articles/so-many-ocr-options/); Fabian Gringel, “Comparison of OCR tools: how to choose the best tool for your project,” Medium.com, January 20, 2020 (https://medium.com/dida-machine-learning/comparison-of-ocr-tools-how-to-choose-the-best-tool-for-your-project-bd21fb9dce6b); Manoj Kukreja, “Compare Amazon Textract with Tesseract OCR—OCR & NLP Use Case,” TowardsDataScience.com, September 17, 2020 (https://towardsdatascience.com/compare-amazon-textract-with-tesseract-ocr-ocr-nlp-use-case-43ad7cd48748); Cem Dilmegani, “Best OCR by Text Extraction Accuracy in 2021,” AIMultiple.com, June 6, 2021 (https://research.aimultiple.com/ocr-accuracy/).
Table 1 Features of Tesseract, Textract, and Document AI
Name         Maintainer             Installation  Architecture  Languages  Cost
Tesseract    Tesseract OCR Project  Local         LSTM          116        Free
Textract     Amazon Web Services    Server-based  Undisclosed   6          $1.50 per 1000 pages
Document AI  Google Cloud Services  Server-based  Undisclosed   60+        $1.50 per 1000 pages
Design
The experiment involved taking two document collections of 322 English-language
and 100 Arabic-language page scans, replicating them 43 times with different types
of artificially generated noise, processing the full corpus of ~18,500 documents in
each OCR engine, and measuring the accuracy against ground truth using the Information Science Research Institute (ISRI) tool.
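To make the idea of measuring accuracy against ground truth concrete, the following Python sketch scores an OCR output by word-level edit distance. It is an illustration only and does not reproduce the ISRI tool's algorithm. The function names (edit_distance, word_accuracy) and the sample strings are hypothetical and do not come from the study.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # word missing from the OCR output
                curr[j - 1] + 1,     # spurious word in the OCR output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_accuracy(ground_truth, ocr_output):
    """Share of ground-truth words recovered, floored at zero.

    Assumption: words are whitespace-delimited tokens compared exactly.
    A real evaluation tool may normalise case, punctuation, or spacing.
    """
    ref = ground_truth.split()
    hyp = ocr_output.split()
    if not ref:
        raise ValueError("ground truth must contain at least one word")
    errors = edit_distance(ref, hyp)
    return max(0.0, 1.0 - errors / len(ref))


if __name__ == "__main__":
    truth = "the quick brown fox jumps"  # hypothetical ground truth
    ocr = "the quiek brown fox"          # hypothetical noisy OCR output
    print(f"word accuracy: {word_accuracy(truth, ocr):.2f}")
```

In this toy case one word is misread and one is dropped, so two of the five ground-truth words count as errors. Scoring whole words rather than characters penalises a single wrong letter as heavily as a missing word, which is a design choice that suits downstream text analysis but makes the measure stricter than a character-level score.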
Processors
I chose Tesseract, Textract, and Document AI on the basis of their wide use, reputation for accuracy, and availability for programmatic use. Budget constraints prevented the inclusion of additional reputable processors such as Adobe PDF Services
and ABBYY Cloud OCR, but these can be tested in the future using the same procedure and test materials.3
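Because budget shaped the choice of processors, it may help to see how the size of the experiment translates into request counts and rough costs. The Python sketch below rebuilds the corpus size from the figures given in the abstract and applies the list prices in Table 1. It rests on several assumptions that are not stated in this section: that the 43 noisy replications sit alongside the original scans (44 versions in total), that Textract processed only the English scans, that one page scan equals one request, and that no free tiers or volume discounts apply. All names in the code are hypothetical.

```python
ENGLISH_SCANS = 322
ARABIC_SCANS = 100
NOISY_REPLICATIONS = 43  # noisy copies made of each original scan

# List prices from Table 1, in US dollars per 1000 pages.
PRICE_PER_1000_PAGES = {"Tesseract": 0.0, "Textract": 1.50, "Document AI": 1.50}

# Assumption: Textract handled English only; the other two handled both languages.
LANGUAGES_PROCESSED = {
    "Tesseract": ("en", "ar"),
    "Textract": ("en",),
    "Document AI": ("en", "ar"),
}


def corpus_sizes():
    """Documents per language, counting the originals plus the noisy copies."""
    versions = NOISY_REPLICATIONS + 1
    return {"en": ENGLISH_SCANS * versions, "ar": ARABIC_SCANS * versions}


def requests_per_processor():
    """Process requests per engine, assuming one request per page scan."""
    sizes = corpus_sizes()
    return {
        name: sum(sizes[lang] for lang in langs)
        for name, langs in LANGUAGES_PROCESSED.items()
    }


def list_price_cost(requests):
    """Rough cost in US dollars at list price, ignoring free tiers and discounts."""
    return {
        name: count * PRICE_PER_1000_PAGES[name] / 1000
        for name, count in requests.items()
    }


if __name__ == "__main__":
    requests = requests_per_processor()
    print("corpus size:", sum(corpus_sizes().values()))
    print("total requests:", sum(requests.values()))
    for name, cost in list_price_cost(requests).items():
        print(f"{name}: {requests[name]} requests, about ${cost:.2f}")
```

Under these assumptions the totals match the 18,568 documents and 51,304 process requests reported in the abstract. The cost figures are estimates at list price only, not amounts reported by the study.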
A full description of these processors is beyond the scope of this article, but
Table 1 summarizes their main user-related features.4 All the processors ar (...truncated)