OCR with Tesseract, Amazon Textract, and Google Document AI: a benchmarking experiment (pdf)

Journal of Computational Social Science
https://doi.org/10.1007/s42001-021-00149-1

RESEARCH ARTICLE

OCR with Tesseract, Amazon Textract, and Google Document AI: a benchmarking experiment

Thomas Hegghammer¹

Received: 23 June 2021 / Accepted: 6 October 2021
© The Author(s) 2021

Abstract

Optical Character Recognition (OCR) can open up understudied historical documents to computational analysis, but the accuracy of OCR software varies. This article reports a benchmarking experiment comparing the performance of Tesseract, Amazon Textract, and Google Document AI on images of English and Arabic text. English-language book scans (n = 322) and Arabic-language article scans (n = 100) were replicated 43 times with different types of artificial noise for a corpus of 18,568 documents, generating 51,304 process requests. Document AI delivered the best results, and the server-based processors (Textract and Document AI) performed substantially better than Tesseract, especially on noisy documents. Accuracy for English was considerably higher than for Arabic. Specifying the relative performance of three leading OCR products and the differential effects of commonly found noise types can help scholars identify better OCR solutions for their research needs. The test materials have been preserved in the openly available “Noisy OCR Dataset” (NOD) for reuse in future benchmarking studies.

Keywords OCR · Cloud computing · Benchmarking

I am grateful to the three anonymous reviewers and to Neil Ketchley for valuable comments. I also thank participants in the University of Oslo Political Data Science seminar on 17 June 2021 for inputs and suggestions, as well as Eddie Antonio Santos for helping solve technical questions related to the ISRI/Ocreval tool. Supplementary information and replication materials are available at https://github.com/Hegghammer/noisy-ocr-benchmark.
* Thomas Hegghammer
¹ Norwegian Defence Research Establishment (FFI), Kjeller, Norway

Introduction

Few technologies hold as much promise for the social sciences and humanities as optical character recognition (OCR). Automated text extraction from digital images can open up large quantities of understudied historical documents to computational analysis, potentially generating deep new insights into the human past. But OCR is a technology still in the making, and available software provides varying levels of accuracy. The best results are usually obtained with a tailored solution involving corpus-specific pre-processing, model training, or postprocessing, but such procedures can be labour-intensive.¹ Pre-trained, general OCR processors have a much higher potential for wide adoption in the scholarly community, and hence their out-of-the-box performance is of scientific interest.

For a long time, general OCR processors such as Tesseract ([27, 38]) only delivered perfect results under what we may call laboratory conditions, i.e., on noise-free, single-column text in a clear printed font. This limited their utility for real-life historical documents, which often contain shading, blur, shine-through, stains, skewness, complex layouts, and other things that produce OCR error. Historically, general OCR processors have also struggled with non-Western languages ([16]), rendering them less useful for the many scholars working on documents in such languages.

In the past decade, advances in machine learning have led to substantial improvements in standalone OCR processor performance. Moreover, the past 2 years have seen the arrival of server-based processors such as Amazon Textract and Google Document AI, which offer document processing via an application programming interface (API) ([43]).
Media and blog coverage indicates that these processors deliver strong out-of-the-box performance,² but those tests usually involve a small number of documents. Academic benchmarking studies exist ([37, 41]), but they predate the server-based processors. To find out how they compare, I conducted a benchmarking experiment comparing the performance of Tesseract, Textract, and Document AI on English and Arabic page scans. The objective was to generate statistically meaningful measurements of the accuracy of a selection of general OCR processors on document types commonly encountered in social scientific and humanities research. The exercise yielded specifications for the relative performance of three leading OCR products as well as the differential effects of commonly found noise types. The findings can help scholars identify better OCR solutions for their research needs. The test materials, which have been preserved in the openly available “Noisy OCR Dataset” (NOD), can be used in future research.

¹ For pre-processing, see, e.g., [3, 7, 13, 19, 42], and [44]. For model training, see, e.g., [4, 29, 33], and [45]. For postprocessing, see, e.g., [17, 35], and [39].

² See, for example, Ted Han and Amanda Hickman, “Our Search for the Best OCR Tool, and What We Found,” OpenNews, February 19, 2019 (https://source.opennews.org/articles/so-many-ocr-options/); Fabian Gringel, “Comparison of OCR tools: how to choose the best tool for your project,” Medium.com, January 20, 2020 (https://medium.com/dida-machine-learning/comparison-of-ocr-tools-how-to-choose-the-best-tool-for-your-project-bd21fb9dce6b); Manoj Kukreja, “Compare Amazon Textract with Tesseract OCR—OCR & NLP Use Case,” TowardsDataScience.com, September 17, 2020 (https://towardsdatascience.com/compare-amazon-textract-with-tesseract-ocr-ocr-nlp-use-case-43ad7cd48748); Cem Dilmegani, “Best OCR by Text Extraction Accuracy in 2021,” AIMultiple.com, June 6, 2021 (https://research.aimultiple.com/ocr-accuracy/).

Table 1 Features of Tesseract, Textract, and Document AI

Name         Maintainer             Installation  Architecture  Languages  Cost
Tesseract    Tesseract OCR Project  Local         LSTM          116        Free
Textract     Amazon Web Services    Server-based  Undisclosed   6          $1.50 per 1000 pages
Document AI  Google Cloud Services  Server-based  Undisclosed   60+        $1.50 per 1000 pages

Design

The experiment involved taking two document collections of 322 English-language and 100 Arabic-language page scans, replicating them 43 times with different types of artificially generated noise, processing the full corpus of ~18,500 documents in each OCR engine, and measuring the accuracy against ground truth using the Information Science Research Institute (ISRI) tool.

Processors

I chose Tesseract, Textract, and Document AI on the basis of their wide use, reputation for accuracy, and availability for programmatic use. Budget constraints prevented the inclusion of additional reputable processors such as Adobe PDF Services and ABBYY Cloud OCR, but these can be tested in the future using the same procedure and test materials.3 A full description of these processors is beyond the scope of this article, but Table 1 summarizes their main user-related features.4 All the processors ar (...truncated)
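As a rough check on the corpus figures quoted in the abstract and the Design section, the short Python sketch below reproduces the arithmetic. It rests on two assumptions that are not spelled out in the text above: that each page exists once in its original form in addition to the 43 noisy replications, and that the English pages were sent to all three engines while the Arabic pages were sent to only two. All variable and function names are illustrative.

```python
# Figures taken from the article.
ENGLISH_PAGES = 322
ARABIC_PAGES = 100
NOISY_REPLICATIONS = 43

# Assumption: the corpus holds the original scan plus one copy per noisy replication.
VERSIONS_PER_PAGE = NOISY_REPLICATIONS + 1

# Assumption: English is processed by three engines, Arabic by two.
ENGINES_FOR_ENGLISH = 3
ENGINES_FOR_ARABIC = 2


def corpus_size(pages: int, versions: int = VERSIONS_PER_PAGE) -> int:
    """Return the number of documents produced from a set of source pages."""
    return pages * versions


english_docs = corpus_size(ENGLISH_PAGES)
arabic_docs = corpus_size(ARABIC_PAGES)

total_docs = english_docs + arabic_docs
total_requests = (
    english_docs * ENGINES_FOR_ENGLISH + arabic_docs * ENGINES_FOR_ARABIC
)

print(f"English documents: {english_docs}")
print(f"Arabic documents:  {arabic_docs}")
print(f"Corpus size:       {total_docs}")
print(f"Process requests:  {total_requests}")
```

Under these assumptions the totals match the 18,568 documents and 51,304 process requests reported in the abstract, which also suggests why the request count is lower than three times the corpus size.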