Farsi document image recognition system using word layout signature (pdf)

Turkish Journal of Electrical Engineering & Computer Sciences
http://journals.tubitak.gov.tr/elektrik/
Research Article
Turk J Elec Eng & Comp Sci (2019) 27: 1477–1488
© TÜBİTAK
doi:10.3906/elk-1804-92

Farsi document image recognition system using word layout signature

Cem ERGÜN1,∗, Sajedeh NOROZPOUR2
1 Department of Computer Engineering, Faculty of Engineering, Eastern Mediterranean University, Famagusta, Northern Cyprus
2 Department of Mathematics, Faculty of Art and Sciences, Near East University, Nicosia, Northern Cyprus

Received: 12.04.2018 • Accepted/Published Online: 29.01.2019 • Final Version: 22.03.2019

Abstract: In this paper, a new representation of Farsi words is proposed to address the keyword spotting problem in Farsi document image retrieval. In this regard, we define a signature for each Farsi word based on the layout of the word's connected components. The signature is represented as boxes; then, by drawing vertical and horizontal lines, we construct a grid for each word to provide a new descriptor. One of the advantages of this method is that it can be used for both handwritten and machine-printed texts. Finally, to evaluate the performance of our system in comparison to other methods, a database that contains 19,582 printed Farsi words is examined, and after applying this approach, a recall rate of 98.1% and a precision rate of 94.3% are obtained.

Key words: Farsi document image retrieval, word spotting, word layout signature, optical character recognition

1. Introduction
Due to the growth of digital libraries and of paper documents in offices, organizing and managing them now takes a significant amount of time and energy. The problem is most acute when a specific document must be found among a huge volume of documents. To solve such difficulties, paper documents have to be scanned and archived, and methods are then needed to find the specific document that is required.
This process is called document image retrieval, which has been a hot topic in recent years. To search for a keyword in document images, the images are first converted by optical character recognition (OCR) from pictorial format to machine-readable text [1]; the target word is then sought in the text using traditional document retrieval methods. Although OCR is frequently used by researchers in this area, it has disadvantages that make it unsuitable for some retrieval cases. The most important are that converting huge numbers of documents is costly and that OCR is not sufficiently successful on low-quality texts and on documents with complicated layouts. Additionally, no robust OCR method is yet available for Farsi language scripts [2, 3]. To overcome these problems, researchers suggested another method for document image retrieval, called keyword spotting or, more simply, word spotting [4]. Historically, word spotting was first defined in the context of speech processing [5–7]; later, it was also developed in the context of document image processing for machine-printed texts [8–10]. In document image processing, a keyword spotting system gives a “yes” or “no” answer to the user’s query by spotting the keyword without performing any letter recognition [11, 12].

This work is licensed under a Creative Commons Attribution 4.0 International License.

In recent years, much research has been done in the field of keyword spotting in document images, mostly for the English language with Latin letters, and some work has been done on the Korean [13], Chinese [14], and Arabic [15–17] languages as well. So far, there are few papers related to keyword spotting in the Farsi language.
For example, in [18, 19], a system was presented for retrieving machine-printed Farsi words from document images. The main idea is font recognition: before searching, the font face and font size of the query word are corrected to match those of the document image. The similarity between the user’s query and the word images in the document is computed with the XNOR similarity measure. Then topological features of the image, such as the number of holes, the number of ascenders/descenders, and the number of dots, are used to improve the results. The method works at pixel resolution and is limited to the fonts used in training. This means that it cannot be extended to more font faces, and it also needs an extra step to recognize the font size, which imposes a heavy computational load on the system. In [20], a new way of coding and retrieving Farsi document images was shown, using Farsi topological features such as the number of dots, the number of subwords, and the number of holes. The work in [20] also contains a way to detect fonts in Farsi texts, which is based on tiny connected components. In another paper, Ebrahimi and Kabir [21] presented a method based on the whole shape of words and subwords. Here, principal component analysis (PCA) is used to compress the feature vectors. Then k-means is used to cluster the subwords, and the average of each cluster is placed in a pictorial dictionary. Furthermore, an interesting method for retrieval of Farsi document images, which is independent of recognition, was introduced in [22]. Here, the upper contours of words are extracted and a picture dictionary of these features is built; each subword is represented as a combination of contour strokes that includes upper, lower, and middle positions of the baseline.
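The XNOR similarity measure mentioned above for [18, 19] can be illustrated with a minimal sketch. It assumes that the query and the candidate word image are already binarized and scaled to the same size, and that the score is the fraction of pixel positions on which the two images agree. The bitmap representation, the function name `xnor_similarity`, and the normalization by pixel count are assumptions made for illustration; the cited works may define the measure differently.

```python
from typing import Sequence

# Hypothetical representation: a bitmap is a list of rows of 0/1 pixels (1 = ink).
Bitmap = Sequence[Sequence[int]]


def xnor_similarity(query: Bitmap, candidate: Bitmap) -> float:
    """Return the fraction of pixel positions where the two bitmaps agree.

    For 0/1 pixels, XNOR is 1 exactly when the two bits are equal, so the
    score is (number of equal pixels) / (total number of pixels).
    """
    same_shape = len(query) == len(candidate) and all(
        len(q_row) == len(c_row) for q_row, c_row in zip(query, candidate)
    )
    if not same_shape:
        raise ValueError("bitmaps must have identical dimensions")

    total = sum(len(row) for row in query)
    if total == 0:
        raise ValueError("bitmaps must not be empty")

    agree = sum(
        1
        for q_row, c_row in zip(query, candidate)
        for q, c in zip(q_row, c_row)
        if q == c
    )
    return agree / total


# Toy 3x4 word image and a copy with one flipped pixel (row 1, column 2).
query = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]
noisy = [
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 1, 1, 0],
]

print(xnor_similarity(query, query))
print(xnor_similarity(query, noisy))
```

In this toy case, an identical image scores 1.0 and the image with one flipped pixel out of twelve scores 11/12. Because the comparison is pixel by pixel, the two images must share font face and size, which is consistent with the font-correction step described for [18, 19].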
As another example, the work proposed in [23] relies on the shape features of printed words for the recognition of Arabic texts written in three different fonts, two of which are synthetic. Several features of the Arabic printed words, such as dots, directional segments, directional cavities, junctions and endpoints, connectors, inner word spaces, and descenders, are extracted and saved in a dictionary. The method published recently in [24] determines the ratio of the subword width to the subword height and confines the search range to this ratio. The ratio is calculated from the symbol positions on a pixel-by-pixel basis. The disadvantage of this method is the large number of subwords. As mentioned earlier, most of the studies on this topic were done for the English language, and we will use some of them in this paper. For instance, a method for retrieving English document images based on word shape coding was presented in [25]. In that method, the authors used topological features such as character holes, ascenders/descenders, and character reservoirs. The notable point of this method is that documents can be retrieved by word shape coding based on both a keyword query and a document image query. The a (...truncated)