Pioneer dataset and automatic recognition of Urdu handwritten characters using a deep autoencoder and convolutional neural network (pdf)

https://link.springer.com/content/pdf/10.1007%2Fs42452-019-1914-1.pdf

Research Article

Hazrat Ali1 · Ahsan Ullah1 · Talha Iqbal2 · Shahid Khattak1

Received: 24 August 2019 / Accepted: 13 December 2019 / Published online: 3 January 2020
© Springer Nature Switzerland AG 2020

Abstract

Automatic recognition of Urdu handwritten digits and characters is a challenging task. It has applications in postal address reading, bank cheque processing, and the digitization and preservation of old handwritten manuscripts. While significant work exists on automatic recognition of handwritten characters in English and other major languages of the world, the work done for the Urdu language remains very limited. This paper has two goals. Firstly, we introduce a pioneer dataset of handwritten Urdu digits and characters, containing samples from more than 900 individuals. Secondly, we report results for automatic recognition of handwritten digits and characters using a deep autoencoder network and a convolutional neural network. More specifically, we use two-layer and three-layer configurations of the deep autoencoder network and the convolutional neural network, and evaluate the two frameworks in terms of recognition accuracy. The proposed deep autoencoder framework recognizes digits and characters with an accuracy of 97% for digits only, 81% for characters only and 82% for both digits and characters simultaneously. In comparison, the convolutional neural network framework achieves an accuracy of 96.7% for digits only, 86.5% for characters only and 82.7% for both digits and characters simultaneously. These frameworks can serve as baselines for future research on Urdu handwritten text.
Keywords Autoencoder · Convolutional neural network · Urdu · Text recognition

* Hazrat Ali; Ahsan Ullah; Talha Iqbal; Shahid Khattak | 1Department of Electrical and Computer Engineering, COMSATS University Islamabad, Abbottabad Campus, Abbottabad, Pakistan. 2Lambe Institute of Translational Research, National University of Ireland, Galway, Ireland.

SN Applied Sciences (2020) 2:152 | https://doi.org/10.1007/s42452-019-1914-1

1 Introduction

Handwritten text recognition is an interesting task because of its many applications, such as converting handwritten documents into a digital format, reading house numbers automatically, postal address reading and robotics [1–5]. Unlike typical text in a single font, handwritten text is challenging to recognize because writing styles vary from person to person. The Urdu language carries great importance as one of the largest languages of the world and the national language of Pakistan. Urdu text shares similarities with Arabic and Persian text.

This work presents a framework for automatic recognition of Urdu handwritten letters. The task is less explored for Urdu. One primary reason is that no dataset has been available for Urdu handwritten text. To address this, we introduce a new dataset of Urdu handwritten digits and characters. The motivation is that no standard dataset of Urdu handwritten text exists that could serve as a baseline for research work. Urdu is the first language of more than 60 million people (and of more than 329 million people if combined with Hindi, as the two languages are largely the same in spoken form). Unfortunately, there seems to be very little or no work on Urdu language processing, mainly due to the unavailability of language resources.
Besides, a standard dataset would help the research community because, unlike English and many other languages, Urdu text is more challenging to recognize due to the presence of diacritics. Similar (but not the same) diacritics are found in the Arabic and Persian languages, and thus any research development on Urdu text recognition would eventually ease progress in handwritten text recognition for many more languages.

While the UCOM dataset [6] has been reported for Urdu text, several differences exist between the UCOM dataset and our dataset. Firstly, the UCOM offline dataset has been developed for continuous Urdu text, whereas our dataset is for isolated characters of Urdu handwritten text. Secondly, the UCOM dataset, as described by the authors in [6], contains 600 pages of Urdu text and the number of individuals who have written the text is limited to 100, while our dataset contains text from 900 individuals. Thirdly, the UCOM dataset contains text in Nasta’liq style only, while our dataset contains handwritten samples in different styles and variations, thus covering a more diverse range of writing (font) styles.

Deep learning (a sub-branch of machine learning) algorithms have been popular for automatic recognition of digits and characters of different languages. Deep networks can be trained in a supervised fashion, which requires labels, or in an unsupervised way, which does not [7–9]. In this work, we use an autoencoder network and a convolutional neural network (CNN) trained with 85% of the dataset and tested with the remaining 15% of the data. Moreover, these models are evaluated in configurations with two hidden layers and with three hidden layers.

The rest of the paper is organized as follows. Section 2 provides a literature review of existing work on Urdu text recognition. In Sect. 3, we describe the dataset developed, the source of the data, and the pre-processing and segmentation steps.
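The 85%/15% division of the data into training and test portions mentioned in the introduction can be illustrated with a minimal Python sketch. The function name `split_indices`, the fixed seed and the use of a uniform random shuffle are assumptions made for illustration only; the text does not state how samples were assigned to each portion (for example, whether the split was random or stratified by class).

```python
import random


def split_indices(n_samples, train_fraction=0.85, seed=0):
    """Shuffle sample indices and split them into train and test lists.

    Hypothetical helper: a plain random split is assumed here, which
    may differ from the procedure actually used for the dataset.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    indices = list(range(n_samples))
    # A fixed seed makes the split reproducible across runs.
    random.Random(seed).shuffle(indices)
    n_train = int(round(n_samples * train_fraction))
    return indices[:n_train], indices[n_train:]


# Example with a hypothetical dataset of 1000 samples.
train_idx, test_idx = split_indices(1000)
print(len(train_idx), len(test_idx))
```

Splitting indices rather than the images themselves keeps the sketch independent of how the samples are stored; the same index lists can then be used to select both the images and their labels.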
We describe the use of a deep autoencoder network and CNN in Sect. 4. Results are presented in Sect. 5 and, finally, the paper is concluded in Sect. 6.

2 Literature review

For character recognition, machine learning techniques such as deep neural networks and CNNs have been used. Arnold et al. used neural networks for character recognition [10]. Similarly, in [11, 12], a CNN has been used for Chinese character recognition. A stacked denoising autoencoder has been used in [13] for offline Urdu character recognition. However, the work in [13] is limited to optical character recognition of Nastaliq fonts only. Hussain et al. proposed an offline OCR system that recognizes only eight Arabic handwritten characters, with an accuracy rate of 77.25% [14]. The framework proposed by Elenwar et al. [15] used an Arabic character database containing 1814 characters for training and 435 characters for testing. The database used in [16] was prepared by only four writers, leading to low generalization. A database for Arabic characters is presented in [17], in which the authors performed preprocessing steps to avoid noise in the printed database. Another database for Arabic characters consists of 28 thousand characters written by 100 different writers [18]. A similar work has bee (...truncated)