Pioneer dataset and automatic recognition of Urdu handwritten characters using a deep autoencoder and convolutional neural network
Research Article
Hazrat Ali1 · Ahsan Ullah1 · Talha Iqbal2 · Shahid Khattak1
Received: 24 August 2019 / Accepted: 13 December 2019 / Published online: 3 January 2020
© Springer Nature Switzerland AG 2020
Abstract
Automatic recognition of Urdu handwritten digits and characters is a challenging task. It has applications in postal
address reading, bank cheque processing, and the digitization and preservation of old handwritten manuscripts.
While significant work exists on automatic recognition of handwritten characters in English and other major
languages of the world, the work done for the Urdu language remains very limited. This paper has two goals. Firstly,
we introduce a pioneer dataset for handwritten digits and characters of Urdu, containing samples from more than 900
individuals. Secondly, we report results for automatic recognition of handwritten digits and characters achieved
using a deep autoencoder network and a convolutional neural network. More specifically, we use two-layer and
three-layer configurations of the deep autoencoder network and the convolutional neural network, and evaluate the two frameworks in terms of
recognition accuracy. The proposed deep autoencoder framework recognizes digits and characters
with an accuracy of 97% for digits only, 81% for characters only and 82% for both digits and characters simultaneously.
In comparison, the convolutional neural network framework achieves an accuracy of 96.7% for digits only, 86.5% for characters only and 82.7% for both digits and characters simultaneously. These frameworks can serve as baselines for future
research on Urdu handwritten text.
Keywords Autoencoder · Convolutional neural network · Urdu · Text recognition
1 Introduction
Handwritten text recognition is an interesting task due to
its many applications, such as converting handwritten documents into a digital format, reading house numbers automatically, postal address reading and robotics
[1–5]. Unlike typed text in a single font, handwritten
text is challenging to recognize because writing
styles vary from person to person.
The Urdu language is highly important as one
of the largest languages of the world and the national language of Pakistan. Urdu text shares similarities with Arabic
and Persian text. This work presents a framework for automatic recognition of Urdu handwritten letters. The task is
less explored for Urdu. One primary reason is that no dataset
has been available for Urdu handwritten text. To
address this, we introduce a new dataset of Urdu handwritten digits and characters. The motivation comes from
the fact that no standard dataset of Urdu handwritten text
exists that could serve as a baseline for research
work. Urdu is
the first language of more than 60 million people
(and more than 329 million people if combined with Hindi,
as the two languages are largely the same in spoken form).
* Hazrat Ali; Ahsan Ullah; Talha Iqbal; Shahid Khattak
1 Department of Electrical and Computer Engineering, COMSATS University Islamabad, Abbottabad Campus,
Abbottabad, Pakistan. 2 Lambe Institute of Translational Research, National University of Ireland, Galway, Ireland.
SN Applied Sciences (2020) 2:152 | https://doi.org/10.1007/s42452-019-1914-1
Unfortunately, there seems to be very little or no work on
Urdu language processing, mainly due to the unavailability of
language resources. Besides, a standard dataset would help
the research community because Urdu text recognition is more challenging than that of English and many
other languages due to the presence of diacritics. Similar (but not the
same) diacritics are found in Arabic and Persian,
and thus any research development on Urdu text recognition would eventually ease progress in research
on handwritten text recognition for many more languages.
While the UCOM dataset [6] has been reported for
Urdu text, several differences exist between the UCOM
dataset and our dataset. Firstly, the UCOM offline dataset has been developed for continuous Urdu text, whereas our
dataset is for isolated characters of Urdu handwritten text.
Secondly, the UCOM dataset, as described by the authors
in [6], contains 600 pages of Urdu text, and the
number of different individuals who have written the text
is limited to 100, while our dataset contains text from 900
individuals. Thirdly, the UCOM dataset contains text in
Nasta’liq style only, while our dataset contains handwritten samples in different styles and variations, thus covering a more diverse range of writing (font) styles.
Deep learning (a sub-branch of machine learning) algorithms have been popular for automatic recognition of digits and characters of different languages. Deep networks
can be trained in a supervised fashion, which requires labels, or in
an unsupervised way, which does not [7–9].
In this work, we use an autoencoder network and a convolutional neural network (CNN) trained with 85% of
the dataset and tested with the remaining 15%.
Moreover, these models are evaluated in configurations
with two hidden layers and three hidden layers.
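The 85%/15% partition described above can be illustrated with a minimal sketch. It assumes a simple shuffled split over sample indices; the text here does not state how the split was drawn, so the function name `split_indices`, the fixed seed, and the toy sample count are hypothetical choices made only for illustration.

```python
import random


def split_indices(num_samples, train_fraction=0.85, seed=0):
    """Shuffle sample indices and split them into train and test lists.

    Assumption: a plain random split. The source does not say whether
    the split was stratified by class or by writer.
    """
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    cut = round(num_samples * train_fraction)
    return indices[:cut], indices[cut:]


# Hypothetical toy size, not the size of the actual dataset.
train_idx, test_idx = split_indices(1000)
print(len(train_idx), len(test_idx))
```

In practice, a split grouped by writer may give a stricter estimate of generalization to unseen handwriting; whether that was done here is not stated in this part of the paper.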
The rest of the paper is organized as follows. Section 2
provides a literature review of existing work on Urdu
text recognition. In Sect. 3, we describe the dataset developed, the source of the data, and the pre-processing and segmentation steps. We describe the use of a deep autoencoder network and CNN in Sect. 4. Results are presented in Sect. 5
and, finally, the paper is concluded in Sect. 6.
2 Literature review
For character recognition, machine learning techniques
such as deep neural networks and CNNs have been used.
Arnold et al. used neural networks for character recognition [10]. Similarly, in [11, 12], CNNs have been used for
Chinese character recognition. A stacked denoising
autoencoder has been used in [13] for offline Urdu character recognition. However, the work in [13] is limited to
optical character recognition of Nastaliq fonts only. Hussain et al. proposed an offline OCR system that recognizes
only eight Arabic handwritten characters with an accuracy
rate of 77.25% [14]. The framework proposed by Elenwar
et al. [15] used an Arabic character database containing 1814
characters for training and 435 characters for testing. The
database used in [16] was prepared by only four writers, leading to poor generalization. A database for Arabic characters
is presented in [17], in which the authors performed preprocessing steps to avoid noise in the printed database.
Another database for Arabic characters consists of 28
thousand Arabic characters written by 100
different writers [18]. A similar work has bee (...truncated)