A novel normal to tangent line (NTL) algorithm for scale invariant feature extraction for Urdu OCR

Bibliographic Details
Published in:	International journal on document analysis and recognition 2022-03, Vol.25 (1), p.51-66
Main Authors:	Naseer, Asma; Hussain, Sarmad; Zafar, Kashif; Khan, Ayesha
Format:	Article
Language:	English
Subjects:	Accuracy; Algorithms; Artificial neural networks; Complexity; Computer Science; Datasets; Feature extraction; Image Processing and Computer Vision; Invariants; Optical character recognition; Optical properties; Original Paper; Pattern Recognition; Thickness

Description
Summary:	Font-invariant recognition of Urdu optical characters is a difficult task because of the nature of the Nastalique script. Urdu Nastalique is a complex script: it is highly cursive, its characters overlap, and characters change shape with context. Identifying the starting position of the same character in different contexts further increases the complexity. Hence, an optical character recognition (OCR) system trained to recognize characters of a particular font size may not reach the same level of accuracy when the font size varies. Given this complexity, the research focuses on finding a feature set that provides sufficient information for scale-invariant Urdu optical character recognition. For this task, calligraphic properties of Urdu Nastalique, the thickness of the ligature, the direction of movement of the calligraphic pen and global geometric features (height and weight) are used as the feature set. The thickness feature is extracted using two novel algorithms, i.e. “Normal to Tangent Line Algorithm (NTL)” and “Angle to Tangent Line Algorithm (ATL)”. These features are fed to three different models, i.e. correlation, C4.5 and a feedforward artificial neural network (ANN), and the performance of these models is compared with SIFT (Scale-Invariant Feature Transform). Both real and fabricated data sets are used for training and testing. A new benchmark dataset of extracted features, named Urdu OCR—Scale Invariant Feature Vectors (SIFVs), is developed and released on Kaggle. Correlation, C4.5 and ANN-based models trained on the SIFVs dataset outperformed SIFT descriptors, yielding 94.56%, 90.54% and 94.65% accuracy, respectively, while SIFT descriptors achieved only 75.45% accuracy on average.
ISSN:	1433-2833, 1433-2825
DOI:	10.1007/s10032-021-00389-x