A novel normal to tangent line (NTL) algorithm for scale invariant feature extraction for Urdu OCR
Published in: | International journal on document analysis and recognition 2022-03, Vol.25 (1), p.51-66 |
---|---|
Main Authors: | , , , |
Format: | Article |
Language: | English |
Subjects: | |
Summary: | Font-invariant recognition of Urdu optical characters is difficult because of the nature of the Nastalique script. Urdu Nastalique is a complex script: it is highly cursive, its characters overlap, and characters change shape with context. Identifying the starting position of the same character in different contexts adds further complexity. Hence, an optical character recognition (OCR) system trained to recognize characters at a particular font size may not achieve the same accuracy when the font size varies. Given this complexity, the current research focuses on finding a feature set that provides sufficient information for scale-invariant Urdu optical character recognition. For this task, the calligraphic properties of Urdu Nastalique, the thickness of the ligature, the direction of movement of the calligraphic pen and global geometric features (height and width) are used as the feature set. The thickness feature is extracted using two novel algorithms, i.e. the “Normal to Tangent Line Algorithm (NTL)” and the “Angle to Tangent Line Algorithm (ATL)”. These features are fed to three different models, i.e. correlation, C4.5 and a feedforward artificial neural network (ANN), and the performance of these models is compared with SIFT (Scale-Invariant Feature Transform). Both real and fabricated data sets are used for training and testing. A new benchmark dataset of extracted features, named Urdu OCR—Scale Invariant Feature Vectors (SIFVs), was developed and released on Kaggle. Correlation, C4.5 and ANN-based models trained on the SIFVs dataset yielded 94.56%, 90.54% and 94.65% accuracy, respectively, outperforming SIFT descriptors, which achieved only 75.45% accuracy on average. |
---|---|
ISSN: | 1433-2833 1433-2825 |
DOI: | 10.1007/s10032-021-00389-x |
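The summary names the NTL algorithm but does not spell out its steps. The Python sketch below only illustrates the general idea the name suggests: measure stroke thickness along the line normal to the stroke's tangent, then divide by ligature height so the value does not depend on font size. It is not the authors' implementation. All function names, the toy shape and the one-pixel stepping are assumptions made for illustration.

```python
import math


def is_ink(img, x, y):
    """True if the pixel nearest to (x, y) lies inside the image and is ink (1)."""
    r, c = int(round(y)), int(round(x))
    return 0 <= r < len(img) and 0 <= c < len(img[0]) and img[r][c] == 1


def thickness_along_normal(img, x, y, tangent):
    """Approximate stroke thickness in pixels at (x, y).

    Walks in one-pixel steps along the normal to `tangent` in both
    directions until the stroke ends. Hypothetical helper, not from the paper.
    """
    tx, ty = tangent
    length = math.hypot(tx, ty)
    if length == 0:
        raise ValueError("tangent must be a non-zero vector")
    nx, ny = -ty / length, tx / length  # unit normal to the tangent
    if not is_ink(img, x, y):
        return 0
    steps = 0
    for sign in (1, -1):
        d = 1
        while is_ink(img, x + sign * d * nx, y + sign * d * ny):
            steps += 1
            d += 1
    return steps + 1  # +1 counts the starting pixel itself


def ligature_height(img):
    """Number of rows between the first and last row that contain ink."""
    rows = [r for r, row in enumerate(img) if any(row)]
    return rows[-1] - rows[0] + 1 if rows else 0


def relative_thickness(img, x, y, tangent):
    """Thickness divided by ligature height (an assumed normalisation)."""
    height = ligature_height(img)
    return thickness_along_normal(img, x, y, tangent) / height if height else 0.0


def make_shape(scale):
    """Toy 'ligature': a vertical stem plus a horizontal bar, drawn at `scale`."""
    rows, cols = 8 * scale, 10 * scale
    img = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            in_stem = c < scale
            in_bar = 3 * scale <= r < 5 * scale
            if in_stem or in_bar:
                img[r][c] = 1
    return img
```

Calling `relative_thickness(make_shape(1), 6, 4, (1, 0))` and `relative_thickness(make_shape(3), 18, 12, (1, 0))` measures the same point of the bar at two sizes. The raw thickness grows with the scale while the height-normalised value stays the same, which is the property a scale-invariant feature needs.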