ISSUES AND CHALLENGES IN SINDHI OCR

D. N. HAKRO, I. A. ISMAILI, A. Z. TALIB, Z. BHATTI, G. N. MOJAI

Abstract


Optical Character Recognition (OCR) is the machine reading (recognition) of written or printed documents. OCR systems exist for many languages, but Sindhi, a language with a 5,000-year history, still lacks one. OCR is comparatively easy to develop for Latin-script and other languages whose characters are isolated (non-cursive); it is far more challenging for a language such as Sindhi, which is written cursively and has a large character set. Sindhi has 52 characters, compared with 28 in Arabic, 32 in Persian and 39 in Urdu. This paper presents the scripts used for Sindhi, including very old ones, and the issues and challenges that the cursive nature and other features of the current standard script pose for Sindhi OCR. The main challenges are cursiveness; a large set of characters to recognize; many dotted characters, including characters with four dots; many groups of characters that share a base shape and differ only in the number, placement and orientation of dots; ambiguity between characters that differ only slightly; context-sensitive shapes; ligatures; Unicode representation; and noise, skew and font variation. We also provide a summary of these issues and challenges, which should be useful to researchers working on OCR as well as on Sindhi computing.
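As a small illustration of the context-sensitive shapes and Unicode representation issues named above, the following Python sketch lists the positional glyph forms that Unicode records for a single Arabic-script letter. It is only a sketch and not the authors' method: the function name `contextual_forms` and the choice of the letter beh (U+0628), which Sindhi shares with Arabic, are assumptions made for illustration, and the lookup relies solely on the compatibility decompositions exposed by Python's standard `unicodedata` module.

```python
import unicodedata

# Decomposition tags Unicode uses for positional (contextual) glyph variants.
POSITIONAL_TAGS = ("<isolated>", "<final>", "<initial>", "<medial>")

# Arabic Presentation Forms-A and Arabic Presentation Forms-B blocks.
PRESENTATION_RANGES = ((0xFB50, 0xFDFF), (0xFE70, 0xFEFF))


def contextual_forms(base):
    """Map each positional form name to the presentation-form character
    whose compatibility decomposition is exactly the given base letter.

    Hypothetical helper for illustration only.
    """
    target = f"{ord(base):04X}"
    forms = {}
    for start, end in PRESENTATION_RANGES:
        for cp in range(start, end + 1):
            # Unassigned code points return an empty decomposition string.
            parts = unicodedata.decomposition(chr(cp)).split()
            # Keep single-letter positional forms; skip multi-letter ligatures.
            if len(parts) == 2 and parts[0] in POSITIONAL_TAGS and parts[1] == target:
                forms[parts[0].strip("<>")] = chr(cp)
    return forms


if __name__ == "__main__":
    # One stored code point (beh, U+0628) can be drawn in several shapes.
    for position, glyph in contextual_forms("\u0628").items():
        print(position, f"U+{ord(glyph):04X}", unicodedata.name(glyph))
```

The point of the sketch is that text stores one code point per letter while the printed page shows a different shape depending on the letter's position in a word, so a recognizer has to map several visual shapes back to the same character.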



Copyright (c) 2016 Sindh University Research Journal - SURJ (Science Series)

 Copyright © University of Sindh, Jamshoro. 2017 All Rights Reserved.
Printing and Publication by: Sindh University Press.