SINDHI DIACRITICS RESTORATION BY LETTER LEVEL LEARNING APPROACH

J. A. MAHAR, G. Q. MEMON, H. SHAIKH

Abstract


Sindhi is one of the languages that require diacritics for accurate reading and comprehension, yet diacritics are almost always omitted in everyday writing. Their absence introduces many syntactic, morphological and phonological ambiguities for computational processing. Diacritics can be restored at the letter level or at the word level. In this paper, a letter-level learning method is used for Sindhi diacritics restoration: the letters surrounding each target letter are collected and stored in a feature vector, which is then compared with new examples taken from the non-diacritized input text. The surrounding letters are collected with different window sizes, and N=5 is observed to be the most effective. A k-nearest neighbor classifier is used to classify the instances, and the nearest instance is finally taken to replace the non-diacritized letter. Results are reported in terms of Diacritic Error Rate (DER), which is 1.9%. The proposed approach is tested on Sindhi but can be used for other Arabic-script-based languages, because the Sindhi character set is a superset of the Arabic character set.
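The letter-level procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several details are assumptions made only for illustration: a window of N=5 is taken to mean the target letter plus two letters on each side, edges are padded with a hypothetical `_` symbol, the distance is a simple count of mismatched window positions, k is fixed at 1, and DER is taken to be the fraction of letters whose restored diacritics differ from the reference. All function names are hypothetical.

```python
import unicodedata

PAD = "_"  # assumed padding symbol for window positions beyond the word edge


def split_marks(text):
    """Split text into (base letter, attached diacritics) pairs."""
    pairs = []
    for ch in text:
        # Arabic-script diacritics are combining marks (non-zero combining class).
        if unicodedata.combining(ch) and pairs:
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def window(letters, i, n=5):
    """Return the n letters centred on position i, padded at the edges."""
    half = n // 2
    padded = [PAD] * half + list(letters) + [PAD] * half
    return tuple(padded[i:i + n])


def train(diacritized_words, n=5):
    """Store one (feature window, diacritics) instance per letter."""
    instances = []
    for word in diacritized_words:
        pairs = split_marks(word)
        letters = [base for base, _ in pairs]
        for i, (_, marks) in enumerate(pairs):
            instances.append((window(letters, i, n), marks))
    return instances


def distance(a, b):
    """Number of window positions where the two feature vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def restore(word, instances, n=5):
    """Attach to each letter the diacritics of its nearest stored instance (k=1)."""
    letters = [base for base, _ in split_marks(word)]
    out = []
    for i, letter in enumerate(letters):
        feats = window(letters, i, n)
        _, marks = min(instances, key=lambda inst: distance(inst[0], feats))
        out.append(letter + marks)
    return "".join(out)


def diacritic_error_rate(predicted, reference):
    """Fraction of letters whose diacritics differ from the reference (assumed definition)."""
    pred, ref = split_marks(predicted), split_marks(reference)
    if not ref or len(pred) != len(ref):
        raise ValueError("texts must be non-empty and contain the same number of letters")
    return sum(p[1] != r[1] for p, r in zip(pred, ref)) / len(ref)


# Toy data: three letters (kaf, teh, beh), each followed by a fatha (U+064E).
gold = "\u0643\u064e\u062a\u064e\u0628\u064e"
bare = "\u0643\u062a\u0628"  # the same letters without diacritics
model = train([gold])
restored = restore(bare, model)
print(restored == gold, diacritic_error_rate(restored, gold))
```

With a single training word the example is trivially perfect. It only shows how the feature windows, the nearest-instance lookup and the error measure fit together, and says nothing about the 1.9% DER reported above.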




Copyright (c) 2015 Sindh University Research Journal - SURJ (Science Series)

 Copyright © University of Sindh, Jamshoro. 2017 All Rights Reserved.
Printing and Publication by: Sindh University Press.