AUTOMATIC DIACRITICS RESTORATION SYSTEM FOR SINDHI

J. A. MAHAR, G. Q. MEMON

Abstract


Sindhi is written in a script based on Arabic, and both languages are usually written without diacritics in everyday use. The absence of diacritics leaves the vowel sounds of the characters that make up a word ambiguous. The resulting morphological and lexical ambiguity also makes correct pronunciation difficult for computational systems. To address this, the paper presents an improved mechanism that inserts diacritic marks into non-diacritized text by multiplying three N-gram probabilities within the Viterbi algorithm; the word probabilities are calculated using unigram, bigram and trigram models. The system achieves a word error rate of 0.71% and a diacritic error rate of 3.21%. Because a few languages, namely Arabic, Urdu and Persian, share these characteristics with Sindhi, the proposed system may be similarly useful for those languages.
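
The decoding idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the smoothing method, the candidate lexicon, or the training corpus, so the fixed floor probability `FLOOR`, the helper names (`train`, `prob`, `restore`) and the toy tokens (`x`, `y`, `kt`, `kat`, `kit`) are all assumptions made for illustration. Each undiacritized word has a set of candidate diacritized forms, and a Viterbi search picks the sequence that maximizes the product of unigram, bigram and trigram probabilities (summed here as logarithms to avoid underflow).

```python
import math
from collections import defaultdict

FLOOR = 1e-6  # assumed probability for unseen n-grams (crude smoothing, not from the paper)

def train(sentences):
    """Count 1-, 2- and 3-grams (and their contexts) from diacritized sentences."""
    ngrams, contexts = defaultdict(int), defaultdict(int)
    for sent in sentences:
        padded = ["<s>", "<s>"] + list(sent)
        for i in range(2, len(padded)):
            for n in (1, 2, 3):
                ngrams[tuple(padded[i - n + 1:i + 1])] += 1   # n-gram ending at position i
                contexts[tuple(padded[i - n + 1:i])] += 1     # its (n-1)-word history
    return ngrams, contexts

def prob(model, ngram):
    """Relative-frequency estimate of the last word given its history."""
    ngrams, contexts = model
    count, total = ngrams.get(ngram, 0), contexts.get(ngram[:-1], 0)
    return count / total if count and total else FLOOR

def log_score(model, u, v, w):
    """Log of the product of unigram, bigram and trigram probabilities for w."""
    return (math.log(prob(model, (w,)))
            + math.log(prob(model, (v, w)))
            + math.log(prob(model, (u, v, w))))

def restore(model, lexicon, words):
    """Viterbi search over candidate diacritized forms; state = last two words."""
    best = {("<s>", "<s>"): (0.0, [])}
    for word in words:
        candidates = lexicon.get(word, [word])  # unknown words pass through unchanged
        nxt = {}
        for (u, v), (score, path) in best.items():
            for w in candidates:
                s = score + log_score(model, u, v, w)
                if (v, w) not in nxt or s > nxt[(v, w)][0]:
                    nxt[(v, w)] = (s, path + [w])
        best = nxt
    return max(best.values(), key=lambda item: item[0])[1]

# Toy data with placeholder tokens (hypothetical, not real Sindhi words).
corpus = [["x", "kat"], ["x", "kat"], ["y", "kit"]]
lexicon = {"kt": ["kat", "kit"]}  # bare form -> candidate diacritized forms
model = train(corpus)

print(restore(model, lexicon, ["x", "kt"]))
print(restore(model, lexicon, ["y", "kt"]))
```

In this toy setup the bare form `kt` is resolved differently depending on the preceding word, which is the kind of contextual disambiguation the N-gram models are meant to provide. A practical system would need proper smoothing or back-off in place of the fixed floor.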




Copyright (c) 2015 Sindh University Research Journal - SURJ (Science Series)

 Copyright © University of Sindh, Jamshoro. 2017 All Rights Reserved.
Printing and Publication by: Sindh University Press.