A MODEL FOR SINDHI TEXT SEGMENTATION INTO WORD TOKENS

J. A. MAHAR, H. SHAIKH, G. Q. MEMON

Abstract


A corpus is a prerequisite for conducting experiments with computational linguistic applications in any language. Corpora are generally downloaded from the Internet in different formats, and the downloaded text usually contains word ambiguities that affect computational processing. In Sindhi, two types of ambiguity are commonly found: compound words typed without an embedded space, and typographical errors. Without correct segmentation of text into word tokens, it is difficult to obtain good results from linguistic applications; tokenization is therefore an inevitable component of natural language and speech processing applications. This paper presents a new model that correctly segments the words of the Sindhi language. The model consists of three layers: layer 1 takes the input text and segments it into words using white space, layer 2 segments simple and compound words, and layer 3 segments complex words. The tokenizer was tested on 2,792 Sindhi words and achieved an accuracy of 91.76%.
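The abstract does not give the model's internal rules, so the following is only a minimal sketch of how such a three-layer pipeline could be arranged: white-space splitting in layer 1, lexicon lookup for simple and compound words in layer 2, and a greedy longest-match split for complex (run-together) words in layer 3. The lexicon entries and the longest-match strategy are illustrative assumptions, not the paper's actual method.

```python
# Hypothetical sketch of a three-layer word tokenizer in the spirit of the
# model described in the abstract. The lexicon below is an assumed toy
# word list; the paper's real lexicon and rules are not reproduced here.

def layer1_whitespace(text):
    """Layer 1: segment the raw input text on white space."""
    return text.split()

def layer2_lookup(tokens, lexicon):
    """Layer 2: accept tokens found in the lexicon as simple/compound words;
    pass unknown tokens on for layer-3 processing."""
    known, unknown = [], []
    for tok in tokens:
        (known if tok in lexicon else unknown).append(tok)
    return known, unknown

def layer3_maxmatch(token, lexicon):
    """Layer 3 (assumed strategy): greedy longest-match split for complex
    words typed without an embedded space."""
    parts, i = [], 0
    while i < len(token):
        for j in range(len(token), i, -1):
            if token[i:j] in lexicon:
                parts.append(token[i:j])
                i = j
                break
        else:
            parts.append(token[i])  # unresolved character, kept as-is
            i += 1
    return parts
```

For example, with a toy lexicon `{"the", "cat"}`, `layer3_maxmatch("thecat", ...)` would split the run-together token into `["the", "cat"]`; the same mechanism would apply to Sindhi compound words typed without space, given a Sindhi lexicon.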




Copyright (c) 2015 Sindh University Research Journal - SURJ (Science Series)
