Sindhi Stemmer for Information Retrieval System Using Rule-Based Stripping Approach

M. R. SHAH, H. SHAIKH, J. A. MAHAR, S. A. MAHAR

Abstract


Over the last few years, a large amount of information in the Sindhi language has been made available online as e-data. An information retrieval system ensures easy and efficient access to stored information. A stemmer is the tool that an information retrieval system uses to reduce the morphological variants of a word to its root or stem. As yet, no information retrieval system or stemmer is available for the Sindhi language; hence, these data resources cannot be readily accessed. In this paper, a stemming algorithm is proposed using a rule-based approach. The proposed algorithm depends on a lexicon and a set of linguistic rules that we developed. The lexicon contains 5327 words, of which 2142 carry prefix or suffix morphemes. The rule repository contains 38 rules. The performance on prefixed, suffixed and combined prefix-suffix words of the Sindhi language is calculated separately. The developed stemmer achieves a cumulative accuracy of 84.85%. This stemmer will be beneficial for developers of automatic Sindhi information retrieval systems.
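
To make the lexicon-plus-rules idea concrete, the following is a minimal sketch of rule-based affix stripping with lexicon validation, together with the accuracy measure as a plain percentage of correctly stemmed words. It is not the authors' algorithm: the affix lists, the lexicon entries and the function names `stem` and `accuracy` are hypothetical, and English-like placeholders stand in for the paper's 38 Sindhi rules and 5327-word lexicon, which the abstract does not list.

```python
from typing import Iterable

# Hypothetical stand-ins: toy English-like affixes and lexicon,
# NOT the paper's 38 rules or its 5327-word lexicon.
PREFIXES = ("un", "re")
SUFFIXES = ("ness", "ing", "s")
LEXICON = {"kind", "play", "do", "book"}


def stem(word: str, lexicon: set[str] = LEXICON) -> str:
    """Return the root found by stripping affixes, or the word unchanged."""
    if word in lexicon:
        return word
    # Candidate forms: prefix stripped, suffix stripped, both stripped.
    prefix_cut = [word[len(p):] for p in PREFIXES if word.startswith(p)]
    suffix_cut = [word[:-len(s)] for s in SUFFIXES if word.endswith(s)]
    both_cut = [c[:-len(s)] for c in prefix_cut for s in SUFFIXES if c.endswith(s)]
    for candidate in prefix_cut + suffix_cut + both_cut:
        # Accept a stripped form only when the lexicon confirms it is a root.
        if candidate in lexicon:
            return candidate
    return word


def accuracy(pairs: Iterable[tuple[str, str]]) -> float:
    """Percentage of (word, expected_root) pairs that are stemmed correctly."""
    pairs = list(pairs)
    correct = sum(1 for word, root in pairs if stem(word) == root)
    return 100.0 * correct / len(pairs)
```

Checking each stripped candidate against the lexicon is one simple way to avoid over-stemming; whether the paper validates candidates this way is an assumption of this sketch.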



Copyright (c) 2016 Sindh University Research Journal - SURJ (Science Series)

 Copyright © University of Sindh, Jamshoro. 2017 All Rights Reserved.
Printing and Publication by: Sindh University Press.