Stemming Indonesian - Abstract

Stemming Indonesian
Jelita Asian, Hugh E. Williams, S.M.M. Tahaghoghi
School of Computer Science and Information Technology
RMIT University, GPO Box 2476V, Melbourne 3001, Australia.
{jelita,hugh,saied}@cs.rmit.edu.au

Abstract
Stemming words to (usually) remove suffixes has applications in text search, machine translation, document summarisation, and text classification. For example, English stemming reduces the words "computer", "computing", "computation", and "computability" to their common morphological root,
"comput-". In text search, this permits a search for
"computers" to find documents containing all words
with the stem "comput-". In the Indonesian language, stemming is of crucial importance: words
have prefixes, suffixes, infixes, and confixes that make
matching related words difficult. In this paper, we
investigate the performance of five Indonesian stemming algorithms through a user study. Our results
show that, with the availability of a reasonable dictionary, the unpublished algorithm of Nazief and Adriani correctly stems around 93% of word occurrences
to the correct root word. With the improvements we
propose, this almost reaches 95%. We conclude that
stemming for Indonesian should be performed using
our modified Nazief and Adriani approach.
Keywords: stemming, Indonesian Information Retrieval
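
The English search example in the abstract can be illustrated with a minimal Python sketch. It is a toy under stated assumptions: the suffix list, the minimum stem length, and the names toy_stem, build_index, and search are hypothetical and chosen only to cover the abstract's example words. It is not the Nazief and Adriani algorithm, nor any of the other stemmers evaluated in the paper, and it does not attempt Indonesian prefixes, infixes, or confixes.

```python
from collections import defaultdict

# Hypothetical suffix list, chosen only to cover the abstract's English
# example words. This is NOT a published stemmer and NOT the Nazief and
# Adriani algorithm.
SUFFIXES = ["ability", "ation", "ing", "ers", "er", "s"]
MIN_STEM_LENGTH = 3  # assumption: never strip a word below three letters


def toy_stem(word):
    """Strip the longest matching suffix from a lowercased word."""
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


def build_index(documents):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents sharing the query's stem."""
    return sorted(index.get(toy_stem(query), set()))


documents = {
    1: "computing is fun",
    2: "theory of computation",
    3: "Indonesian grammar",
    4: "computability",
}
index = build_index(documents)
print(search(index, "computers"))  # documents 1, 2, and 4 share the stem "comput"
```

Because both the query and the indexed words are reduced to the same stem before lookup, a query for "computers" matches documents that contain none of its surface forms. Plain suffix stripping of this kind does not handle the prefixes, infixes, and confixes that the abstract identifies as the difficulty in Indonesian.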

