Stemming Indonesian - Introduction

1 Introduction
Stemming is a core natural language processing
technique for efficient and effective Information Retrieval (Frakes 1992), and one that is widely accepted
by users. It is used to transform word variants to their
common root word by applying, in most cases,
morphological rules. For example, in text search, it
should permit a user who searches with the query term
"stemming" to find documents that contain the terms
"stemmer" and "stems" because all share the common root word "stem". It also has applications in
machine translation (Bakar & Rahman 2003), document summarisation (Orăsan, Pekar & Hasler 2004),
and text classification (Gaustad & Bouma 2002).
For the English language, stemming is well understood, with techniques such as those of
Lovins (1968) and Porter (1980) in widespread use.
However, stemming for other languages is less well known: while there are several approaches available
for languages such as French (Savoy 1993), Spanish (Xu & Croft 1998), Malaysian (Ahmad, Yusoff
& Sembok 1996, Idris 2001), and Indonesian (Arifin
& Setiono 2002, Nazief & Adriani 1996, Vega 2001),
there is almost no consensus about their effectiveness.
Indeed, for Indonesian the schemes are neither easily
accessible nor well-explored. There are no comparative studies that consider the relative effectiveness of
alternative stemming approaches for this language.

Copyright © 2005, Australian Computer Society, Inc. This paper appeared at the 28th Australasian Computer Science Conference (ACSC2005), The University of Newcastle, Australia.
Conferences in Research and Practice in Information Technology, Vol. 38. V. Estivill-Castro, Ed. Reproduction for academic, not-for-profit purposes permitted provided this text is
included.

Stemming is essential to support effective Indonesian Information Retrieval, and has uses as diverse
as defence intelligence applications, document translation, and web search. Unlike in English, a more complex class of affixes, which includes prefixes, suffixes,
infixes (insertions), and confixes (combinations
of prefixes and suffixes), must be removed to transform a word to its root word, and the application
and order of the rules used to perform this process
require careful consideration. Consider a simple example: the word "minuman" (meaning "a drink") has
the root "minum" ("to drink") and the suffix "-an".
However, many examples do not share the simple suffix
approach used by English:
• "pemerintah" (meaning "government") is derived from the root "perintah" (meaning "govern") through the process of inserting the infix
  "em" between the "p-" and "-erintah" of "perintah".
• "anaknya" (a possessive form of child, such as
  "his/her child") has the root "anak" ("child")
  and the suffix "-nya" (a third person possessive).
• "duduklah" ("please sit") is "duduk" ("sit") and
  "-lah" (a softening, equivalent to "please").
• "buku-buku" ("books") is the plural of "buku"
  ("book").

These latter examples illustrate the importance of
stemming in Indonesian: without it, for example,
a document containing "anaknya" ("his/her child")
does not match the query term "anak" ("child").
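
The examples above can be made concrete with a small sketch. This is not the Nazief and Adriani stemmer or any other published scheme; it is a toy rule set that covers only the words listed above. The names `ROOT_WORDS`, `SUFFIXES`, and `toy_stem` are hypothetical, and the five-word dictionary is an assumption made purely for illustration.

```python
# Hypothetical root-word dictionary; a real stemmer would need a far larger one.
ROOT_WORDS = {"minum", "anak", "duduk", "buku", "perintah"}

# Illustrative suffixes only: the particle "-lah", the possessive "-nya",
# and the derivational suffix "-an".
SUFFIXES = ("lah", "nya", "an")


def toy_stem(word):
    """Return a known root for `word`, or `word` unchanged if no rule applies."""
    word = word.lower()
    if word in ROOT_WORDS:
        return word

    # Repeated forms such as "buku-buku": keep a single copy.
    first, hyphen, second = word.partition("-")
    if hyphen and first == second and first in ROOT_WORDS:
        return first

    # Strip one suffix, accepting the result only if it is a known root.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and word[: -len(suffix)] in ROOT_WORDS:
            return word[: -len(suffix)]

    # Infix "em" after the first letter, as in the "pemerintah" example.
    if word[1:3] == "em" and word[0] + word[3:] in ROOT_WORDS:
        return word[0] + word[3:]

    return word


# Without stemming, the query term "anak" misses a document containing "anaknya".
document_terms = ["anaknya", "duduklah", "buku-buku"]
query = "anak"
matches_raw = query in document_terms
matches_stemmed = toy_stem(query) in {toy_stem(t) for t in document_terms}
print(matches_raw, matches_stemmed)
```

Checking each candidate against the dictionary is what keeps a word such as "makan" from being cut to "mak" by the "-an" rule. It also shows why the order of the rules and the coverage of the dictionary both matter.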
Several techniques have been proposed for stemming Indonesian. We evaluate these techniques
through a user study, in which we compare the output of each scheme with the results of manual
stemming by four native speakers. Our results show
that an existing technique, proposed by Nazief and
Adriani (1996) in an unpublished technical report,
correctly stems around 93% of all word occurrences
(or 92% of unique words). After classifying the failure cases, and adding our own rules to address these
limitations, we show this can be improved to 95%
for both unique words and all word occurrences. We believe
that adding a more complete dictionary of root words
would improve these results even further. We conclude that our modified Nazief and Adriani stemmer
should be used in practice for stemming Indonesian.
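
The two accuracy figures quoted above differ only in how words are counted: once per occurrence in the text, or once per distinct word. The sketch below shows one plausible way to compute both. The function `stemming_accuracy`, the sample tokens, and the single gold-standard mapping are hypothetical; how the judgements of the four native speakers are combined is not described in this section, so a single agreed root per word is assumed here.

```python
from collections import Counter


def stemming_accuracy(tokens, stem, gold):
    """Return (accuracy over all occurrences, accuracy over unique words).

    tokens: words in running text, repeats included
    stem:   function mapping a word to its proposed root
    gold:   dict mapping each word to its manually assigned root (assumed)
    """
    if not tokens:
        raise ValueError("tokens must not be empty")
    counts = Counter(tokens)
    correct = {w for w in counts if stem(w) == gold[w]}
    by_occurrence = sum(counts[w] for w in correct) / len(tokens)
    by_unique_word = len(correct) / len(counts)
    return by_occurrence, by_unique_word


def strip_nya(word):
    """Deliberately weak stemmer: removes only the suffix "-nya"."""
    return word[:-3] if word.endswith("nya") else word


# Made-up data for illustration; not taken from the study.
tokens = ["anaknya", "anaknya", "anaknya", "minuman", "makanan"]
gold = {"anaknya": "anak", "minuman": "minum", "makanan": "makan"}

occurrence_acc, unique_acc = stemming_accuracy(tokens, strip_nya, gold)
print(occurrence_acc, unique_acc)
```

In this toy data the one word stemmed correctly is also the most frequent one, so the per-occurrence figure (3 of 5) is higher than the per-unique-word figure (1 of 3).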