2 Stemming Techniques
In this section, we describe the five schemes we have
evaluated for Indonesian stemming. In particular, we
detail the approach of Nazief and Adriani, which performs the best in our evaluation of all approaches in
Section 4. We propose extensions to this approach in
Section 5.
2.1 Nazief and Adriani's Algorithm
The stemming scheme of Nazief and Adriani is described in an unpublished technical report from the
University of Indonesia (1996). In this section, we describe the steps of the algorithm, and illustrate each
with examples; however, for compactness, we omit
the detail of selected rule tables. We refer to this
approach as nazief.
The algorithm is based on comprehensive morphological rules that group together and encapsulate allowed and disallowed affixes, including prefixes, suffixes,
infixes (insertions) and confixes (combinations
of prefixes and suffixes). The algorithm also supports
recoding, an approach to restore an initial letter that
was removed from a root word prior to prepending
a prefix. In addition, the algorithm makes use of an
auxiliary dictionary of root words that is used in most
steps to check if the stemming has arrived at a root
word.
Before considering how the scheme works, we consider the basic groupings of affixes used as a basis
for the approach, and how these definitions are combined to form a framework to implement the rules.
The scheme groups affixes into categories:
1. Inflection suffixes: the set of suffixes that
do not alter the root word. For example,
"duduk" (sit) may be suffixed with "-lah" to give
"duduklah" (please sit). The inflections are further divided into:
(a) Particles (P): including "-lah" and
"-kah", as used in words such as "duduklah"
(please sit).
(b) Possessive pronouns (PP): including
"-ku", "-mu", and "-nya", as used in
"ibunya" (a third person possessive form of
"mother").
Particle and possessive pronoun inflections can
appear together and, if they do, possessive pronouns appear before particles. A word can have
at most one particle and one possessive pronoun,
and these may be applied directly to root words
or to words that have a derivation suffix. For
example, "makan" (to eat) may be appended
with the derivation suffix "-an" to give "makanan"
(food). This can be suffixed with "-nya" to give
"makanannya" (a possessive form of "food").
2. Derivation suffixes: the set of suffixes that are
directly applied to root words. There can be only
one derivation suffix per word. For example, the
word "lapor" (to report) can be suffixed by the
derivation suffix "-kan" to become "laporkan"
(go to report). In turn, this can be suffixed with,
for example, an inflection suffix "-lah" to become
"laporkanlah" (please go to report).
3. Derivation prefixes: the set of prefixes that are
applied either directly to root words, or to words
that have up to two other derivation prefixes.
For example, the derivation prefixes "mem-" and
"per-" may be prepended to "indahkannya" to
give "memperindahkannya" (the act of beautifying).
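The suffix categories above can be illustrated with a minimal sketch in Python. It is not the nazief algorithm itself: it only peels suffixes from the end of a word in the reverse of the order in which they attach (particle, then possessive pronoun, then derivation suffix) and stops as soon as a dictionary lookup succeeds. The affix lists contain only the suffixes named in the examples above, and the function names and the tiny root dictionary are assumptions made for illustration.

```python
# Only the suffixes named in the text; the full rule tables are not reproduced.
PARTICLES = ("lah", "kah")
POSSESSIVE_PRONOUNS = ("ku", "mu", "nya")
DERIVATION_SUFFIXES = ("an", "kan")


def strip_one(word, suffixes):
    """Remove at most one suffix from the end of word.

    Longer suffixes are tried first so that "-kan" wins over "-an".
    Returns (remaining_word, removed_suffix_or_None).
    """
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[:-len(suffix)], suffix
    return word, None


def peel_suffixes(word, roots):
    """Peel P, then PP, then DS, checking the root dictionary before each step."""
    removed = []
    groups = (
        ("P", PARTICLES),
        ("PP", POSSESSIVE_PRONOUNS),
        ("DS", DERIVATION_SUFFIXES),
    )
    for label, suffixes in groups:
        if word in roots:  # dictionary check: stop once a root is reached
            break
        word, suffix = strip_one(word, suffixes)
        if suffix is not None:
            removed.append((label, suffix))
    return word, removed


# Hypothetical miniature dictionary, for illustration only.
ROOTS = {"duduk", "ibu", "makan", "lapor"}

for example in ("duduklah", "ibunya", "makanannya", "laporkanlah", "makan"):
    print(example, peel_suffixes(example, ROOTS))
```

The last example shows why the dictionary check matters: without it, "makan" would lose its final "an" even though it is already a root word.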
The classification of affixes as inflections and
derivations leads to an order of use:
[DP+[DP+[DP+]]] root-word [[+DS][+PP][+P]]
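One way to read this template is as a set of optional slots around a root word. The sketch below, in Python, enumerates every split of a word that the template permits (up to three derivation prefixes, then at most one derivation suffix, one possessive pronoun and one particle, in that order) and returns the first split whose root is in a dictionary. It is an illustration of the template only, not the stemming procedure of Nazief and Adriani: it ignores recoding and the allowed and disallowed affix combinations, it uses only the affixes named in the text, and all names in it are hypothetical.

```python
# Only the affixes named in the text; real rule tables are larger.
DP = ("mem", "per")
DS = ("an", "kan")
PP = ("ku", "mu", "nya")
P = ("lah", "kah")


def split_prefixes(word, limit=3):
    """Yield (prefix_list, rest) for zero up to `limit` leading prefixes."""
    yield [], word
    if limit > 0:
        for prefix in DP:
            if word.startswith(prefix) and len(word) > len(prefix):
                for more, rest in split_prefixes(word[len(prefix):], limit - 1):
                    yield [prefix] + more, rest


def chop(word, suffix):
    """Return word without suffix, word itself if suffix is None, else None."""
    if suffix is None:
        return word
    if word.endswith(suffix) and len(word) > len(suffix):
        return word[:-len(suffix)]
    return None


def split_suffixes(word):
    """Yield (root, ds, pp, p); each slot is optional, order is DS, PP, P."""
    for p in (None,) + P:
        without_p = chop(word, p)
        if without_p is None:
            continue
        for pp in (None,) + PP:
            without_pp = chop(without_p, pp)
            if without_pp is None:
                continue
            for ds in (None,) + DS:
                root = chop(without_pp, ds)
                if root is not None:
                    yield root, ds, pp, p


def parse(word, roots):
    """Return the first template-conforming split whose root is in roots."""
    for prefixes, rest in split_prefixes(word):
        for root, ds, pp, p in split_suffixes(rest):
            if root in roots:
                return {"DP": prefixes, "root": root, "DS": ds, "PP": pp, "P": p}
    return None


# Hypothetical miniature dictionary, for illustration only.
ROOTS = {"indah", "makan", "lapor"}
print(parse("memperindahkannya", ROOTS))
print(parse("laporkanlah", ROOTS))
```

Enumerating splits and filtering by dictionary keeps the sketch close to the template as written; a practical stemmer would instead apply the rules step by step.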