Stemming Indonesian - Techniques 2

Prefix    Disallowed suffixes
be-       -i
di-       -an
ke-       -i, -kan
me-       -an
se-       -i, -kan
te-       -an

Table 1: Disallowed prefix and suffix combinations.
The only exception is that the root word "tahu" is
permitted with the prefix "ke-" and the suffix "-i".
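
Table 1 can be read as a simple lookup. The sketch below is one way to encode it, assuming each prefix is keyed by the form shown in the table; the names DISALLOWED and is_allowed are illustrative and not taken from the paper, and mapping prefix variations onto these keys is left out.

```python
# Table 1 as a lookup: prefix -> suffixes that may not accompany it.
# Names here are hypothetical; only the table contents come from the text.
DISALLOWED = {
    "be-": {"-i"},
    "di-": {"-an"},
    "ke-": {"-i", "-kan"},
    "me-": {"-an"},
    "se-": {"-i", "-kan"},
    "te-": {"-an"},
}


def is_allowed(prefix: str, suffix: str, root: str = "") -> bool:
    """Return False if Table 1 forbids this prefix/suffix pair."""
    # The single exception stated in the caption: "ke-" + "tahu" + "-i".
    if (prefix, suffix, root) == ("ke-", "-i", "tahu"):
        return True
    # Prefixes not listed in the table carry no restriction here.
    return suffix not in DISALLOWED.get(prefix, set())
```

For instance, is_allowed("di-", "-an") is False, while is_allowed("ke-", "-i", "tahu") is True because of the stated exception.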
The square brackets indicate that an affix is optional.
The previous definition forms the basis of the rules
used in the approach. However, there are exceptions
and limitations that are incorporated in the rules:
1. Not all combinations are possible. For example,
after a word is prefixed with "di-", the word is
not allowed to be suffixed with "-an". A complete
list is shown in Table 1.
2. The same affix cannot be repeatedly applied. For
example, after a word is prefixed with "te-" or
one of its variations, it is not possible to repeat
the prefix "te-" or any of those variations.
3. If a word has one or two characters, then stemming
is not attempted.
4. Adding a prefix may change the root word or
a previously-applied prefix; we discuss this further
in our description of the rules. To illustrate,
consider "meng-" that has the variations
"mem-", "meng-", "meny-", and "men-". Some of
these may change the prefix of a word; for example,
for the root word "sapu" (broom), the variation
applied is "meny-" to produce the word
"menyapu" (to sweep) in which the "s" is removed.
The latter complication requires that an effective Indonesian
stemming algorithm be able to add deleted
letters through the recoding process.
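
As a minimal sketch of what recoding involves, the following strips a prefix variation and, if the remainder is not a known root, tries restoring the letter that the prefix may have deleted. The toy dictionary and the RECODING table are assumptions for illustration; only the "meny-"/"s" pair is taken from the "menyapu" example above.

```python
# Toy dictionary (assumption); a real stemmer would load a full root-word list.
ROOT_WORDS = {"sapu"}

# Prefix variation -> letter that may have been deleted from the root.
# Only the "meny"/"s" pair comes from the text; the table is illustrative.
RECODING = {"meny": "s"}


def strip_with_recoding(word: str) -> str:
    """Strip a known prefix variation, restoring a deleted letter if needed."""
    for variation, deleted in RECODING.items():
        if word.startswith(variation):
            remainder = word[len(variation):]
            if remainder in ROOT_WORDS:
                return remainder
            # Recoding: put back the letter the prefix may have removed.
            recoded = deleted + remainder
            if recoded in ROOT_WORDS:
                return recoded
    return word  # nothing matched; leave the word unchanged
```

Here strip_with_recoding("menyapu") first yields "apu", which is not in the dictionary, and then recodes it to "sapu".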
The algorithm itself employs three components:
the affix groupings, the order of use rules (and their
exceptions), and a dictionary. The dictionary is
checked after any stemming rule succeeds: if the resultant
word is found in the dictionary, then stemming
has succeeded in finding a root word, the algorithm
returns the dictionary word, and then stops; we omit
this lookup from each step in our rule listing. In addition,
each step checks if the resultant word is less
than two characters in length and, if so, no further
stemming is attempted.
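
The lookup and length check described above can be sketched as a small driver that wraps every rule. This is an illustration under stated assumptions, not the authors' implementation: the names stem_with_lookup and strip_nya, the toy dictionary, and the choice to return None when no root is found are all hypothetical.

```python
from typing import Callable, Iterable, Optional

# Toy dictionary (assumption); a real stemmer would load a full root-word list.
ROOT_WORDS = {"sapu", "buku"}

# A rule returns the shortened word, or None if it does not apply.
Rule = Callable[[str], Optional[str]]


def stem_with_lookup(word: str, rules: Iterable[Rule]) -> Optional[str]:
    """Apply rules in order, checking the dictionary after each success."""
    current = word
    for rule in rules:
        result = rule(current)
        if result is None:
            continue  # the rule did not apply; try the next one
        current = result
        if current in ROOT_WORDS:
            return current  # a root word was found, so stop
        if len(current) < 2:
            break  # too short; no further stemming is attempted
    return None  # no root found (return value is an assumption)


def strip_nya(word: str) -> Optional[str]:
    """Illustrative rule: remove a trailing "nya" if present."""
    return word[:-3] if word.endswith("nya") else None
```

With the toy dictionary, stem_with_lookup("bukunya", [strip_nya]) returns "buku". Keeping the lookup in the driver mirrors the text's choice to omit it from each individual rule.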
For each word to be stemmed, the following steps
are followed:
1. The unstemmed word is searched for in the dictionary. If it is found in the dictionary, it is assumed the word is a root word, and so the word
is returned and the algorithm stops.
2. Inflection suffixes ("-lah", "-kah", "-ku", "-mu",
or "-nya") are removed. If this succeeds and
the suffix is a particle ("-lah" or "-kah"), this
step is again attempted to remove any inflectional
possessive pronoun suffixes ("-ku", "-mu",
or "-nya").
3. Derivation suffix ("-i" or "-an") removal is attempted. If this succeeds, Step 4 is attempted.
If Step 4 does not succeed:
(a) If "-an" was removed, and the final letter
of the word is "-k", then the "-k" is also
removed and Step 4 is re-attempted. If that
fails, Step 3b is performed.