Effective Techniques for Indonesian Text Retrieval

Effective Techniques for Indonesian Text Retrieval
A thesis submitted for the degree of
Doctor of Philosophy
Jelita Asian B.Comp. Sc.(Hons.),
School of Computer Science and Information Technology,
Science, Engineering, and Technology Portfolio,
RMIT University,
Melbourne, Victoria, Australia.
30th March, 2007
Declaration
I certify that except where due acknowledgment has been made, the work is that of the
author alone; the work has not been submitted previously, in whole or in part, to qualify
for any other academic award; the content of the thesis is the result of work which has been
carried out since the official commencement date of the approved research program; and, any
editorial work, paid or unpaid, carried out by a third party is acknowledged.
Jelita Asian
School of Computer Science and Information Technology
RMIT University
30th March, 2007
Acknowledgments
First and foremost, I thank Justin Zobel, Saied Tahaghoghi, and Falk Scholer for their
patience and general academic and mo…

Stemming untuk Bahasa Indonesia (Stemming for Indonesian)

Information Retrieval: Stemming for Indonesian

This time I will discuss stemming. This tutorial is actually part of an assignment for the course "Sistem Temu Kembali Informasi", known in English as "Information Retrieval System", which computer science usually calls "Information Retrieval", or "IR" for short.

So how does IR relate to stemming, why is stemming needed, and how does the stemming process itself work? Before we get to the tutorial, let's first cover what stemming is.

Stemming is the process of finding the root word (kata dasar) of a word. It works by removing all affixes: prefixes (awalan), infixes (sisipan), suffixes (akhiran), and confixes (konfiks, combinations of a prefix and a suffix). Stemming is used to reduce a word form to its root word in accordance with the morphological structure of the Indonesian language…

Stemming Indonesian - Techniques 2

Prefix   Disallowed suffixes
be-      -i
di-      -an
ke-      -i, -kan
me-      -an
se-      -i, -kan
te-      -an
Table 1: Disallowed prefix and suffix combinations.
The only exception is that the root word "tahu" is
permitted with the prefix "ke-" and the suffix "-i".
The square brackets indicate that an affix is optional.
The previous definition forms the basis of the rules
used in the approach. However, there are exceptions
and limitations that are incorporated in the rules:
1. Not all combinations are possible. For example,
after a word is prefixed with "di-", the word is
not allowed to be suffixed with "-an". A complete list is shown in Table 1.
2. The same affix cannot be repeatedly applied. For
example, after a word is prefixed with "te-" or
one of its variations, it is not possible to repeat
the prefix "te-" or any of those variations.
3. If a word has one or two characters, then stemming is not attempted.
4. Adding a prefix may change the root word or
a previously-applied prefix; we discuss this…
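
Rules 1 and 3 in the list above amount to simple lookups. The following Python sketch encodes Table 1 as a dictionary and checks a prefix and suffix pair against it. The names `DISALLOWED`, `is_allowed`, and `should_attempt_stemming` are illustrative assumptions, not taken from the paper, and the sketch covers only these two rules.

```python
# Table 1 as a lookup: prefix -> suffixes that may not be combined with it.
DISALLOWED = {
    "be": {"i"},
    "di": {"an"},
    "ke": {"i", "kan"},
    "me": {"an"},
    "se": {"i", "kan"},
    "te": {"an"},
}


def is_allowed(prefix, suffix, root=""):
    """Return True if the prefix and suffix may appear on the same word."""
    # The one exception noted under Table 1: "tahu" with "ke-" and "-i".
    if prefix == "ke" and suffix == "i" and root == "tahu":
        return True
    return suffix not in DISALLOWED.get(prefix, set())


def should_attempt_stemming(word):
    """Rule 3: words of one or two characters are left alone."""
    return len(word) > 2
```

For instance, `is_allowed("di", "an")` is false, while `is_allowed("ke", "i", root="tahu")` is true because of the exception for "tahu".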

Stemming Indonesian - Stemming Techniques 1

2 Stemming Techniques
In this section, we describe the five schemes we have
evaluated for Indonesian stemming. In particular, we
detail the approach of Nazief and Adriani, which performs the best in our evaluation of all approaches in
Section 4. We propose extensions to this approach in
Section 5.

2.1 Nazief and Adriani's Algorithm
The stemming scheme of Nazief and Adriani is described in an unpublished technical report from the
University of Indonesia (1996). In this section, we describe the steps of the algorithm, and illustrate each
with examples; however, for compactness, we omit
the detail of selected rule tables. We refer to this
approach as nazief.
The algorithm is based on comprehensive morphological rules that group together and encapsulate allowed and disallowed affixes, including prefixes, suffixes,
infixes (insertions) and confixes (combinations
of prefixes and suffixes). The algorithm also supports
recoding, an approach to restore an initial letter that
was removed from a root…
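
The excerpt is cut off before recoding is described in detail, so the following Python sketch only illustrates the general idea under stated assumptions; it does not reproduce Nazief and Adriani's rule tables. It assumes the common Indonesian pattern in which a nasal prefix form replaces the first letter of the root (for example "me-" + "tangkap" gives "menangkap"). The tiny `ROOTS` dictionary, the `RECODING` table, and the function name are hypothetical.

```python
# Hypothetical root-word dictionary, for illustration only.
ROOTS = {"tangkap", "sapu", "pukul"}

# Assumed mapping: nasal prefix form -> letter that may have been dropped
# from the start of the root when the prefix was attached.
RECODING = {"meny": "s", "meng": "k", "men": "t", "mem": "p"}


def strip_with_recoding(word):
    """Strip a nasal prefix form, restoring a dropped initial letter if needed."""
    # Try longer prefix forms first so "meny" wins over "men".
    for form in sorted(RECODING, key=len, reverse=True):
        if not word.startswith(form):
            continue
        rest = word[len(form):]
        if rest in ROOTS:
            return rest  # plain removal already yields a known root
        recoded = RECODING[form] + rest
        if recoded in ROOTS:
            return recoded  # the restored letter yields a known root
    return word  # no rule applied; leave the word unchanged
```

With this toy dictionary, "menangkap" is first reduced to "angkap", which is not a known root, and then recoded to "tangkap", which is.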

Stemming Indonesian - Introduction

1 Introduction
Stemming is a core natural language processing
technique for efficient and effective Information Retrieval (Frakes 1992), and one that is widely accepted
by users. It is used to transform word variants to their
common root word by applying, in most cases,
morphological rules. For example, in text search, it
should permit a user searching using the query term
"stemming" to find documents that contain the terms
"stemmer" and "stems" because all share the common root word "stem". It also has applications in
machine translation (Bakar & Rahman 2003), document summarisation (Orasan, Pekar & Hasler 2004),
and text classification (Gaustad & Bouma 2002).
For the English language, stemming is well understood, with techniques such as those of
Lovins (1968) and Porter (1980) in widespread use.
However, stemming for other languages is less well known: while there are several approaches available
for languages such as French (Savoy 1993), Spanish (Xu &…

Stemming Indonesian - Abstract

Stemming Indonesian
Jelita Asian Hugh E. Williams S.M.M. Tahaghoghi
School of Computer Science and Information Technology
RMIT University, GPO Box 2476V, Melbourne 3001, Australia.
{jelita,hugh,saied}@cs.rmit.edu.au

Abstract
Stemming words to (usually) remove suffixes has applications in text search, machine translation, document summarisation, and text classification. For example, English stemming reduces the words "computer", "computing", "computation", and "computability" to their common morphological root,
"comput-". In text search, this permits a search for
"computers" to find documents containing all words
with the stem "comput-". In the Indonesian language, stemming is of crucial importance: words
have prefixes, suffixes, infixes, and confixes that make
matching related words difficult. In this paper, we
investigate the performance of five Indonesian stemming algorithms through a user study. Our results
show that, with the availability of a reasonable dic…

Algoritma Stemming Bahasa Indonesia Perl (Indonesian Stemming Algorithm in Perl)

Indonesian Stemmer Algorithm


A root-word finder (stemmer) for Indonesian, written in the Perl programming language. The program works with a dictionary of root words and follows the patterns of affixed words set out in the Ejaan Yang Disempurnakan (EYD) spelling guidelines. Hopefully it is useful.
1. Introduction
The structure of word formation in Indonesian is as follows:
[prefix-1] + [prefix-2] + root + [suffix] + [possessive] + [particle]
Each bracketed part is optional and combines with the root word to form an affixed word. The affixes most commonly used in Indonesian are:
Particles (kata sandang): -lah, -kah, -pun, -tah.
Possessives (kata kepunyaan): -ku, -mu, -nya.
Suffixes (akhiran): -i, -an, -kan.
Prefixes (awalan): me-, ber-, pe-, di-, ke-, ter-, se-.
When prefixes are attached, the following rules apply:
Prefix    Changed form   Rule
me | pe   meng | peng    + V | k | g | h | q …   Example: mengambil = meng + ambil
          meny | peny    + s …                   Example: penyakit = …
V = vowel (a, i, u, e, o)
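
The word structure above suggests peeling endings off from the outside in (particle, then possessive, then suffix) and checking each intermediate form against the root dictionary. The Python sketch below illustrates that idea; it is not the Perl program described here. The `ROOTS` set is a tiny hypothetical dictionary, all function names are illustrative, only one prefix layer is handled, and of the prefix-change rules only the "meng | peng + V | k | g | h | q" condition is checked.

```python
VOWELS = "aiueo"

# Affix inventory from the list above, grouped by layer.
PARTICLES = ("lah", "kah", "pun", "tah")
POSSESSIVES = ("nya", "ku", "mu")
SUFFIXES = ("kan", "an", "i")

# Prefix form -> letters allowed to follow it (None: no check in this sketch).
PREFIX_FORMS = {
    "meng": VOWELS + "kghq",
    "peng": VOWELS + "kghq",
    "ber": None, "ter": None,
    "me": None, "pe": None, "di": None, "ke": None, "se": None,
}

# Hypothetical root-word dictionary, for illustration only.
ROOTS = {"ambil", "baca", "buku", "main"}


def strip_ending(word, endings):
    """Remove the first matching ending, keeping at least three letters."""
    for ending in endings:
        if word.endswith(ending) and len(word) - len(ending) >= 3:
            return word[: -len(ending)]
    return word


def strip_prefix(word):
    """Return the root if removing one prefix form yields a dictionary word."""
    for form, allowed in PREFIX_FORMS.items():
        rest = word[len(form):]
        if word.startswith(form) and rest in ROOTS:
            if allowed is None or rest[0] in allowed:
                return rest
    return None


def stem(word):
    """Peel endings layer by layer, testing each form against the dictionary."""
    word = word.lower()
    candidates = [word]
    for endings in (PARTICLES, POSSESSIVES, SUFFIXES):
        candidates.append(strip_ending(candidates[-1], endings))
    for candidate in candidates:
        if candidate in ROOTS:
            return candidate
        root = strip_prefix(candidate)
        if root is not None:
            return root
    return word  # unknown word: returned unchanged
```

With this toy dictionary, `stem("mengambil")` yields "ambil" and `stem("dibacakan")` yields "baca". Trying the least-stripped form first keeps the dictionary lookup as the final arbiter, which mirrors the dictionary-driven design the description mentions.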