Effective Techniques for Indonesian Text Retrieval
A thesis submitted for the degree of
Doctor of Philosophy
Jelita Asian B.Comp. Sc.(Hons.),
School of Computer Science and Information Technology,
Science, Engineering, and Technology Portfolio,
RMIT University,
Melbourne, Victoria, Australia.
30th March, 2007

Declaration
I certify that except where due acknowledgment has been made, the work is that of the
author alone; the work has not been submitted previously, in whole or in part, to qualify
for any other academic award; the content of the thesis is the result of work which has been
carried out since the official commencement date of the approved research program; and, any
editorial work, paid or unpaid, carried out by a third party is acknowledged.
Jelita Asian
School of Computer Science and Information Technology
RMIT University
30th March, 2007
Acknowledgments
First and foremost, I thank Justin Zobel, Saied Tahaghoghi, and Falk Scholer for their
patience and general academic and mo…

Information Retrieval: Stemming for Bahasa Indonesia

This time I will discuss stemming. This tutorial is actually part of an assignment given in the course "Sistem Temu Kembali Informasi", which in English is called an "Information Retrieval System"; in computer science the field is commonly known as "Information Retrieval", usually abbreviated "IR".

So what does stemming have to do with IR? Why is stemming needed, and how does the stemming process itself work? Before we get to the tutorial, let us first look at what stemming is.

Stemming is the process of finding the root form (kata dasar) of a word. It is performed by removing all affixes (afiks), whether prefixes (awalan), infixes (sisipan), suffixes (akhiran), or combinations of a prefix and a suffix (konfiks). Stemming is used to reduce a word to its root form according to the morphological structure of Indonesian…
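To make this concrete, here is a minimal Python sketch of dictionary-checked affix stripping. It is only an illustration, not the full algorithm used by real Indonesian stemmers such as Nazief-Adriani (which applies its rules in a careful order and handles spelling changes at morpheme boundaries); the root-word dictionary and affix lists below are small, hypothetical examples.

# Hypothetical mini-dictionary of known Indonesian root words.
ROOT_WORDS = {"ajar", "baca", "main", "makan"}

# Simplified affix lists; longer prefixes are listed before their
# shorter variants so that, e.g., "mem-" is tried before "me-".
PREFIXES = ("meng", "mem", "men", "me", "peng", "pem", "pen", "pe",
            "ber", "ter", "di", "ke", "se")
SUFFIXES = ("kan", "lah", "kah", "nya", "an", "i")

def stem(word):
    """Strip suffixes (akhiran) and then prefixes (awalan), accepting a
    candidate as soon as it appears in the root-word dictionary."""
    word = word.lower()
    if word in ROOT_WORDS:
        return word
    # Collect the word itself plus every suffix-stripped variant.
    candidates = [word]
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            candidates.append(word[:-len(suffix)])
    for candidate in candidates:
        if candidate in ROOT_WORDS:
            return candidate
        for prefix in PREFIXES:
            if candidate.startswith(prefix):
                root = candidate[len(prefix):]
                if root in ROOT_WORDS:
                    return root
    return word  # no known root found; leave the word unchanged

print(stem("makanan"))  # -> makan ("food" -> "eat")
print(stem("membaca"))  # -> baca ("to read" -> "read")
print(stem("bermain"))  # -> main ("to play" -> "play")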

What is Tokenization in Text Mining

In lexical analysis, tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. Tokens can be individual words, keywords, phrases, or even whole sentences; in the process, some characters such as punctuation marks may be discarded. The resulting list of tokens becomes the input for further processing such as parsing or text mining. Tokenization is useful both in linguistics, where it is a form of text segmentation, and in computer science, where it forms part of lexical analysis.
Tokenization relies mostly on simple heuristics in order to separate tokens, following a few steps (a short sketch follows this list):

Tokens or words are separated by whitespace, punctuation marks, or line breaks.
Whitespace or punctuation marks may or may not be included in the resulting list of tokens.
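As a small illustration of these heuristics, the sketch below splits text with a single regular expression: runs of word characters become word tokens, each remaining non-space character (punctuation) becomes a token of its own, and whitespace is discarded. Real tokenizers handle many more cases (abbreviations, contractions, hyphenation), so treat this as an assumption-laden toy rather than a production tokenizer.

import re

def tokenize(text):
    # \w+ matches runs of word characters (the word tokens);
    # [^\w\s] matches any single character that is neither a word
    # character nor whitespace (punctuation tokens). Whitespace
    # itself never appears in the output.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Tokens, words and symbols: split them all."))
# ['Tokens', ',', 'words', 'and', 'symbols', ':', 'split', 'them', 'all', '.']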

Python NLTK Tokenize

# Reads all text in `filename` and returns four parallel structures:
# all words, per-sentence word lists, per-sentence POS tags, and chunk types.
from itertools import chain

from nltk.tag import map_tag, pos_tag
from nltk.tokenize import sent_tokenize, word_tokenize
from pattern.en import parsetree


def load_file(filename='text.txt'):
    """Read all text in `filename` and return the quadruple:
    - list of all words
    - sentences (ordered list of words, per sentence)
    - POS tags (ordered list of tags, per sentence)
    - chunk types, per sentence
    """
    with open(filename, 'rb') as f:
        return process_raw_text(f.read())


def process_raw_text(text):
    flatten = lambda x: list(chain(*x))
    # Map Penn Treebank tags onto the simplified universal tagset.
    simplify_tag = lambda t: map_tag('en-ptb', 'universal', t)

    if isinstance(text, bytes):  # raw bytes from disk; decode first
        text = text.decode('utf8')

    # Chunk types (NP, VP, ...) per sentence, from Pattern's shallow parser.
    chunks = [[c.type for c in t.chunks] for t in parsetree(text)]

    # Split into sentences, then tokenize each sentence into words.
    sentences = [word_tokenize(s) for s in sent_tokenize(text)]

    # POS-tag each sentence, then separate the words from the simplified tags.
    sentences_tags = [[(w, simplify_tag(t)) for w, t in pos_tag(s)]
                      for s in sentences]
    sentences = [[w for w, _ in s] for s in sentences_tags]
    tags = [[t for _, t in s] for s in sentences_tags]

    words = flatten(sentences)
    return words, sentences, tags, chunks
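Note that Pattern targeted Python 2, so on Python 3 a community fork of the pattern package may be needed, and the NLTK calls expect the punkt, averaged_perceptron_tagger, and universal_tagset data packages to have been fetched with nltk.download(). Assuming a plain-text file named text.txt (the default above), the function can then be exercised like this:

words, sentences, tags, chunks = load_file('text.txt')
print(words[:10])    # first ten tokens in the file
print(sentences[0])  # words of the first sentence
print(tags[0])       # universal POS tags for that sentence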