

Showing posts from May, 2017

What is Tokenization in Text Mining

In lexical analysis, tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. Tokens can be individual words, phrases, or even whole sentences, and in the process some characters, such as punctuation marks, are often discarded. The list of tokens becomes the input for further processing such as parsing or text mining. Tokenization is useful both in linguistics, where it is a form of text segmentation, and in computer science, where it forms part of lexical analysis.
Tokenization mostly relies on simple heuristics to separate tokens, following a few steps:

Tokens or words are separated by whitespace, punctuation marks or line breaks
White space or punctuation marks may or may not…
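The whitespace-and-punctuation heuristic above can be sketched with a short regular expression. This is a simplified illustration, not how trained tokenizers such as NLTK's actually work; the `keep_punct` flag shows the choice of keeping or discarding punctuation mentioned earlier:

```python
import re

def simple_tokenize(text, keep_punct=True):
    """Split text into tokens using simple heuristics.

    Tokens are runs of word characters; each punctuation mark becomes
    its own one-character token unless keep_punct is False, in which
    case punctuation is discarded.
    """
    if keep_punct:
        return re.findall(r"\w+|[^\w\s]", text)
    return re.findall(r"\w+", text)

print(simple_tokenize("Hello, world!"))
# → ['Hello', ',', 'world', '!']
print(simple_tokenize("Hello, world!", keep_punct=False))
# → ['Hello', 'world']
```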

Python NLTK Tokenize

def load_file(filename='text.txt'):
    """
    Reads all text in filename, returns the following triplet:
    - list of all words
    - sentences (ordered list of words, per sentence)
    - POS-tags (ordered list of tags, per sentence)
    - chunks
    """
    def process_raw_text(text):
        from nltk.tokenize import sent_tokenize, word_tokenize
        from nltk.tag import pos_tag, map_tag
        from itertools import chain
        from pattern.en import parsetree

        flatten = lambda x: list(chain(*x))
        simplify_tag = lambda t: map_tag('en-ptb', 'universal', t)

        text = text.decode("utf8")
        chunks = [[c.type for c in t.chunks] for t in parsetree(text)]
        sentences = sent_tokenize(text)
        sentences = [word_tokenize(s) for s in sentences]
        sentences_tags = [[(w, simplify_tag(t)) for w, t in pos_tag(s)]
                          for s in sentences]
        sentences = [[w for w, _ in s] for s in sentences_tags]
        tags = [[t for _, t in s] for s in sentences_tags]
        …
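The snippet above is truncated and depends on the Python 2-only pattern.en package, so here is a self-contained sketch of the same idea — split text into sentences, then each sentence into tokens, and flatten into one word list — using plain regular expressions as crude stand-ins for NLTK's trained models (POS tags and chunks are omitted since they need those models):

```python
import re

def process_raw_text(text):
    """Rough approximation of the sentence/word pipeline above.

    Returns (all_words, sentences): a flat token list plus an ordered
    list of token lists, one per sentence.
    """
    # Naive sentence split: break after terminal punctuation + whitespace.
    raw_sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Word/punctuation tokens, keeping simple contractions together.
    sentences = [re.findall(r"\w+(?:'\w+)?|[^\w\s]", s)
                 for s in raw_sentences if s]
    all_words = [w for s in sentences for w in s]
    return all_words, sentences

words, sents = process_raw_text("Good muffins cost a lot. Please buy me two of them!")
print(sents)
# → [['Good', 'muffins', 'cost', 'a', 'lot', '.'],
#    ['Please', 'buy', 'me', 'two', 'of', 'them', '!']]
```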


# -*- coding: utf-8 -*-
# Natural Language Toolkit: Tokenizers
#
# Copyright (C) 2001-2017 NLTK Project
# Author: Edward Loper
#         Steven Bird (minor additions)
# Contributors: matthewmc, clouds56
# URL:
# For license information, see LICENSE.TXT

r"""
NLTK Tokenizer Package

Tokenizers divide strings into lists of substrings. For example,
tokenizers can be used to find the words and punctuation in a string:

    >>> from nltk.tokenize import word_tokenize
    >>> s = '''Good muffins cost $3.88\nin New York. Please buy me
    ... two of them.\n\nThanks.'''
    >>> word_tokenize(s)
    ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.',
    'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']

This particular tokenizer requires the Punkt sente…

Stop Words List French
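The post itself is cut off here, but filtering stop words is straightforward once you have a list. Below is a minimal sketch using a tiny illustrative subset of French stop words; a real list, such as NLTK's `stopwords.words('french')`, contains far more entries:

```python
# A tiny illustrative subset of French stop words; real lists
# (e.g. nltk.corpus.stopwords.words('french')) are much longer.
FRENCH_STOP_WORDS = {"le", "la", "les", "de", "des", "un", "une", "et", "est", "dans"}

def remove_stop_words(tokens):
    """Drop stop words from a token list, comparing case-insensitively."""
    return [t for t in tokens if t.lower() not in FRENCH_STOP_WORDS]

print(remove_stop_words(["Le", "chat", "est", "dans", "la", "maison"]))
# → ['chat', 'maison']
```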