What is Tokenization in Text Mining

Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes input for further processing such as parsing or text mining. Tokenization is useful both in linguistics, where it is a form of text segmentation, and in computer science, where it forms part of lexical analysis.
Tokens can be individual words, keywords, phrases, symbols, or even whole sentences. During tokenization, some characters, such as punctuation marks, may be discarded.
Tokenization relies mostly on simple heuristics in order to separate tokens by following a few steps:

Tokens or words are separated by whitespace, punctuation marks or line breaks
White space or punctuation marks may or may not…
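
The whitespace-and-punctuation heuristic described above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular library does it: the function name simple_tokenize and its keep_punctuation flag are hypothetical, and the regular expression is one assumed way of deciding where a token ends.

```python
import re


def simple_tokenize(text, keep_punctuation=False):
    """Split text into tokens at whitespace, line breaks and punctuation.

    Hypothetical helper for illustration only.
    With keep_punctuation=False, punctuation marks are discarded.
    With keep_punctuation=True, each punctuation mark becomes its own token.
    """
    if keep_punctuation:
        # A run of word characters, or a single non-space, non-word symbol.
        pattern = r"\w+|[^\w\s]"
    else:
        # Runs of word characters only; everything else acts as a separator.
        pattern = r"\w+"
    return re.findall(pattern, text)


print(simple_tokenize("Tokens are separated by whitespace,\nor line breaks."))
print(simple_tokenize("Hello, world!", keep_punctuation=True))
```

A heuristic this simple has clear limits: because every punctuation mark is treated as a boundary, a number such as 3.88 is split into 3 and 88, which is one reason practical tokenizers add further rules.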

NLTK TOKENIZE / TOKENIZER

# -*- coding: utf-8 -*-
# Natural Language Toolkit: Tokenizers
#
# Copyright (C) 2001-2017 NLTK Project
# Author: Edward Loper
#         Steven Bird (minor additions)
# Contributors: matthewmc, clouds56
# URL:
# For license information, see LICENSE.TXT

r"""
NLTK Tokenizer Package

Tokenizers divide strings into lists of substrings.  For example,
tokenizers can be used to find the words and punctuation in a string:

    >>> from nltk.tokenize import word_tokenize
    >>> s = '''Good muffins cost $3.88\nin New York.  Please buy me
    ... two of them.\n\nThanks.'''
    >>> word_tokenize(s)
    ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.',
    'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']

This particular tokenizer requires the Punkt sente…