Skip to main content

Posts

Showing posts with the label NTLK TOKENIZE

NTLK TOKENIZE / TOKENIZER

# -*- coding: utf-8 -*- # Natural Language Toolkit: Tokenizers # # Copyright (C) 2001-2017 NLTK Project # Author: Edward Loper # Steven Bird (minor additions) # Contributors: matthewmc, clouds56 # URL: # For license information, see LICENSE.TXT r""" NLTK Tokenizer Package Tokenizers divide strings into lists of substrings. For example, tokenizers can be used to find the words and punctuation in a string: >>> from nltk.tokenize import word_tokenize >>> s = '''Good muffins cost $3.88\nin New York. Please buy me ... two of them.\n\nThanks.''' >>> word_tokenize(s) ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.'] This particular tokenizer requires the Punkt sente…