Tokenization | NLP Series Part 2

Rohan Kumawat
3 min read · Jun 8, 2021


Natural Language Processing (Part 2)

Tokenization is the most fundamental step in Natural Language Processing. It is a way of separating a piece of text into smaller chunks of text known as tokens. Tokens are the building blocks of Natural Language.

Tokenization breaks a piece of text down into small units called tokens. A token may be a word, part of a word, or just a character like punctuation, and it defines what our NLP models can express: a token is a string with a known meaning.

This blog is part 2 of the Natural Language Processing series.

If you want to read part 1 of this series, go have a read at this link: https://kumawatrohan.medium.com/natural-language-processing-part-1-c3ae3a1e1115

Why Tokenization?

  1. Programming languages work by breaking raw code into tokens and combining them with logic; Natural Language Processing treats text in much the same way.
  2. By breaking the text into small, known fragments, we can apply a small set of rules to combine them into a larger meaning.

Example: If you’ve ever tried to learn a language other than your mother tongue, you’ve probably experienced something similar: you have to pick up the vocabulary, the grammar rules, and the context before you can form a sentence.

Types of Tokenization

There are two types of Tokenization:

  1. Word Tokenization: It splits a piece of text into individual words based on a certain delimiter (both types are illustrated in the sketch after this list).

  2. Sentence Tokenization: It splits a paragraph or a text into sentences.

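As a minimal sketch of the two types in plain Python (the delimiters here are deliberately naive assumptions; real tokenizers such as nltk’s handle many more cases):

    import re

    text = "Tokens are the building blocks of Natural Language. They define what our models can express."

    # Word tokenization with the simplest delimiter: whitespace.
    # Note that punctuation stays attached, e.g. 'express.'
    words = text.split()
    print(words)

    # Sentence tokenization: split after sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    print(sentences)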

Why do we need Sentence Tokenization when we already have Word Tokenization?

Imagine you need to compute the average number of words per sentence; how would you calculate that with word tokens alone? This is one of the reasons we need Sentence Tokenization too.
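Continuing the sketch above, this metric needs sentence boundaries before it can count words:

    import re

    text = "Tokens are the building blocks of Natural Language. They define what our models can express."
    sentences = re.split(r"(?<=[.!?])\s+", text)

    # Average words per sentence: count the words inside each sentence.
    avg_words = sum(len(s.split()) for s in sentences) / len(sentences)
    print(avg_words)  # 7.5 for this two-sentence example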

Implementation

Many libraries can perform Natural Language Processing tasks, but I’ll implement Tokenization using the “nltk” library. Steps to perform Tokenization:

  1. Install the nltk library.
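nltk is published on PyPI, so a standard pip install (run in a terminal) is enough:

    pip install nltk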

  2. After installing the library, we need to import it and download the models and dictionaries it uses for NLP tasks.

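A minimal sketch: the “punkt” package contains the pre-trained models that nltk’s tokenizers rely on, while nltk.download('all') fetches every model and dictionary but is far larger:

    import nltk

    # Download the Punkt models used by sent_tokenize and word_tokenize;
    # nltk.download('all') would grab everything instead.
    nltk.download('punkt')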

  3. We need a paragraph to perform Tokenization on. Here I’ve taken a few lines from Lionel Messi’s interview.

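The original lines were shown as an image in the blog, so the paragraph below is only a placeholder with a similar shape (a few sentences, contractions, punctuation), not Messi’s actual words:

    # Placeholder paragraph; the blog's real Messi quote was an image.
    paragraph = ("I've always said that I want to finish my career here. "
                 "We've been through difficult moments, but this club is my home, "
                 "and I'm grateful for the support of the fans.")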

  4. Perform Sentence Tokenization.

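Assuming the paragraph variable from step 3, sentence tokenization is one call:

    from nltk.tokenize import sent_tokenize

    # Each element of the resulting list is one full sentence.
    sentences = sent_tokenize(paragraph)
    print(sentences)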

  5. Perform Word Tokenization. We can see here that “.”, “’ve”, and “,” are also treated as separate tokens.

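Again assuming the paragraph from step 3, word tokenization works the same way:

    from nltk.tokenize import word_tokenize

    # Punctuation and contraction parts become their own tokens,
    # e.g. "I've" is split into 'I' and "'ve", and '.' stands alone.
    words = word_tokenize(paragraph)
    print(words)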

This is how we perform Tokenization. That was all for Part 2 of the NLP series; Part 3 will be about Stemming and Lemmatization. If it exceeds the 500-600 word cap, I’ll split those two topics into separate blogs.
