TF-IDF | NLP Series Part 6

Rohan Kumawat
4 min read · Jun 12, 2021


Natural Language Processing (Part 6)

In the last blog, we left off with the drawbacks of Bag-of-Words. A popular remedy for those drawbacks is TF-IDF, which stands for “Term Frequency — Inverse Document Frequency”. TF-IDF is a statistical measure that evaluates how relevant a word is to a sentence/paragraph/document in a collection (corpus). The algorithm helps machines work with words by assigning them numerical values or vectors that can be passed to a model. We can think of it as an improved version of Bag-of-Words because:

  1. It weights each word by how informative it is, instead of counting every word as equally important.
  2. It down-weights words that appear in almost every document, so ubiquitous terms stop dominating the representation.

This blog is the last part of the NLP Series. To read about Bag-of-Words, please see the previous blog in this series.

Intuition

1. There are several ways of calculating the “Term Frequency”. The simplest is the raw count of how many times a word appears in a document. This count is then usually normalized, either by the length of the document or by the count of the most frequent word in it, as in the formulas below.
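
The original post shows these formulas as an image. A common formulation of term frequency, written out in LaTeX and covering both normalizations mentioned above (by document length and by the most frequent word), is:

```latex
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}
\qquad \text{or} \qquad
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\max_{t' \in d} f_{t',d}}
```

Here f_{t,d} is the raw count of term t in document d.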

2. The “Inverse Document Frequency” measures how common or rare a word is across the entire document set. The closer this score is to 0, the more common the word. It is calculated by dividing the total number of documents by the number of documents that contain the word, and then applying the logarithm to that ratio.

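The IDF formula also appears as an image in the original; a standard form is:

```latex
\mathrm{idf}(t,D) = \log \frac{N}{|\{d \in D : t \in d\}|}
```

where N is the total number of documents in the corpus D and the denominator is the number of documents that contain the term t.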

So, if the word is ubiquitous and appears in almost every document, this number approaches 0. Otherwise, it grows larger, up to log N for a word that appears in only one document.

3. We multiply these two numbers and get the TF-IDF score of a word in a particular document. The higher the score, the more relevant that word is for that specific document.

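Written out, the score is simply the product of the two quantities above:

```latex
\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \times \mathrm{idf}(t,D)
```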

By applying TF-IDF, common words that occur in every document, like “this”, “what”, and “if”, rank low even though they appear many times throughout the passage/document.

Implementation

1. Import the required libraries.
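
The code in the original appears as a screenshot; a minimal sketch of the imports this walkthrough needs, assuming NLTK for preprocessing and scikit-learn for the TF-IDF vectorizer:

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

# One-time downloads: sentence tokenizer, stopword list, WordNet data.
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
```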

2. To perform TF-IDF, I’ll use the same paragraph.

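The paragraph itself lives in a screenshot in the original post; the text below is a hypothetical stand-in so the rest of the code runs end to end:

```python
# Stand-in paragraph; the original post reuses the paragraph from the
# earlier blogs in this series.
paragraph = """Natural language processing is a field of artificial
intelligence. It helps machines read, understand, and derive meaning
from human language. Applications of natural language processing
include chatbots, machine translation, and sentiment analysis."""
```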

3. Perform Sentence Tokenization and instantiate WordNetLemmatizer() and PorterStemmer().

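A sketch of this step, assuming the paragraph variable defined above:

```python
# Split the paragraph into sentences and create one object of each kind.
sentences = nltk.sent_tokenize(paragraph)
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
```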

4. Apply both WordNetLemmatizer() and PorterStemmer(). I’m applying both to show the difference between their outputs.

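A sketch of both pipelines side by side; the cleaning steps (keeping letters only, lowercasing, removing stopwords) are an assumption, mirroring the usual preprocessing in this series:

```python
stop_words = set(stopwords.words('english'))
stemmed, lemmatized = [], []

for sentence in sentences:
    # Keep letters only, lowercase, and drop stopwords.
    words = re.sub('[^a-zA-Z]', ' ', sentence).lower().split()
    words = [w for w in words if w not in stop_words]
    # Stemming chops suffixes; lemmatization maps to dictionary forms.
    stemmed.append(' '.join(stemmer.stem(w) for w in words))
    lemmatized.append(' '.join(lemmatizer.lemmatize(w) for w in words))

print(stemmed)     # e.g. 'natur languag process ...'
print(lemmatized)  # e.g. 'natural language processing ...'
```

Notice how stemming produces truncated tokens like “natur”, while lemmatization keeps valid dictionary words.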

5. Perform the TF-IDF algorithm.

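A sketch using scikit-learn’s TfidfVectorizer on the lemmatized sentences, treating each sentence as a document:

```python
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(lemmatized)

# Use get_feature_names() instead on scikit-learn < 1.0.
print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())  # one row of TF-IDF scores per sentence
```

Words that occur in every sentence get low scores, while words specific to one sentence score high, exactly as the intuition section describes.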

This blog covered the intuition behind the TF-IDF algorithm and its implementation on a paragraph, and it is the last part of the NLP series. I’m calling this series Phase 1 / Level 1 of the Natural Language Processing series. Next, I’ll implement some projects using these techniques and then come up with a Phase 2 / Level 2 series, where we will learn how to implement all of these algorithms.
