TF-IDF | NLP Series Part 6
In the last blog, we left off with the drawbacks of Bag-of-Words. TF-IDF, which stands for “Term Frequency — Inverse Document Frequency”, addresses those drawbacks. TF-IDF is a statistical measure that evaluates how relevant a word is to a sentence/paragraph/document within a collection/corpus. The algorithm helps machines decipher words by assigning each one a numerical value or vector to pass to a model. We can say that it is an advanced version of Bag-of-Words because:
- It weights words by how informative they are, instead of treating every word as equally important.
- It accounts for how common a word is across the whole corpus, not just its count in one document.
This blog is the last part of the NLP series. To read about Bag-of-Words, please read the previous blog.
Intuition
1. There are several ways of calculating the “Term Frequency”. The simplest is the raw count of how many times a word appears in a document. This count is then normalized, for example by the length of the document or by the frequency of the document’s most frequent word.
2. The “Inverse Document Frequency” measures how common or rare a word is across the entire document set. It is calculated by dividing the total number of documents by the number of documents that contain the word, and then applying the logarithm to that ratio. The closer this value is to 0, the more common the word is: if a word is ubiquitous and appears in nearly every document, the ratio approaches 1 and its logarithm approaches 0, whereas rarer words receive larger values.
3. We multiply these two numbers to get the TF-IDF score of a word in a particular document (see the formulas below). The higher the score, the more relevant that word is for that specific document.
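Putting the three steps together: if f_{t,d} is the raw count of term t in document d, and N is the total number of documents in the corpus D, one common formulation (the exact normalization varies between implementations) is:

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t, D) = \log \frac{N}{\lvert \{ d \in D : t \in d \} \rvert}, \qquad
\text{tf-idf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)
```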
By applying TF-IDF, common words that occur in every document, like “this”, “what”, “if”, etc., rank low even though they appear many times throughout the passage/document.
Implementation
1. Import the required libraries.
2. To perform TF-IDF, I’ll use the same paragraph as in the previous blogs.
3. Perform sentence tokenization and instantiate WordNetLemmatizer() and PorterStemmer().
4. Apply both WordNetLemmatizer() and PorterStemmer(). I’m applying both of them to show the difference between them.
5. Run the TF-IDF algorithm.
A hedged sketch of each step follows below.
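Since the original code isn’t shown here, the sketches below are one plausible way to carry out these steps; they assume NLTK for tokenization, lemmatization, and stemming (matching the class names above) and scikit-learn for the TF-IDF step. Step 1, the imports:

```python
import nltk
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# One-time downloads of the NLTK data used below
# (newer NLTK versions may also need "punkt_tab").
nltk.download("punkt")
nltk.download("wordnet")
nltk.download("stopwords")
```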
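Step 2: the post reuses the paragraph from the earlier blogs, which isn’t reproduced here, so this placeholder text stands in for it:

```python
# Placeholder paragraph; substitute the paragraph from the earlier blogs.
paragraph = (
    "Machines cannot understand raw text, so we convert words into numbers. "
    "Bag-of-Words counts words, but it treats every word as equally important. "
    "TF-IDF instead scores each word by how relevant it is to a document in a corpus."
)
```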
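Step 3: split the paragraph into sentences and create the lemmatizer and stemmer objects:

```python
# Sentence tokenization, plus the two word normalizers we want to compare.
sentences = nltk.sent_tokenize(paragraph)
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
```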
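Step 4: run every (non-stop) word through both. Lemmatization maps words to valid dictionary forms, while stemming chops suffixes and can produce crude roots, e.g. “machines” lemmatizes to “machine” but stems to “machin”:

```python
# Lemmatize and stem each word so the difference is visible side by side.
stop_words = set(stopwords.words("english"))
for sentence in sentences:
    words = [w.lower() for w in nltk.word_tokenize(sentence)
             if w.isalpha() and w.lower() not in stop_words]
    print("lemmatized:", [lemmatizer.lemmatize(w) for w in words])
    print("stemmed:   ", [stemmer.stem(w) for w in words])
```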
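Step 5: build the TF-IDF matrix over the lemmatized sentences with scikit-learn’s TfidfVectorizer. Note that it uses a smoothed IDF variant by default, so the numbers differ slightly from the textbook formula above:

```python
# Each sentence is treated as one "document"; each row of X is its TF-IDF vector.
corpus = [
    " ".join(lemmatizer.lemmatize(w.lower())
             for w in nltk.word_tokenize(s) if w.isalpha())
    for s in sentences
]
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # one TF-IDF vector per sentence
```

Rare, document-specific words end up with the largest weights, which is exactly the behaviour described in the intuition section.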
This blog covered the intuition behind the TF-IDF algorithm and its implementation on a paragraph, and it is the last part of the NLP series. I’m calling this series Phase 1 / Level 1 of the Natural Language Processing series. Next, I’ll be implementing some projects with these techniques, and then I’ll come up with a Phase 2 / Level 2 series where we will learn more about how to implement all of these algorithms.