Everything you need to know about Bag-of-Words | NLP Series Part 5

Rohan Kumawat
3 min read · Jun 11, 2021
Natural Language Processing (Part 5)

“Language is one of the most impressive things humans do.”

We understand language effortlessly, but it is difficult for a machine to understand it directly. We can't simply feed a piece of text to devices that only understand 0s and 1s and expect them to make sense of it. So, to make human language easy for machines to work with, we need to feed it to them as vectors/numbers instead of raw text. Fortunately, researchers have developed many algorithms for this, and we can learn about them quickly. In this blog, we'll talk about the Bag-of-Words model/algorithm.

This blog is part 5 of the Natural Language Processing series. If you haven’t read about Tokenization, Lemmatization, and Stemming, please read the earlier blogs.

The Bag of Words model is the simplest method for converting text into vectors/numbers. It is used in Natural Language Processing and Information Retrieval. The approach is straightforward and flexible: it represents a text by describing the occurrence of words within a document (how many times each word occurs in the text). It involves two things:

  1. A vocabulary of known words.
  2. Creating a Histogram.

Step-by-Step Intuition

  1. Collect data.
  2. Convert it into a list of words.
  3. Apply stemming or lemmatization.
  4. Remove stop words.
  5. Create a histogram (count how many times each word is repeated).
  6. Apply Bag-of-Words (convert the frequency table/histogram into a vector table), as the sketch below shows.
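To make the intuition concrete, here is a minimal sketch of these steps in plain Python. The two toy sentences ("He is a good boy", "She is a good girl") and the assumption that they have already been lowercased and stripped of stop words are mine, not from the original post:

```python
from collections import Counter

# Two toy sentences, already lowercased, with stop words removed
cleaned_sentences = [["good", "boy"], ["good", "girl"]]

# Vocabulary of known words
vocabulary = sorted({word for sentence in cleaned_sentences for word in sentence})

# Histogram (word counts) per sentence, then one vector per sentence
vectors = []
for sentence in cleaned_sentences:
    counts = Counter(sentence)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['boy', 'girl', 'good']
print(vectors)     # [[1, 0, 1], [0, 1, 1]]
```

Each sentence becomes a row of counts over the shared vocabulary, which is exactly the vector table from step 6.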

Implementation

  1. Import the required libraries.
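The original post shows the imports only as a screenshot. A likely set, given the tools named later (NLTK's WordNetLemmatizer and PorterStemmer) plus scikit-learn's CountVectorizer, which I am assuming for the Bag-of-Words step, would be:

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

# One-time downloads of the NLTK resources used below
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
```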

2. To perform Bag-of-Words, I’ll use the same paragraph.

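The paragraph itself appears as an image in the original post and is not reproduced here; as a stand-in for the examples below, here is a short illustrative paragraph (my own placeholder, not the author's original text):

```python
# Placeholder text -- the original post's paragraph is not reproduced here
paragraph = ("Natural language processing helps machines understand human language. "
             "Machines cannot read raw text directly. "
             "Researchers have developed many algorithms to convert text into numbers. "
             "The Bag of Words model is one of the simplest of these algorithms.")
```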

3. Perform Sentence Tokenization and call WordNetLemmatizer() and PorterStemmer().
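A sketch of this step, assuming the `paragraph` variable defined above:

```python
# Split the paragraph into sentences and create the two normalizers
sentences = nltk.sent_tokenize(paragraph)
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
```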

4. Apply both WordNetLemmatizer() and PorterStemmer(). I'm applying both to show the difference between them.

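The original post shows the two outputs as side-by-side screenshots (stemming on the left, lemmatization on the right). A sketch of how both corpora might be built, assuming lowercasing and stop-word removal as in the intuition steps above:

```python
stop_words = set(stopwords.words('english'))
corpus_stemmed = []
corpus_lemmatized = []

for sentence in sentences:
    # Keep letters only, lowercase, and split into words
    words = re.sub('[^a-zA-Z]', ' ', sentence).lower().split()
    words = [word for word in words if word not in stop_words]
    corpus_stemmed.append(' '.join(stemmer.stem(word) for word in words))
    corpus_lemmatized.append(' '.join(lemmatizer.lemmatize(word) for word in words))
```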

5. Apply the Bag-of-Words algorithm.

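A sketch of this step using scikit-learn's CountVectorizer (my assumption; the original code is shown only as a screenshot), applied to the two corpora built above:

```python
cv = CountVectorizer()

# One row per sentence, one column per word in the vocabulary
X_lemmatized = cv.fit_transform(corpus_lemmatized).toarray()
print(len(cv.get_feature_names_out()))  # vocabulary size for the lemmatized corpus
                                        # (use get_feature_names() on older scikit-learn)

X_stemmed = CountVectorizer().fit_transform(corpus_stemmed)
print(X_stemmed.shape[1])               # vocabulary size for the stemmed corpus
```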

After executing these steps, we can see that lemmatization extracts 36 words, while stemming extracts only 34.

Drawbacks of Bag-of-Words

  1. The model ignores the position of words in the text, and word order is a crucial piece of information.
  2. The Bag-of-Words model doesn't capture the semantics of words.
  3. The fixed vocabulary is another big problem for the Bag-of-Words model: if a new word appears in a sentence, the model cannot recognize or analyze it. The sketch below illustrates these issues.
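A small sketch with my own example sentences (not from the original post) that shows all three drawbacks:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the movie was good, not bad", "the movie was bad, not good"]
cv = CountVectorizer()
X = cv.fit_transform(docs).toarray()

print(cv.get_feature_names_out())
print(X)  # Both rows are identical: word order, and therefore meaning, is lost

# A word outside the fitted vocabulary is simply ignored
print(cv.transform(["the movie was fantastic"]).toarray())
```

The two sentences mean opposite things yet get the same vector, and the unseen word "fantastic" simply disappears from the transformed output.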

This blog was all about Bag-of-Words. I hope you learned something from it. The upcoming blog will be about TF-IDF.
