Everything you need to know about Bag-of-Words | NLP Series Part 5

Rohan Kumawat
3 min read · Jun 11, 2021
Natural Language Processing (Part 5)

“Language is one of the most impressive things humans do.”

We understand language effortlessly, but it is difficult for a machine to understand it directly. We can't simply feed a piece of text to devices that only understand 0s and 1s and expect them to make sense of it. So, to make human language easy for machines to work with, we need to feed it to them as vectors/numbers instead of raw text. Fortunately, researchers have developed many algorithms for this, and we can learn about them quickly. In this blog, we'll talk about the Bag-of-Words model/algorithm.

This blog is part 5 of the Natural Language Processing series. If you haven’t read about Tokenization, Lemmatization, and Stemming, please read the earlier blogs.

The Bag of Words model is the simplest method for converting text into vectors/numbers. It is used in Natural Language Processing and Information Retrieval. The approach is straightforward and flexible: it represents a text by describing the occurrence of words within a document (how many times each word occurs in the text). It involves two things:

  1. A vocabulary of known words.
  2. Creating a Histogram.

Step-by-Step Intuition

  1. Collect data.
  2. Convert it into a list of words.
  3. Apply stemming or lemmatization.
  4. Remove stop words.
  5. Create a histogram (count how many times each word is repeated).
  6. Apply Bag-of-Words (convert the frequency table/histogram into a vector table), as the sketch below shows.
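To make the intuition concrete, here is a minimal sketch of these steps in plain Python. The two toy sentences ("He is a good boy", "She is a good girl") and the assumption that they have already been lowercased and stripped of stop words are mine, not from the original post:

```python
from collections import Counter

# Two toy sentences, already lowercased, with stop words removed
cleaned_sentences = [["good", "boy"], ["good", "girl"]]

# Vocabulary of known words
vocabulary = sorted({word for sentence in cleaned_sentences for word in sentence})

# Histogram (word counts) per sentence, then one vector per sentence
vectors = []
for sentence in cleaned_sentences:
    counts = Counter(sentence)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['boy', 'girl', 'good']
print(vectors)     # [[1, 0, 1], [0, 1, 1]]
```

Each sentence becomes a row of counts over the shared vocabulary, which is exactly the vector table from step 6.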

Implementation

  1. Import the required libraries.
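The original post shows the imports only as a screenshot. A likely set, given the tools named later (NLTK's WordNetLemmatizer and PorterStemmer) plus scikit-learn's CountVectorizer, which I am assuming for the Bag-of-Words step, would be:

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

# One-time downloads of the NLTK resources used below
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
```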

2. To perform Bag-of-Words, I’ll use the same paragraph.

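The paragraph itself appears as an image in the original post and is not reproduced here; as a stand-in for the examples below, here is a short illustrative paragraph (my own placeholder, not the author's original text):

```python
# Placeholder text -- the original post's paragraph is not reproduced here
paragraph = ("Natural language processing helps machines understand human language. "
             "Machines cannot read raw text directly. "
             "Researchers have developed many algorithms to convert text into numbers. "
             "The Bag of Words model is one of the simplest of these algorithms.")
```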

3. Perform Sentence Tokenization and call WordNetLemmatizer() and PorterStemmer().
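A sketch of this step, assuming the `paragraph` variable defined above:

```python
# Split the paragraph into sentences and create the two normalizers
sentences = nltk.sent_tokenize(paragraph)
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
```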

4. Apply both WordNetLemmatizer() and PorterStemmer(). I'm applying both to show the difference between them.

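The original post shows the two outputs as side-by-side screenshots (stemming on the left, lemmatization on the right). A sketch of how both corpora might be built, assuming lowercasing and stop-word removal as in the intuition steps above:

```python
stop_words = set(stopwords.words('english'))
corpus_stemmed = []
corpus_lemmatized = []

for sentence in sentences:
    # Keep letters only, lowercase, and split into words
    words = re.sub('[^a-zA-Z]', ' ', sentence).lower().split()
    words = [word for word in words if word not in stop_words]
    corpus_stemmed.append(' '.join(stemmer.stem(word) for word in words))
    corpus_lemmatized.append(' '.join(lemmatizer.lemmatize(word) for word in words))
```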

5. Apply the Bag-of-Words algorithm.

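A sketch of this step using scikit-learn's CountVectorizer (my assumption; the original code is shown only as a screenshot), applied to the two corpora built above:

```python
cv = CountVectorizer()

# One row per sentence, one column per word in the vocabulary
X_lemmatized = cv.fit_transform(corpus_lemmatized).toarray()
print(len(cv.get_feature_names_out()))  # vocabulary size for the lemmatized corpus
                                        # (use get_feature_names() on older scikit-learn)

X_stemmed = CountVectorizer().fit_transform(corpus_stemmed)
print(X_stemmed.shape[1])               # vocabulary size for the stemmed corpus
```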

After executing these steps, we can see that lemmatization extracts 36 words, while stemming extracts only 34.

Drawbacks of Bag-of-Words

  1. The model ignores the position of words in the text, and word order is a crucial piece of information.
  2. The Bag-of-Words model doesn't capture the semantics of words.
  3. The fixed vocabulary is another big problem for the Bag-of-Words model: if a new word appears in a sentence, the model cannot recognize or analyze it. The sketch below illustrates these issues.
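A small sketch with my own example sentences (not from the original post) that shows all three drawbacks:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the movie was good, not bad", "the movie was bad, not good"]
cv = CountVectorizer()
X = cv.fit_transform(docs).toarray()

print(cv.get_feature_names_out())
print(X)  # Both rows are identical: word order, and therefore meaning, is lost

# A word outside the fitted vocabulary is simply ignored
print(cv.transform(["the movie was fantastic"]).toarray())
```

The two sentences mean opposite things yet get the same vector, and the unseen word "fantastic" simply disappears from the transformed output.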

This blog was all about Bag-of-Words. I hope you learned something from it. The upcoming blog will be about TF-IDF.
