Stemming | NLP Series Part 3
Stemming is the process of reducing a word to its stem by stripping affixes such as suffixes and prefixes. Put differently, it reduces inflected forms to a common root, mapping a group of related words to the same stem even if the stem itself is not a valid word in the language.
The stem is the part of the word to which inflectional (form-changing) affixes such as -ed, -ize, -s, and de- are added. Stems are created by removing the suffixes or prefixes attached to a word, so stemming a word or sentence may result in terms that are not actual words.
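To make the idea of removing suffixes concrete, here is a toy sketch in plain Python. It is not the Porter algorithm or any library's method; the `SUFFIXES` list and the `naive_stem` helper are made up for illustration, and the minimum stem length of three characters is an arbitrary assumption.

```python
# Toy suffix stripper: NOT the Porter algorithm, only an illustration.
# The suffix list is hypothetical and hand-picked; suffixes are tried in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["jumps", "jumped", "jumping", "studies", "studied"]
stems = [naive_stem(w) for w in words]
print(stems)  # ['jump', 'jump', 'jump', 'studi', 'studi']
```

Note how "studies" and "studied" both map to "studi", which is not a real English word. That is acceptable for stemming: the goal is a shared stem, not a dictionary form.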
There are various types of Stemming Algorithms available, but we will not cover them in this blog. The stemmer used here works by Suffix Stripping. Before jumping into the implementation of Stemming, let’s understand Stop Words.
This blog is part 3 of an NLP series. To learn about Tokenization, please read this:
Stop Words
Stop Words are words that occur very frequently in a text, such as “the”, “a”, “an”, “that”, and “this”. A sentence would be grammatically incomplete without them, but they carry little meaning on their own. For many tasks, a machine processing a piece of text does not need them. For example, a search engine like Google may drop the stop words from a query before looking up results.
Why do we remove Stop Words?
If I ask you, “What was the outcome of the World Cup match between Argentina and France in 2018?” your brain will look at the main keywords here: “outcome”, “World Cup match”, “Argentina”, “France”, and “2018”, and then you’ll give me the answer.
Removing stop words strips the low-information words from our text so that we can focus on the critical data. For many tasks, removing these words has little or no negative effect on the model we are going to train.
Removing these stop words reduces the dataset size and thus reduces the training time too.
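As a rough sketch of this filtering, the snippet below removes stop words from the World Cup question using only the standard library. The `STOP_WORDS` set and the `remove_stop_words` helper are hypothetical and hand-picked for this one sentence; real stop-word lists are much longer.

```python
import re

# Hypothetical, hand-picked stop list for illustration only.
STOP_WORDS = {"what", "was", "the", "of", "between", "and", "in", "a", "an", "that", "this"}


def remove_stop_words(text: str) -> list[str]:
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


question = "What was the outcome of the World Cup match between Argentina and France in 2018?"
keywords = remove_stop_words(question)
print(keywords)
# ['outcome', 'world', 'cup', 'match', 'argentina', 'france', '2018']
```

The 15 tokens of the question shrink to 7 keywords, which is the kind of size reduction described above.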
Implementation
Steps to perform Stemming are as follows:
1. Import the libraries.
2. To perform stemming, I’ll use the same paragraph.
3. Perform Sentence Tokenization and create a PorterStemmer() object.
4. Perform Stemming.
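The four steps above can be sketched roughly as follows. This is a minimal sketch, not the original post's code: the paragraph is a made-up placeholder, the stop list is a hypothetical mini set, and sentences and words are split with simple regular expressions so that the example runs without downloading any NLTK data. In practice, `nltk.sent_tokenize`, `nltk.word_tokenize`, and `stopwords.words("english")` are the usual choices, but they need the corresponding NLTK data packages installed. Only `PorterStemmer` is taken from NLTK here.

```python
import re

from nltk.stem import PorterStemmer  # Step 1: import the libraries

# Step 2: a placeholder paragraph (assumed here; use your own text).
paragraph = (
    "The engineers connected the new servers. "
    "They are connecting the remaining machines this week. "
    "A stable connection matters for the training jobs."
)

# Hypothetical mini stop list; NLTK's stopwords corpus is the usual choice.
STOP_WORDS = {"the", "they", "are", "a", "this", "for"}

# Step 3: split the paragraph into sentences and create the stemmer.
# The regex split is a simple stand-in for a real sentence tokenizer.
sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
stemmer = PorterStemmer()

# Step 4: tokenize each sentence, drop stop words, and stem what is left.
stemmed_sentences = []
for sentence in sentences:
    words = re.findall(r"[A-Za-z]+", sentence.lower())
    stems = [stemmer.stem(w) for w in words if w not in STOP_WORDS]
    stemmed_sentences.append(" ".join(stems))

for line in stemmed_sentences:
    print(line)
```

In this sketch, "connected", "connecting", and "connection" should all reduce to the stem "connect", so each of the three output lines contains the same token for the same underlying idea.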
These steps show how we perform Stemming on textual data. There is more to Stemming, such as the various algorithms available and why PorterStemmer() was used here. I’ll try to cover that in a different blog; this one is meant to get you started with Natural Language Processing. This was Part 3 of the NLP series.