Fake News Classifier Part 1 | NLP Series II

Rohan Kumawat
4 min read · Jun 20, 2021


Fake News Part 1

In Natural Language Processing Series II, we are building a Fake News Classifier model. This blog covers how to handle the dataset at the start and get a first model working; the following blogs of this project will cover how to improve it. In the last blog, we built an SMS Spam Classifier; you can read about it at this link:

Implementation

Now let’s implement fake news classification using a dataset provided by the Kaggle community. It comes from an InClass competition that Kaggle launched three years ago. Here’s a link to the dataset:

First, we’ll load the required libraries. Here’s what we need:

  1. NumPy.
  2. Pandas.
  3. Regular expressions (re).
  4. Natural Language Toolkit (nltk).
  5. Stopwords from the nltk library.
  6. PorterStemmer from the nltk library, for stemming.
  7. CountVectorizer from scikit-learn, for the bag-of-words model.
  8. train_test_split from scikit-learn.
  9. Naive Bayes (MultinomialNB) from scikit-learn.
  10. accuracy_score from scikit-learn.
Importing the libraries
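
If you’d rather copy-paste than retype from the screenshot, here’s a minimal sketch of those imports, assuming the usual pandas, NLTK and scikit-learn stack:

    import numpy as np
    import pandas as pd
    import re
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from sklearn.feature_extraction.text import CountVectorizer  # bag-of-words
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import accuracy_score

    nltk.download('stopwords')  # one-time download of the stopword list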

There are three datasets here: training, testing and submission. The training dataset has five columns, the test dataset has four, and the submission dataset has two. Most Kaggle competitions provide these three files: the train dataset is used to fit our model, the test dataset to evaluate it, and the submission file to submit our results.

Focus on Title

Let’s look at our dataset. The training dataset contains five columns: id, title, author, text and label. We’ll focus mainly on the title column in this blog and then see how that turns out.

Left: Importing the datasets; Right: Train dataset snippet.
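
Loading the files might look like the sketch below; the exact file names are an assumption based on what the competition’s data tab typically provides:

    # File names (train.csv, test.csv, submit.csv) are assumptions.
    train = pd.read_csv('train.csv')
    test = pd.read_csv('test.csv')
    submission = pd.read_csv('submit.csv')

    print(train.shape, test.shape, submission.shape)
    train.head()  # peek at the first few rows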

This dataset has null values, so we have to take care of them. In this blog, we’ll deal with them by dropping those rows from the training dataset. Dropping rows leaves gaps in the index, so it’s no longer in order; we then need to reset it.

Heatmap to check Null values.
After index 5 comes 7! (That’s why we need to reset the index.)
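
A minimal sketch of that cleanup step:

    # Drop the rows containing null values, then reset the index so it
    # runs 0, 1, 2, ... again; drop=True discards the old index instead
    # of keeping it as an extra column.
    train = train.dropna()
    train = train.reset_index(drop=True)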

Now, we’ll repeat the same process we followed in the Spam Classifier project: preprocess the data, then apply the bag-of-words model to convert the text into vectors. Once the data is ready to feed to a model, we train-test split it and, for this blog, fit a Naive Bayes classifier. Predicting on the held-out split gives an excellent accuracy of around 90.19%.

Left: Train dataset preprocessing; Right: Fitting our data.
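
Here’s a sketch of that pipeline. The regex, the max_features=5000 vocabulary cap, and the 80/20 split are illustrative choices, not necessarily the exact values from the notebook:

    ps = PorterStemmer()
    stop_words = set(stopwords.words('english'))

    # Clean each title: keep letters only, lower-case, drop stopwords, stem.
    corpus = []
    for title in train['title']:
        words = re.sub('[^a-zA-Z]', ' ', title).lower().split()
        corpus.append(' '.join(ps.stem(w) for w in words if w not in stop_words))

    # Bag-of-words; max_features=5000 is an assumed vocabulary cap.
    cv = CountVectorizer(max_features=5000)
    X = cv.fit_transform(corpus).toarray()
    y = train['label'].values

    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=0)

    model = MultinomialNB()
    model.fit(X_train, y_train)
    print(accuracy_score(y_val, model.predict(X_val)))  # ~0.90 on the held-out split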

Now it’s time to predict on the test dataset provided by Kaggle. Before predicting, we have to preprocess this data as well and convert it into vectors; then we predict the labels and use them to update our submission file.

Pre-processing the test dataset and predicting.
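
A sketch of the test-side steps. Note that we reuse the CountVectorizer already fitted on the training corpus and only call transform on it; the 'label' column name in the submission file is an assumption:

    # Reuse the same cleaning; titles can be null in the test set,
    # so fill them with empty strings first.
    test_corpus = []
    for title in test['title'].fillna(''):
        words = re.sub('[^a-zA-Z]', ' ', title).lower().split()
        test_corpus.append(' '.join(ps.stem(w) for w in words if w not in stop_words))

    # transform (not fit_transform): keep the vocabulary fitted on train.
    X_test = cv.transform(test_corpus).toarray()

    # 'label' as the submission column name is an assumption.
    submission['label'] = model.predict(X_test)
    submission.to_csv('submission.csv', index=False)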

Now we’re ready to submit our results so that Kaggle can score them against the actual answers. After submitting, we got only 65.64% accuracy. That’s a big drop from our ~90% validation score, and it’s not good.

Clearly, the title feature alone doesn’t let us predict very well whether news is fake. Can we go beyond this accuracy and build a model that gives better results? In Part 2 of this project, we’ll use the text feature, lemmatization, and TF-IDF. Let’s find out if we can climb the leaderboard.


Rohan Kumawat

Technology Enthusiast | Data Science | Artificial Intelligence | Books | Productivity | Blockchain