Fake news classifier using Multilayer Perceptron model
Introduction
The reliability of information on digital media has become an issue affecting the social fabric of our society. Information that is biased, distorted, or amplified can spread on social media across millions of users, breaking geographical barriers, and cause real-world impact within a matter of minutes.
A study found that humans are able to detect fake news only about 75 percent of the time. Our objective is to build a model that classifies a news article as fake or real based on the title of the article.
Dataset
The dataset used is from Kaggle and can be found here. Our dataset has the following attributes:
- id: a unique ID for the news article
- title: the title of the news article
- author: the author of the news article
- text: the text of the article
- label: a label that marks the article as potentially fake or real (0: fake, 1: real)
The main aim is to predict whether a given news article is fake or real. In this article, we classify the news using the title rather than the full text of the article, in order to keep our corpus size small.
import pandas as pd

data = pd.read_csv('train.csv')
data.head(10)
Data Preprocessing
The total number of rows in our dataset is 20,800. Dropping null values and resetting the index leaves us with 18,285 rows.
len(data) #20800
data.dropna(inplace=True)
data.reset_index(inplace=True)
len(data) #18285
Our preprocessing involves six steps, which we apply to every title in the dataset (a quick example on a single title follows the setup code below).
- Split each title into a list of words.
- Remove all words that are punctuation, using string.punctuation.
- Remove all words containing characters other than [a-zA-Z], using the isalpha() method.
- Remove all stop words from the title.
- Make all words in the title lowercase.
- Keep a count of every word.
import string
import nltk

nltk.download('stopwords')  # download the stop word list if it is not already available
stop_words = nltk.corpus.stopwords.words('english')
corpus = []
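Before wrapping these steps into a function, here is a quick sketch of them applied to a single made-up title (the example string and its printed output are for illustration only, not from the dataset):
example = "Scientists reveal the truth about coffee, study says"  # hypothetical title
words = example.split()                                           # 1. split into words
words = [w for w in words if w not in string.punctuation]         # 2. drop punctuation tokens
words = [w for w in words if w.isalpha()]                         # 3. keep purely alphabetic words
words = [w for w in words if w not in stop_words]                 # 4. drop stop words
words = [w.lower() for w in words]                                # 5. lowercase
print(words)  # ['scientists', 'reveal', 'truth', 'study', 'says']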
Creating a method for cleaning our text.
def clean_text(data, vocab):
    for i in range(len(data)):
        txt = data['title'][i]
        txt = txt.split()
        txt = [word for word in txt if word not in string.punctuation]  # removing punctuation
        txt = [word for word in txt if word.isalpha()]  # removing words with non-letter characters
        txt = [word for word in txt if word not in stop_words]  # removing all the stop words
        txt = [word.lower() for word in txt]  # making all the words lowercase
        seq = ' '.join(txt)
        split_seq = seq.split()
        vocab.update(split_seq)  # keeping a count
        for word in split_seq:  # putting all the words in our corpus
            corpus.append(word)
    return corpus
Cleaning our data.
from collections import Counter
vocab = Counter()
clean_text(data, vocab)
Printing the most common 100 words.
print(vocab.most_common(100))
Removing all the words whose occurrence in our vocabulary is less than 20.
min_occurrence = 20
tokens = [word for word, count in vocab.items() if count >= min_occurrence]
After all the preprocessing we are left with 1,243 words in our tokens list.
print(len(tokens)) # 1243
print(tokens[:100])
We now preprocess the titles again, keeping only the words that are present in our tokens list.
def clean_data(txt):
    txt = txt.split()
    txt = [word for word in txt if word not in string.punctuation]
    txt = [word for word in txt if word.isalpha()]
    txt = [word for word in txt if word not in stop_words]
    txt = [word.lower() for word in txt]
    txt = [word for word in txt if word in tokens]
    seq = ' '.join(txt)
    return seq
We build a titles list with all the preprocessed titles and a labels list, which specifies whether each article is fake.
labels = []
titles = []
for index in range(len(data)):
    clean_txt = clean_data(data['title'][index])  # cleaning up the title
    titles.append(clean_txt.split())              # adding the preprocessed title to our list
    labels.append(data['label'][index])           # adding the label for that article
Printing the titles and labels of the first four articles.
print(titles[0],labels[0])
print(titles[1],labels[1])
print(titles[2],labels[2])
print(titles[3],labels[3])
The next step is to encode the titles numerically. The Tokenizer class in the Keras API can be used for this: we fit the tokenizer on all the titles in our dataset, and it builds a vocabulary of all the tokens and a consistent mapping from each token to a unique integer.
from tensorflow.keras.preprocessing.text import Tokenizer

def create_tokenizer(titles):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(titles)
    return tokenizer
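As a small illustration of this mapping, here is the tokenizer's word_index for two made-up titles (the exact integers shown are just an example; more frequent words get smaller indices):
toy_titles = [['economy', 'shows', 'growth'], ['growth', 'slows', 'again']]  # hypothetical titles
toy_tokenizer = create_tokenizer(toy_titles)
print(toy_tokenizer.word_index)
# e.g. {'growth': 1, 'economy': 2, 'shows': 3, 'slows': 4, 'again': 5}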
Now that the mapping of words to integers has been prepared, we can use it to encode the titles in our dataset as fixed-length vectors. We do this by calling the texts_to_matrix() function on the Tokenizer.
# Create the tokenizer
tokenizer = create_tokenizer(titles)
X = tokenizer.texts_to_matrix(titles,mode='freq')
Checking the shape.
print(X.shape) # (18285, 1244)
import numpy as np
y = np.array(labels)
print(y.shape) #(18285,)
Converting into NumPy arrays and naming them Xtrain and Ytrain for the split below.
Xtrain, Ytrain = np.array(X), np.array(y)
Preparing Training and Test dataset
Splitting our dataset into training (75%) and test (25%) sets. Using the stratify keyword while splitting avoids class imbalance between the training and test sets. For example, if the labels are 25% zeros and 75% ones, stratify=y makes sure that the random split also contains 25% 0's and 75% 1's.
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(Xtrain,Ytrain,test_size=0.25,stratify=Ytrain)
Printing the shape of our training and test dataset.
print(X_train.shape) #(13713, 1244)
print(X_test.shape) # (4572, 1244)
print(Y_train.shape) # (13713,)
print(Y_test.shape) # (4572,)
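As a quick sanity check, we can confirm that stratify kept the proportion of 1's roughly equal in both splits:
print(Y_train.mean())  # fraction of label-1 articles in the training set
print(Y_test.mean())   # should be roughly the same fraction in the test set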
Model Creation
A multilayer perceptron (MLP) is a class of artificial neural network consisting of multiple layers of perceptrons. We use a tuned MLP model to predict whether a given news article is fake or not.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
We will use four hidden layers with 50, 100, 150, and 200 neurons respectively, each with a rectified linear (ReLU) activation function. The output layer is a single neuron with a sigmoid activation function, predicting 0 for a fake and 1 for a real news article. The network is trained using the efficient Adam implementation of gradient descent and the binary cross-entropy loss function, which is suited to binary classification problems. We use dropout to reduce overfitting.
n_words = X_train.shape[1]  # number of input features (1244)
model = Sequential()
model.add(Dense(50,input_shape=(n_words,),activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(100,activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(150,activation='relu'))
model.add(Dropout(0.4))
model.add(Dense(200,activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1,activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()
model.fit(X_train,Y_train,epochs=15,verbose=2)
Evaluating our model on the test dataset gives us around 92.6% accuracy.
model.evaluate(X_test,Y_test)
Confusion matrix
from sklearn.metrics import confusion_matrix
import seaborn as sns

y_pred = (model.predict(X_test) > 0.5).astype('int32')  # turn sigmoid probabilities into 0/1 predictions
cm = confusion_matrix(Y_test, y_pred)
group_names = ['True Neg', 'False Pos', 'False Neg', 'True Pos']
group_counts = ['{0:0.0f}'.format(value) for value in cm.flatten()]
group_percentages = ['{0:.2%}'.format(value) for value in cm.flatten() / np.sum(cm)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in zip(group_names, group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2, 2)
sns.heatmap(cm, annot=labels, fmt='', cmap='Blues')
Comparing different word scoring models
The Tokenizer in the Keras API has 4 different ways of scoring words (illustrated on a toy example after the list):
- binary: words are marked based on whether they are present (1) or not (0).
- count: the occurrence count of each word is recorded as an integer.
- tfidf: each word is scored proportionally to its frequency within the title and inversely to how common it is across all the titles.
- freq: each word is scored based on its frequency of occurrence within the title.
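To see the difference between these modes concretely, here is a minimal sketch on a tiny made-up corpus of two already-tokenized documents; the printed matrices show how each mode scores the same words differently:
from tensorflow.keras.preprocessing.text import Tokenizer

toy_docs = [['markets', 'rise', 'today'], ['markets', 'markets', 'fall']]  # hypothetical documents
tk = Tokenizer()
tk.fit_on_texts(toy_docs)
for m in ['binary', 'count', 'tfidf', 'freq']:
    print(m)
    print(tk.texts_to_matrix(toy_docs, mode=m))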
We will use our model to evaluate all the word scoring models.
def create_tokenizer(train_titles, test_titles, mode):
    tk = Tokenizer()
    tk.fit_on_texts(train_titles)  # fit the tokenizer on the training titles only
    Xtrain = tk.texts_to_matrix(train_titles, mode=mode)
    Xtest = tk.texts_to_matrix(test_titles, mode=mode)
    return Xtrain, Xtest
We will evaluate our model for all 4 types of encoding; however, we estimate performance as the average over multiple runs (10), due to the stochastic nature of neural networks.
def evaluate_model(Xtrain, ytrain, Xtest, ytest):
    scores = list()
    for i in range(10):
        model.fit(Xtrain, ytrain, epochs=15, verbose=2)
        loss, acc = model.evaluate(Xtest, ytest)
        scores.append(acc)
    return scores
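Note that evaluate_model as written refits the single model defined earlier. Because the tokenizer is refit for each comparison below (so the input width can differ from 1,244) and because each run should ideally start from fresh weights, a safer variant rebuilds the network inside the loop. A minimal sketch reusing the same architecture (define_model is a helper introduced here for illustration, not part of the original code):
def define_model(n_words):
    # Same architecture as above, rebuilt from scratch for each evaluation run
    model = Sequential()
    model.add(Dense(50, input_shape=(n_words,), activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(100, activation='relu'))
    model.add(Dropout(0.3))
    model.add(Dense(150, activation='relu'))
    model.add(Dropout(0.4))
    model.add(Dense(200, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

def evaluate_model(Xtrain, ytrain, Xtest, ytest):
    scores = list()
    n_words = Xtrain.shape[1]
    for i in range(10):
        model = define_model(n_words)  # fresh weights every run
        model.fit(Xtrain, ytrain, epochs=15, verbose=2)
        loss, acc = model.evaluate(Xtest, ytest)
        scores.append(acc)
    return scores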
Creating encoding for each word scoring type and evaluating our model.
import pandas as pd
# Re-split the raw preprocessed titles and labels so each mode can be encoded from scratch
titles_train, titles_test, ytrain, ytest = train_test_split(titles, y, test_size=0.25, stratify=y)
modes = ['binary', 'count', 'tfidf', 'freq']
results = pd.DataFrame()
for mode in modes:
    X_train, X_test = create_tokenizer(titles_train, titles_test, mode)
    results[mode] = evaluate_model(X_train, ytrain, X_test, ytest)
Printing our results.
print(results)
Getting the statistical summary of our results.
print(results.describe())
Plotting a box plot to visualize the accuracies obtained for each word scoring model.
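A minimal sketch of the box plot, using the DataFrame's built-in plotting via matplotlib:
import matplotlib.pyplot as plt

results.boxplot()  # one box per word scoring mode
plt.ylabel('accuracy')
plt.show()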
According to the above visualization, the freq word scoring method gives better accuracy on average than the other scoring methods.
Pre-trained embeddings such as word2vec or GloVe could be used to further improve the accuracy of our model.