Fake news classifier using Multilayer Perceptron model

Saif Gazali

Introduction

The reliability of information on digital media has become an issue affecting the social fabric of our society. Information that is biased, distorted, or amplified can spread across social media platforms spanning millions of users, breaking geographical barriers and causing real-world impact within a matter of minutes.

A study found that humans are able to detect fake news only 75 percent of the time. Our objective is to build a model that classifies a news article as fake or real based on the title of the article.

Dataset

The dataset used is from Kaggle and can be found here. It has the following attributes:

  1. id: a unique id for the news article
  2. title: the title of the news article
  3. author: the author of the news article
  4. text: the text of the article
  5. label: a label that marks the article as potentially fake or real (0: fake, 1: real)

The main aim is to predict whether a given news article is fake or real. In this article, we will classify the news as fake or real based on the title rather than the full text of the article, to keep our corpus size small.

import pandas as pd
import numpy as np

data = pd.read_csv('train.csv')
data.head(10)
Dataset — Title is used to predict if an article is fake or not

Data Preprocessing

The total number of rows in our dataset is 20,800. Dropping null values and resetting the index leaves us with 18,285 rows.

len(data) #20800
data.dropna(inplace=True)
data.reset_index(inplace=True)
len(data) #18285

Our preprocessing involves six steps, which we apply to every title in the dataset.

  1. Split each title into a list of words.
  2. Remove all the words that are punctuation using string.punctuation.
  3. Remove all the words containing characters other than [a-zA-Z] using the isalpha() method.
  4. Remove all the stop words from the title.
  5. Make all the words in the title lowercase.
  6. Keep a count of all the words.

Importing the required libraries and loading the list of English stop words.

import string
import nltk

# nltk.download('stopwords') # run once if the stopwords corpus is not already available
stop_words = nltk.corpus.stopwords.words('english')
corpus = []

Creating a method for cleaning our text.

def clean_text(data,vocab):
    for i in range(len(data)):
        txt = data['title'][i]
        txt = txt.split()

        txt = [word for word in txt if word not in string.punctuation] # removing punctuation
        txt = [word for word in txt if word.isalpha()] # removing all the words containing characters other than letters
        txt = [word for word in txt if word not in stop_words] # removing all the stop words
        txt = [word.lower() for word in txt] # making all the words lowercase

        seq = ' '.join(txt)
        split_seq = seq.split()

        vocab.update(split_seq) # keeping a count of every word

        for index in range(len(split_seq)): # putting all the words in our corpus
            corpus.append(split_seq[index])
    return corpus

Cleaning our data.

from collections import Counter
vocab = Counter()
clean_text(data,vocab)

Printing the most common 100 words.

print(vocab.most_common(100))
Most common 100 words

Removing all the words whose occurrence in our vocabulary is less than 20.

min_occurence = 20
tokens = [word for word,count in vocab.items() if count >= min_occurence]

After this filtering we are left with 1,243 words in our tokens list.

print(len(tokens)) # 1243
print(tokens[:100])
Words in our corpus

We preprocess our titles again, this time keeping only the words that are present in our tokens list.

def clean_data(txt):
    txt = txt.split()
    txt = [word for word in txt if word not in string.punctuation]
    txt = [word for word in txt if word.isalpha()]
    txt = [word for word in txt if word not in stop_words]
    txt = [word.lower() for word in txt]
    txt = [word for word in txt if word in tokens]
    seq = ' '.join(txt)
    return seq
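
As a quick sanity check, we can run the function on a single title; the exact output will depend on your copy of the dataset.

sample_title = data['title'][0]
print(sample_title) # the original title
print(clean_data(sample_title)) # the title reduced to words from our tokens list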

We build a titles list containing all the preprocessed titles and a labels list specifying whether each article is fake.

labels = []
titles = []
for index in range(len(data)):
    clean_txt = clean_data(data['title'][index]) # cleaning up the title
    titles.append(clean_txt.split()) # adding the preprocessed title to our list
    labels.append(data['label'][index]) # adding the label for that article

Printing the titles and labels of the first four articles.

print(titles[0],labels[0])
print(titles[1],labels[1])
print(titles[2],labels[2])
print(titles[3],labels[3])
First four article titles along with their labels

The next step is to encode each title numerically. The Tokenizer class in the Keras API can be used for this: we fit it on all the titles in our dataset, and it builds a vocabulary of all the tokens in the dataset and creates a consistent mapping from each token in the vocabulary to a unique integer.
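
As a small illustration (on made-up titles, not on our dataset), the mapping a fitted Tokenizer builds can be inspected through its word_index attribute.

from tensorflow.keras.preprocessing.text import Tokenizer

toy_titles = ['house democrats plan vote', 'senate rejects plan']
toy_tokenizer = Tokenizer()
toy_tokenizer.fit_on_texts(toy_titles)
print(toy_tokenizer.word_index) # e.g. {'plan': 1, 'house': 2, 'democrats': 3, ...}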

from tensorflow.keras.preprocessing.text import Tokenizer

def create_tokenizer(titles):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(titles)
    return tokenizer

Now that the mapping of words to integers has been prepared, we can use it to encode the titles in our dataset by calling the texts_to_matrix function on the Tokenizer.

# Create the tokenizer
tokenizer = create_tokenizer(titles)
X = tokenizer.texts_to_matrix(titles,mode='freq')

Checking the shape.

print(X.shape) # (18285, 1244)
y = np.array(labels)
print(y.shape) # (18285,)

Converting into NumPy arrays.

Xtrain, Ytrain = np.array(X), np.array(y)

Preparing Training and Test dataset

Splitting our dataset into training (75%) and test (25%) sets. Using the stratify keyword while splitting avoids class imbalance between the training and test sets. For example, if the labels are 25% zeros and 75% ones, stratify ensures that both splits preserve that 25/75 ratio.

from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(Xtrain,Ytrain,test_size=0.25,stratify=Ytrain)
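
To confirm that the stratified split preserved the label proportions, we can compare the class ratios across the full set and both splits (a quick check, not part of the original pipeline).

import numpy as np

# the three printed ratios should be roughly identical
print(np.bincount(Ytrain.astype(int)) / len(Ytrain))
print(np.bincount(Y_train.astype(int)) / len(Y_train))
print(np.bincount(Y_test.astype(int)) / len(Y_test))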

Printing the shape of our training and test dataset.

print(X_train.shape) # (13713, 1244)
print(X_test.shape) # (4572, 1244)
print(Y_train.shape) # (13713,)
print(Y_test.shape) # (4572,)

Model Creation

A multilayer perceptron (MLP) is a class of artificial neural network that consists of multiple layers of perceptrons. We use a tuned MLP model to predict whether a given news article is fake or not.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout

We will use four hidden layers with 50, 100, 150, and 200 neurons respectively, each with a rectified linear (ReLU) activation function. The output layer is a single neuron with a sigmoid activation function, predicting 0 for a fake and 1 for a real news article. The network is trained using the efficient Adam implementation of gradient descent and the binary cross-entropy loss function, which is suited to binary classification problems. We use dropout to reduce overfitting.

n_words = X_train.shape[1] # size of the input vectors produced by the tokenizer

model = Sequential()

model.add(Dense(50,input_shape=(n_words,),activation='relu'))
model.add(Dropout(0.2))

model.add(Dense(100,activation='relu'))
model.add(Dropout(0.3))

model.add(Dense(150,activation='relu'))
model.add(Dropout(0.4))

model.add(Dense(200,activation='relu'))
model.add(Dropout(0.5))

model.add(Dense(1,activation='sigmoid'))

model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])

model.summary()
Model summary
model.fit(X_train,Y_train,epochs=15,verbose=2)
Fitting our model for 15 epochs

Evaluating our model on the test dataset gives us around 92.6% accuracy.

model.evaluate(X_test,Y_test)
Evaluating our model on the test set

Confusion matrix

from sklearn.metrics import confusion_matrix
import seaborn as sns

y_pred = (model.predict(X_test) > 0.5).astype(int) # converting the sigmoid probabilities into 0/1 predictions
cm = confusion_matrix(Y_test, y_pred)

group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ['{0:0.0f}'.format(value) for value in cm.flatten()]
group_percentages = ['{0:.2%}'.format(value) for value in cm.flatten()/np.sum(cm)]
# using a separate name so we do not overwrite the labels list created earlier
annot_labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in zip(group_names, group_counts, group_percentages)]
annot_labels = np.asarray(annot_labels).reshape(2,2)
sns.heatmap(cm, annot=annot_labels, fmt='', cmap='Blues')

Comparing different word scoring models

The Tokenizer in the Keras API supports four different ways of scoring words (illustrated on a toy example after the list):

  1. Binary: words are marked based on whether they are present (1) or not (0).
  2. Count: the occurrence count of each word is recorded as an integer.
  3. TF-IDF: each word is scored based on its frequency within the title and inversely on how common it is across all titles.
  4. Freq: each word is scored based on its frequency of occurrence within the title.
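
A minimal sketch on made-up titles (not on our dataset) showing how the same documents are encoded under each mode; the numbers for our data will of course differ.

from tensorflow.keras.preprocessing.text import Tokenizer

toy_docs = ['trump wins election', 'clinton wins debate']
tk = Tokenizer()
tk.fit_on_texts(toy_docs)
for mode in ['binary', 'count', 'tfidf', 'freq']:
    print(mode)
    print(tk.texts_to_matrix(toy_docs, mode=mode))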

We will now compare all four word scoring methods using our model.

def create_tokenizer(train_docs, test_docs, mode):
    # fit the tokenizer on the training titles only, then encode both splits with the chosen mode
    tk = Tokenizer()
    tk.fit_on_texts(train_docs)
    Xtrain = tk.texts_to_matrix(train_docs, mode=mode)
    Xtest = tk.texts_to_matrix(test_docs, mode=mode)
    return Xtrain, Xtest

We will evaluate the model for all four types of encoding; because neural networks are stochastic, we estimate performance as the average of multiple runs (10 here).
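
Reusing a single, already-trained model object across runs and modes would carry its weights over and skew the comparison, so we add a small helper (not in the original article) that rebuilds the same architecture from scratch; evaluate_model below calls it once per run.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

def define_model(n_words):
    # same architecture as the model defined above, rebuilt from scratch for every run
    model = Sequential()
    model.add(Dense(50, input_shape=(n_words,), activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(100, activation='relu'))
    model.add(Dropout(0.3))
    model.add(Dense(150, activation='relu'))
    model.add(Dropout(0.4))
    model.add(Dense(200, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model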

def evaluate_model(Xtrain, ytrain, Xtest, ytest):
    scores = list()
    n_words = Xtrain.shape[1]
    for i in range(10):
        model = define_model(n_words) # fresh model for every run so the runs are independent
        model.fit(Xtrain, ytrain, epochs=15, verbose=2)
        loss, acc = model.evaluate(Xtest, ytest)
        scores.append(acc)
    return scores

Creating encoding for each word scoring type and evaluating our model.

import pandas as pd

# splitting the raw (un-encoded) titles so that each mode can re-encode them
train_docs, test_docs, ytrain, ytest = train_test_split(titles, np.array(labels), test_size=0.25, stratify=labels)

modes = ['binary','count','tfidf','freq']
results = pd.DataFrame()
for mode in modes:
    X_train, X_test = create_tokenizer(train_docs, test_docs, mode)
    results[mode] = evaluate_model(X_train, ytrain, X_test, ytest)

Printing our results.

print(results)
Accuracy of our model for each of the 10 runs

Getting the statistical summary of our results.

print(results.describe())
Statistical summary of our results

Plotting a box plot to visualize the accuracies obtained for each word scoring model.
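
The article shows the resulting figure; a minimal way to reproduce it from the results DataFrame (assuming matplotlib is installed) is:

import matplotlib.pyplot as plt

results.boxplot() # one box of run accuracies per word scoring mode
plt.ylabel('accuracy')
plt.show()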

Box Plot for our results

According to the visualization above, the freq word scoring method gives better accuracy on average than the other scoring methods.

Pre-trained embeddings such as word2vec or GloVe could be used to further improve the accuracy of our model.
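
As a rough sketch of that direction (not part of this article's pipeline), the titles could be encoded as padded integer sequences and passed through a Keras Embedding layer, whose weights can be initialised from pre-trained GloVe or word2vec vectors once an embedding matrix has been built.

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense

# integer-encode the preprocessed titles and pad them to a common length
seqs = tokenizer.texts_to_sequences(titles)
max_len = max(len(s) for s in seqs)
X_seq = pad_sequences(seqs, maxlen=max_len, padding='post')
vocab_size = len(tokenizer.word_index) + 1

emb_model = Sequential()
# weights=[embedding_matrix] could be passed here once pre-trained vectors are loaded
emb_model.add(Embedding(vocab_size, 100, input_length=max_len))
emb_model.add(Flatten())
emb_model.add(Dense(1, activation='sigmoid'))
emb_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])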

Resources

Machine Learning Mastery

Confusion matrix Visualization

Keras API
