Fake news classifier using Multilayer Perceptron model
Introduction
The reliability of information on digital media has become an issue affecting the social fabric of our society. Information that is biased, distorted, or amplified can spread on social media across millions of users, breaking geographical barriers, and cause real-world impact within a matter of minutes.
A study found that humans are able to detect fake news only about 75 percent of the time. Our objective is to build a model that classifies a news article as fake or real based on the title of the article.
Dataset
The dataset used is from Kaggle and can be found here. Our dataset has the following attributes:
- id: a unique ID for the news article
- title: the title of the news article
- author: the author of the news article
- text: the text of the article
- label: a label that marks the article as potentially fake or real (0: fake, 1: real)
The main aim is to predict whether a given news article is fake or real. In this article, we classify the news using the title rather than the full text of the article, in order to keep our corpus size small.
import pandas as pd

data = pd.read_csv('train.csv')
data.head(10)
Data Preprocessing
The total number of rows in our dataset is 20,800. Dropping null values and resetting the index leaves us with 18,285 rows.
len(data) #20800
data.dropna(inplace=True)
data.reset_index(inplace=True)
len(data) #18285
Our preprocessing involves six steps, which we apply to every title in the dataset (a quick example on a single title follows the setup code below).
- Split each title into a list of words.
- Remove all words that are punctuation, using string.punctuation.
- Remove all words containing characters other than [a-zA-Z], using the isalpha() method.
- Remove all stop words from the title.
- Make all words in the title lowercase.
- Keep a count of every word.
import string
import nltk

nltk.download('stopwords')  # download the stop word list if it is not already available
stop_words = nltk.corpus.stopwords.words('english')
corpus = []
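Before wrapping these steps into a function, here is a quick sketch of them applied to a single made-up title (the example string and its printed output are for illustration only, not from the dataset):
example = "Scientists reveal the truth about coffee, study says"  # hypothetical title
words = example.split()                                           # 1. split into words
words = [w for w in words if w not in string.punctuation]         # 2. drop punctuation tokens
words = [w for w in words if w.isalpha()]                         # 3. keep purely alphabetic words
words = [w for w in words if w not in stop_words]                 # 4. drop stop words
words = [w.lower() for w in words]                                # 5. lowercase
print(words)  # ['scientists', 'reveal', 'truth', 'study', 'says']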
Creating a method for cleaning our text.
def clean_text(data, vocab):
    for i in range(len(data)):
        txt = data['title'][i]
        txt = txt.split()
        txt = [word for word in txt if word not in string.punctuation]  # removing punctuation
        txt = [word for word in txt if word.isalpha()]  # removing words with non-letter characters
        txt = [word for word in txt if word not in stop_words]  # removing all the stop words
        txt = [word.lower() for word in txt]  # making all the words lowercase
        seq = ' '.join(txt)
        split_seq = seq.split()
        vocab.update(split_seq)  # keeping a count
        for word in split_seq:  # putting all the words in our corpus
            corpus.append(word)
    return corpus
Cleaning our data.
from collections import Counter
vocab = Counter()
clean_text(data, vocab)
Printing the most common 100 words.
print(vocab.most_common(100))
Removing all the words whose occurrence in our vocabulary is less than 20.
min_occurrence = 20
tokens = [word for word, count in vocab.items() if count >= min_occurrence]
After all the preprocessing we are left with 1,243 words in our tokens list.
print(len(tokens)) # 1243
print(tokens[:100])
We now preprocess the titles again, keeping only the words that are present in our tokens list.
def clean_data(txt):
    txt = txt.split()
    txt = [word for word in txt if word not in string.punctuation]
    txt = [word for word in txt if word.isalpha()]
    txt = [word for word in txt if word not in stop_words]
    txt = [word.lower() for word in txt]
    txt = [word for word in txt if word in tokens]
    seq = ' '.join(txt)
    return seq
We build a titles list with all the preprocessed titles and a labels list, which specifies whether each article is fake.
labels = []
titles = []
for index in range(len(data)):
    clean_txt = clean_data(data['title'][index])  # cleaning up the title
    titles.append(clean_txt.split())              # adding the preprocessed title to our list
    labels.append(data['label'][index])           # adding the label for that article
Printing the titles and labels of the first four articles.
print(titles[0],labels[0])
print(titles[1],labels[1])
print(titles[2],labels[2])
print(titles[3],labels[3])
The next step is to encode the titles numerically. The Tokenizer class in the Keras API can be used for this: we fit the tokenizer on all the titles in our dataset, and it builds a vocabulary of all the tokens and a consistent mapping from each token to a unique integer.
from tensorflow.keras.preprocessing.text import Tokenizer

def create_tokenizer(titles):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(titles)
    return tokenizer
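As a small illustration of this mapping, here is the tokenizer's word_index for two made-up titles (the exact integers shown are just an example; more frequent words get smaller indices):
toy_titles = [['economy', 'shows', 'growth'], ['growth', 'slows', 'again']]  # hypothetical titles
toy_tokenizer = create_tokenizer(toy_titles)
print(toy_tokenizer.word_index)
# e.g. {'growth': 1, 'economy': 2, 'shows': 3, 'slows': 4, 'again': 5}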
Now that the mapping of words to integers has been prepared, we can use it to encode the titles in our dataset as fixed-length vectors. We do this by calling the texts_to_matrix() function on the Tokenizer.
# Create the tokenizer
tokenizer = create_tokenizer(titles)
X = tokenizer.texts_to_matrix(titles,mode='freq')
Checking the shape.
print(X.shape) # (18285, 1244)
import numpy as np
y = np.array(labels)
print(y.shape) #(18285,)
Converting into NumPy arrays and naming them Xtrain and Ytrain for the split below.
Xtrain, Ytrain = np.array(X), np.array(y)
Preparing Training and Test dataset
Splitting our dataset into training (75%) and test (25%) sets. Using the stratify keyword while splitting avoids class imbalance between the training and test sets. For example, if the labels are 25% zeros and 75% ones, stratify=y makes sure that the random split also contains 25% 0's and 75% 1's.
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(Xtrain,Ytrain,test_size=0.25,stratify=Ytrain)
Printing the shape of our training and test dataset.
print(X_train.shape) #(13713, 1244)
print(X_test.shape) # (4572, 1244)
print(Y_train.shape) # (13713,)
print(Y_test.shape) # (4572,)
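As a quick sanity check, we can confirm that stratify kept the proportion of 1's roughly equal in both splits:
print(Y_train.mean())  # fraction of label-1 articles in the training set
print(Y_test.mean())   # should be roughly the same fraction in the test set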
Model Creation
A multilayer perceptron (MLP) is a class of artificial neural network consisting of multiple layers of perceptrons. We use a tuned MLP model to predict whether a given news article is fake or not.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
We will use four hidden layers with 50, 100, 150, and 200 neurons respectively, each with a rectified linear (ReLU) activation function. The output layer is a single neuron with a sigmoid activation function, predicting 0 for a fake and 1 for a real news article. The network is trained using the efficient Adam implementation of gradient descent and the binary cross-entropy loss function, which is suited to binary classification problems. We use dropout to reduce overfitting.
n_words = X_train.shape[1]  # number of input features (1244)
model = Sequential()
model.add(Dense(50,input_shape=(n_words,),activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(100,activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(150,activation='relu'))
model.add(Dropout(0.4))
model.add(Dense(200,activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1,activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()
model.fit(X_train,Y_train,epochs=15,verbose=2)
Evaluating our model on the test dataset gives us around 92.6% accuracy.
model.evaluate(X_test,Y_test)
Confusion matrix
from sklearn.metrics import confusion_matrix
import seaborn as sns

y_pred = (model.predict(X_test) > 0.5).astype('int32')  # turn sigmoid probabilities into 0/1 predictions
cm = confusion_matrix(Y_test, y_pred)
group_names = ['True Neg', 'False Pos', 'False Neg', 'True Pos']
group_counts = ['{0:0.0f}'.format(value) for value in cm.flatten()]
group_percentages = ['{0:.2%}'.format(value) for value in cm.flatten() / np.sum(cm)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in zip(group_names, group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2, 2)
sns.heatmap(cm, annot=labels, fmt='', cmap='Blues')
Comparing different word scoring models
The Tokenizer in the Keras API has 4 different ways of scoring words (illustrated on a toy example after the list):
- binary: words are marked based on whether they are present (1) or not (0).
- count: the occurrence count of each word is recorded as an integer.
- tfidf: each word is scored proportionally to its frequency within the title and inversely to how common it is across all the titles.
- freq: each word is scored based on its frequency of occurrence within the title.
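To see the difference between these modes concretely, here is a minimal sketch on a tiny made-up corpus of two already-tokenized documents; the printed matrices show how each mode scores the same words differently:
from tensorflow.keras.preprocessing.text import Tokenizer

toy_docs = [['markets', 'rise', 'today'], ['markets', 'markets', 'fall']]  # hypothetical documents
tk = Tokenizer()
tk.fit_on_texts(toy_docs)
for m in ['binary', 'count', 'tfidf', 'freq']:
    print(m)
    print(tk.texts_to_matrix(toy_docs, mode=m))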
We will use our model to evaluate all the word scoring models.
def create_tokenizer(train_titles, test_titles, mode):
    tk = Tokenizer()
    tk.fit_on_texts(train_titles)  # fit the tokenizer on the training titles only
    Xtrain = tk.texts_to_matrix(train_titles, mode=mode)
    Xtest = tk.texts_to_matrix(test_titles, mode=mode)
    return Xtrain, Xtest
We will evaluate our model for all 4 types of encoding; however, we estimate performance as the average over multiple runs (10), due to the stochastic nature of neural networks.
def evaluate_model(Xtrain, ytrain, Xtest, ytest):
    scores = list()
    for i in range(10):
        model.fit(Xtrain, ytrain, epochs=15, verbose=2)
        loss, acc = model.evaluate(Xtest, ytest)
        scores.append(acc)
    return scores
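Note that evaluate_model as written refits the single model defined earlier. Because the tokenizer is refit for each comparison below (so the input width can differ from 1,244) and because each run should ideally start from fresh weights, a safer variant rebuilds the network inside the loop. A minimal sketch reusing the same architecture (define_model is a helper introduced here for illustration, not part of the original code):
def define_model(n_words):
    # Same architecture as above, rebuilt from scratch for each evaluation run
    model = Sequential()
    model.add(Dense(50, input_shape=(n_words,), activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(100, activation='relu'))
    model.add(Dropout(0.3))
    model.add(Dense(150, activation='relu'))
    model.add(Dropout(0.4))
    model.add(Dense(200, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

def evaluate_model(Xtrain, ytrain, Xtest, ytest):
    scores = list()
    n_words = Xtrain.shape[1]
    for i in range(10):
        model = define_model(n_words)  # fresh weights every run
        model.fit(Xtrain, ytrain, epochs=15, verbose=2)
        loss, acc = model.evaluate(Xtest, ytest)
        scores.append(acc)
    return scores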
Creating encoding for each word scoring type and evaluating our model.
import pandas as pd
# Re-split the raw preprocessed titles and labels so each mode can be encoded from scratch
titles_train, titles_test, ytrain, ytest = train_test_split(titles, y, test_size=0.25, stratify=y)
modes = ['binary', 'count', 'tfidf', 'freq']
results = pd.DataFrame()
for mode in modes:
    X_train, X_test = create_tokenizer(titles_train, titles_test, mode)
    results[mode] = evaluate_model(X_train, ytrain, X_test, ytest)
Printing our results.
print(results)
Getting the statistical summary of our results.
print(results.describe())
Plotting a box plot to visualize the accuracies obtained for each word scoring model.
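A minimal sketch of the box plot, using the DataFrame's built-in plotting via matplotlib:
import matplotlib.pyplot as plt

results.boxplot()  # one box per word scoring mode
plt.ylabel('accuracy')
plt.show()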
According to the above visualization, the freq word scoring method gives better accuracy on average than the other scoring methods.
Pre-trained embeddings such as word2vec or GloVe could be used to further improve the accuracy of our model.