N-gram CNN Model for Sentiment Analysis
An n-gram CNN model uses multiple parallel convolutional neural networks that read the source text with different kernel sizes. It is an expanded version of a standalone model consisting of a word embedding layer and a one-dimensional convolutional neural network: the text is read with different n-gram sizes by a multichannel convolutional neural network. For this article, the Movie Review Polarity dataset is used and we classify each review as positive or negative.
Data Preparation
The text data is cleaned in the following way:
- Split tokens on white space.
- Remove all punctuation from words.
- Remove all words that are not purely comprised of alphabetical characters.
- Remove all words that are known stop words.
- Remove all words that have a length of 1 character or less.
import string
from os import listdir
from collections import Counter
from nltk.corpus import stopwords

# load a document into memory
def load_doc(filename):
    file = open(filename, 'r')
    text = file.read()
    file.close()
    return text

# turn a document into clean tokens
def clean_doc(doc):
    tokens = doc.split()
    # remove tokens that are pure punctuation
    text = [t for t in tokens if t not in string.punctuation]
    # keep only purely alphabetical tokens
    text = [t for t in text if t.isalpha()]
    # remove known stop words
    text = [t for t in text if t not in stopwords.words('english')]
    text = [t.lower() for t in text]
    # remove single-character tokens
    text = [t for t in text if len(t) > 1]
    return text
We apply the above cleaning function to all the reviews. The process_docs function below loops over all the reviews in a directory and cleans each one.
A Counter is used to define our vocabulary of known words. It is a dictionary mapping each word to its count. Each document is added to the counter using the add_doc_to_vocab function.
# load a document, clean it and add its tokens to the vocabulary
def add_doc_to_vocab(filename, vocab):
    doc = load_doc(filename)
    preprocessed_doc = clean_doc(doc)
    vocab.update(preprocessed_doc)

# walk through all reviews in a directory and add them to the vocabulary
def process_docs(directory, vocab):
    for filename in listdir(directory):
        # skip reviews held out for the test set
        if filename.startswith('cv9'):
            continue
        path = directory + '/' + filename
        add_doc_to_vocab(path, vocab)
We can save our vocabulary to a file so that we can load it later when encoding the reviews for modelling.
# save a list of tokens to file, one per line
def save_to_list(lines, filename):
    # convert lines to a single blob of text
    data = '\n'.join(lines)
    # open file
    file = open(filename, 'w')
    # write text
    file.write(data)
    # close file
    file.close()
Processing all the documents and adding the words to our vocabulary counter.
vocab = Counter()
# add all docs to vocab
process_docs('review_polarity/txt_sentoken/pos', vocab)
process_docs('review_polarity/txt_sentoken/neg', vocab)
print(len(vocab)) # 36037
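As a quick sanity check, we can inspect the most frequent tokens; the exact words and counts depend on the cleaning steps above.
# print the 10 most common words in the vocabulary
print(vocab.most_common(10))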
We can also trim our vocabulary by removing all low-occurrence words, i.e. words that appear only once across all reviews, which brings the vocabulary size down from 36,037 to 23,260 words.
# keep tokens with a minimum occurrence
min_occurence = 2
tokens = [k for k, c in vocab.items() if c >= min_occurence]
print(len(tokens)) # 23260
save_to_list(tokens, 'vocab1.txt')
We load the saved vocabulary and use it to filter out-of-vocabulary words from the reviews.
vocab_filename = 'vocab1.txt'
vocab = load_doc(vocab_filename)
vocab = set(vocab.split())
Next, we need to load and preprocess all the training movie reviews using an updated process_docs function.
def process_docs(directory, vocab, is_train):
    documents = list()
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip any reviews in the test set
        if is_train and filename.startswith('cv9'):
            continue
        if not is_train and not filename.startswith('cv9'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load the doc
        doc = load_doc(path)
        # clean the doc and keep only tokens that are in the vocabulary
        tokens = clean_doc(doc)
        tokens = [t for t in tokens if t in vocab]
        # add to list
        documents.append(tokens)
    return documents
The above function is called for both the negative and positive directories and the results are combined into a single train or test dataset. We also label the reviews (0 for a negative review and 1 for a positive review).
import numpy as np

def load_clean_dataset(vocab, is_train):
    # load documents
    neg = process_docs('review_polarity/txt_sentoken/neg', vocab, is_train)
    pos = process_docs('review_polarity/txt_sentoken/pos', vocab, is_train)
    docs = neg + pos
    # prepare labels
    labels = np.array([0 for _ in range(len(neg))] + [1 for _ in range(len(pos))])
    return docs, labels
Finally, we save the prepared training and test sets to file so they can be loaded later for modelling and evaluation.
from pickle import dump

# save a dataset to file
def save_dataset(dataset, filename):
    dump(dataset, open(filename, 'wb'))
    print('Saved: %s' % filename)
The code below cleans the text, creates the labels, and saves the training set to train.pkl and the test set to test.pkl.
train_docs, ytrain = load_clean_dataset(vocab, True)
test_docs, ytest = load_clean_dataset(vocab, False)
# save datasets
save_dataset([train_docs, ytrain], 'train.pkl')
save_dataset([test_docs, ytest], 'test.pkl')
Multi-Channel Model Creation
The first step is to encode the cleaned training dataset. The function below is used to load the pickled training and testing datasets.
from pickle import load

def load_dataset(filename):
    return load(open(filename, 'rb'))

trainLines, trainLabels = load_dataset('train.pkl')
Printing the first document in our cleaned training set.
print(trainLines[0])
Keras Tokenizer is used to define the vocabulary for the Embedding layer and encode the documents as integers.
from tensorflow.keras.preprocessing.text import Tokenizer

def create_tokenizer(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer
The maximum length of the input sequences is needed both as an input to the model and to pad all sequences to a fixed length. The code below calculates the maximum number of words across all reviews in the training set.
max_length = max([len(t) for t in trainLines])
The size of the vocabulary is also calculated, as it is an input to the Embedding layer. We first fit the tokenizer on the training documents.
tokenizer = create_tokenizer(trainLines)
vocab_size = len(tokenizer.word_index) + 1
The movie review text is integer encoded and padded using the encode_text function below.
from tensorflow.keras.preprocessing.sequence import pad_sequences

# integer encode and pad the documents to max_length
def encode_text(tokenizer, max_length, docs):
    tokens = tokenizer.texts_to_sequences(docs)
    padded_seq = pad_sequences(tokens, maxlen=max_length, padding='post')
    return padded_seq
Encoding our training set (the test set is encoded in the same way after it is loaded in the evaluation section).
Xtrain = encode_text(tokenizer, max_length, trainLines)
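As a quick check, the encoded training data should have one row per training review and max_length columns.
print(Xtrain.shape)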
Model Definition
A multi-channel convolutional neural network for document classification uses several variations of the standard model with different kernel sizes. The document is therefore processed with different n-gram sizes at a time, which allows the model to learn how best to integrate these interpretations. Although we could experiment with different layer configurations, here we focus only on the use of different kernel sizes.
The Keras functional API is used to define our multiple-input model. The model has 3 input channels for processing 4-grams, 6-grams and 8-grams of the movie review text. Each channel is comprised of:
- Input layer that defines the length of input sequences.
- Embedding layer set to the size of the vocabulary and 100-dimensional real-valued representations.
- Conv1D layer with 32 filters and a kernel size set to the number of words to read at once.
- Dropout layer to regularize the convolutional output.
- MaxPooling1D layer to consolidate the output from the convolutional layer.
- Flatten layer to reduce the three-dimensional output to two dimensions for concatenation.
The outputs from the 3 channels are concatenated into a single vector and processed by a dense layer followed by an output layer.
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, Conv1D, Dropout, MaxPooling1D, Flatten, Dense, concatenate

def build_model(vocab_size, length):
    # channel 1: 4-grams
    inputs1 = Input(shape=(length,))
    embedding1 = Embedding(vocab_size, 100)(inputs1)
    conv1 = Conv1D(filters=32, kernel_size=4, activation='relu')(embedding1)
    drop1 = Dropout(0.4)(conv1)
    maxpool1 = MaxPooling1D(pool_size=2)(drop1)
    flat1 = Flatten()(maxpool1)
    # channel 2: 6-grams
    inputs2 = Input(shape=(length,))
    embedding2 = Embedding(vocab_size, 100)(inputs2)
    conv2 = Conv1D(filters=32, kernel_size=6, activation='relu')(embedding2)
    drop2 = Dropout(0.5)(conv2)
    maxpool2 = MaxPooling1D(pool_size=2)(drop2)
    flat2 = Flatten()(maxpool2)
    # channel 3: 8-grams
    inputs3 = Input(shape=(length,))
    embedding3 = Embedding(vocab_size, 100)(inputs3)
    conv3 = Conv1D(filters=32, kernel_size=8, activation='relu')(embedding3)
    drop3 = Dropout(0.5)(conv3)
    maxpool3 = MaxPooling1D(pool_size=2)(drop3)
    flat3 = Flatten()(maxpool3)
    # merge the three channels
    merged = concatenate([flat1, flat2, flat3])
    # dense and output layers
    dense1 = Dense(10, activation='relu')(merged)
    outputs = Dense(1, activation='sigmoid')(dense1)
    model = Model(inputs=[inputs1, inputs2, inputs3], outputs=outputs)
    # compile
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    return model
Building our model.
model = build_model(vocab_size,max_length)
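Optionally, we can visualise the three parallel channels with Keras' plot_model utility; this assumes pydot and graphviz are installed, and the output file name is just an example.
from tensorflow.keras.utils import plot_model

# write a diagram of the multi-channel architecture to disk
plot_model(model, show_shapes=True, to_file='multichannel.png')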
Fitting our model on the training set for 7 epochs with a batch size of 16.
model.fit([Xtrain,Xtrain,Xtrain],trainLabels,epochs=7,batch_size=16)
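If preparation, training and evaluation are run as separate scripts, the fitted model can be persisted and reloaded later; the file name model.h5 here is an arbitrary choice.
# save the fitted model so it can be reloaded for evaluation
model.save('model.h5')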
Model Evaluation
We can evaluate the model by making predictions on the test dataset.
testLines, testLabels = load_dataset('test.pkl')
print(testLines[0])
Xtest = encode_text(tokenizer, max_length, testLines)
Evaluating the model on the training and testing datasets.
# evaluate model on training dataset
_, acc = model.evaluate([Xtrain,Xtrain,Xtrain], trainLabels, verbose=0)
print('Train Accuracy: %.2f' % (acc*100))
# evaluate model on test dataset
_, acc = model.evaluate([Xtest, Xtest, Xtest], testLabels, verbose=0)
print('Test Accuracy: %.2f' % (acc*100))
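As a final illustration, here is a minimal sketch of scoring a new, unseen review with the fitted model, assuming the clean_doc function from the data preparation step, the vocabulary set, the tokenizer and max_length are all available in this script; the example review text is made up.
# a sketch: classify a single new review with the fitted multi-channel model
def predict_sentiment(review, vocab, tokenizer, max_length, model):
    # apply the same cleaning and vocabulary filtering as the training data
    tokens = clean_doc(review)
    tokens = [t for t in tokens if t in vocab]
    # integer encode and pad exactly as during training
    encoded = encode_text(tokenizer, max_length, [tokens])
    # the model expects the same input on all three channels
    score = model.predict([encoded, encoded, encoded], verbose=0)[0, 0]
    return ('positive' if score >= 0.5 else 'negative'), score

# hypothetical example review
print(predict_sentiment('a wonderful film with brilliant acting', vocab, tokenizer, max_length, model))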
Further Improvements
- Different n-grams. The kernel sizes used by the channels can be changed to see their impact on model accuracy.
- More or Fewer Channels. Our model uses 3 channels; this number can be increased or decreased to check the impact on model skill.
- Deeper Network. Convolutional neural networks perform better in computer vision when they are deeper; the same principle can be explored for text classification.
- Use GloVe Embedding. Explore loading a pre-trained GloVe embedding and its impact on model accuracy; a rough sketch follows this list.
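For the last point, a minimal sketch of building an embedding matrix from pre-trained GloVe vectors; the file name glove.6B.100d.txt and the frozen-weights choice are assumptions, not part of the original code.
import numpy as np

# a sketch: map each word in the tokenizer's vocabulary to its GloVe vector
def load_glove_embedding(path, tokenizer, vocab_size, dim=100):
    index = dict()
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.split()
            index[parts[0]] = np.asarray(parts[1:], dtype='float32')
    # words missing from GloVe keep all-zero rows
    matrix = np.zeros((vocab_size, dim))
    for word, i in tokenizer.word_index.items():
        if word in index:
            matrix[i] = index[word]
    return matrix

embedding_matrix = load_glove_embedding('glove.6B.100d.txt', tokenizer, vocab_size)
# each Embedding layer in build_model could then be initialised with these weights, e.g.
# Embedding(vocab_size, 100, weights=[embedding_matrix], trainable=False)(inputs1)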
Resources