Word-Based Neural Language Modeling using GloVe+LSTM
Neural language modelling is the use of neural networks for language modelling. Initially, feedforward neural networks were used; however, the Long Short-Term Memory (LSTM) network has become popular because it allows the model to learn relevant context over much longer input sequences than simpler neural networks can.
Word-based language modelling is the process of learning a statistical model from an input sequence of words so that it can predict the next word in the sequence. The length of the input sequence can vary depending on the desired model accuracy. Our dataset was collected by scraping the Yahoo News website and contains around 335 articles on the topic of Brexit. We begin by reading the CSV file using the pandas read_csv function.
import pandas as pd

# Load the scraped news articles
data = pd.read_csv('/content/drive/MyDrive/newsArticle.csv')
data.head()
There are different ways to frame the sequences, such as One-Word-In, One-Word-Out, wherein one word is provided as input and the model predicts the next word in the sequence. Another way is to split the articles line by line and treat each line as a sequence. In this article we create Five-Word-In, One-Word-Out sequences, as illustrated in the sketch below.
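As a quick illustration (toy tokens, not from the dataset), a Five-Word-In, One-Word-Out scheme slides a six-word window over the token stream; the first five words form the input and the sixth is the target:
# Toy illustration of the Five-Word-In, One-Word-Out framing (hypothetical tokens)
tokens = ['uk', 'votes', 'to', 'leave', 'the', 'european', 'union', 'in', 'referendum']
length = 5 + 1 # five input words plus one output word
for i in range(length, len(tokens)):
    window = tokens[i-length:i]
    print(window[:-1], '->', window[-1])
# ['uk', 'votes', 'to', 'leave', 'the'] -> european
# ['votes', 'to', 'leave', 'the', 'european'] -> union
# ['to', 'leave', 'the', 'european', 'union'] -> in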
Data Pre-Processing
We clean the description of each article as follows:
- Split tokens on white space.
- Remove tokens that are purely punctuation.
- Remove all words that are not purely comprised of alphabetical characters.
- Convert all words to lowercase.
import string

def clean_txt(article):
    # Split on white space
    tokens = article.split()
    # Drop tokens that are purely punctuation
    tokens = [t for t in tokens if t not in string.punctuation]
    # Keep only purely alphabetic words
    tokens = [t for t in tokens if t.isalpha()]
    # Lowercase everything
    tokens = [t.lower() for t in tokens]
    return tokens
We loop over all the articles in our dataset and build a corpus of processed words.
articles = []
description = data['Description']

for i in range(len(description)):
    art = clean_txt(description[i])
    for word in art:
        articles.append(word)

print(articles)
Next we print the number of tokens/words in our corpus along with the number of unique words. There are 11503 tokens in total, of which only 1918 are unique.
print('Total Tokens', len(articles))        # Total Tokens 11503
print('Unique Tokens', len(set(articles)))  # Unique Tokens 1918
We now create the sequences used for model training, using 5 words in each sequence to predict 1 word. This gives us 11497 sequences.
length = 5 + 1
seq = []

for i in range(length, len(articles)):
    sequence = articles[i-length:i]
    lines = ' '.join(sequence)
    seq.append(lines)

print('Total Sequences', len(seq)) # Total Sequences 11497
The sequences are then encoded as integers, wherein each word in each sequence is assigned a unique integer. The Tokenizer class of Keras can be used to convert the sequences of words to sequences of integers. The first step is to fit the tokenizer on our sequences using the fit_on_texts function, which develops a mapping from words to integers; we then convert the sequences of text to sequences of integers using the texts_to_sequences function.
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(seq)
tokens = tokenizer.texts_to_sequences(seq)
The Embedding layer in our model requires the size of the vocabulary in our corpus, and the same size is needed for one hot encoding the output words. The vocabulary size can be found using the word_index attribute. We add 1 to it because the largest encoded word must be usable as an array index.
vocab_size = len(tokenizer.word_index) + 1
print(vocab_size) # 1919
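To see why the +1 is needed, here is a small toy example (hypothetical sentences, not from our dataset): the Tokenizer assigns word indices starting at 1, so an array indexed by these integers needs len(word_index) + 1 rows.
# Toy illustration (hypothetical sentences): word indices start at 1, not 0
toy = Tokenizer()
toy.fit_on_texts(['brexit talks resume', 'brexit deal talks'])
print(toy.word_index)          # e.g. {'brexit': 1, 'talks': 2, 'resume': 3, 'deal': 4}
print(len(toy.word_index) + 1) # 5 rows needed, covering indices 0 to 4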
The sequences need to be split into input and output elements using array slicing: out of the 6 elements in each sequence, the first 5 are the input and the 6th word is the output. We first convert the list of integer sequences to a NumPy array so it can be sliced.
import numpy as np

sequences = np.array(tokens)
X, y = sequences[:, :-1], sequences[:, -1]
Our model will predict a probability distribution over all the words in our vocabulary. Hence we need to convert the output element from a single integer into a one hot encoded vector, which has a 0 for every word in the vocabulary and a 1 for the actual word. The Keras to_categorical function can be used for this.
from tensorflow.keras.utils import to_categorical

y = to_categorical(y, num_classes=vocab_size)
Printing the shape of our input gives (11497, 5), i.e. we have 11497 sequences of 5 words each. Our output has shape (11497, 1919): 11497 output words, each one hot encoded over the 1919-word vocabulary.
print(X.shape) # (11497, 5)
print(y.shape) # (11497, 1919)
GloVe Embedding
We load the entire GloVe word embedding file into memory as a dictionary of word to embedding array.
embeddings_index = dict()
f = open('glove.6B.200d.txt', mode='rt', encoding='utf-8')

for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

print('Loaded word vectors', len(embeddings_index)) # Loaded word vectors 400000
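As a quick sanity check (assuming glove.6B.200d.txt was read correctly), each loaded vector should be 200-dimensional:
# Every vector in the 6B.200d file should have 200 dimensions
print(embeddings_index['the'].shape) # (200,)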
Next, we create a matrix with one embedding for each word in the training dataset by looping over all the unique words and locating their embedding weight vectors in the loaded GloVe embedding. Hence we get a matrix of weights only for the words in our training set; words without a GloVe vector keep a zero row.
# One row per word index (0..vocab_size-1), 200 dimensions per GloVe vector
embedding_matrix = np.zeros((vocab_size, 200))

for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
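Optionally, as a quick check (not in the original code), you can see how much of the training vocabulary is actually covered by GloVe vectors:
# Optional check: how many vocabulary words were found in the GloVe index
covered = sum(1 for w in tokenizer.word_index if w in embeddings_index)
print(covered, 'of', len(tokenizer.word_index), 'words have GloVe vectors')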
Model Creation
Our model uses an Embedding layer as the input layer. For each word in the vocabulary, the embedding layer holds one real valued vector. We initialise its weights from our GloVe embedding matrix and set the layer to be non-trainable. The model has 2 hidden LSTM layers with 100 units each. The output layer has one neuron for each word in our vocabulary and uses a softmax activation function. Since we are modelling a multiclass classification problem (predicting a word in the vocabulary), we use the categorical cross entropy loss function, together with the efficient Adam implementation of gradient descent.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

def build_model(vocab_size, seq_length):
    model = Sequential()
    # GloVe-initialised, frozen embedding layer
    model.add(Embedding(vocab_size, 200, weights=[embedding_matrix], trainable=False, input_length=seq_length))
    model.add(LSTM(100, return_sequences=True))
    model.add(LSTM(100))
    model.add(Dense(100, activation='relu'))
    model.add(Dense(vocab_size, activation='softmax'))
    # compile network
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    # summarize defined model
    model.summary()
    return model
Building the model and getting the summary.
seq_length = X.shape[1] # 5 input words per sequence
model = build_model(vocab_size, seq_length)
The model is fit for 20 training epochs.
model.fit(X,y,batch_size=8,epochs=20)
Plotting our model.
from tensorflow.keras.utils import plot_model

plot_model(model, to_file='LanguageModellingGraph.png')
Generating text
The generate_seq method takes the model, the tokenizer, the sequence length, the input text and the number of words to predict. The input text must be encoded with the same tokenizer used above. We take the argmax of the model's predicted probability distribution to get the integer index of the predicted word (older Keras versions exposed this directly as predict_classes). Using the mapping in the tokenizer we look up the corresponding word and append it to our input.
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_seq(model, tokenizer, seq_length, seed_txt, n_words):
    in_txt = seed_txt
    result = list()
    for _ in range(n_words):
        # Encode the current text and keep only the last seq_length words
        encoded = tokenizer.texts_to_sequences([in_txt])[0]
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # Index of the most probable next word
        yhat = np.argmax(model.predict(encoded), axis=-1)[0]
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        in_txt += ' ' + out_word
        result.append(out_word)
    return ' '.join(result), in_txt
Let's get a random sequence to use as an input to our model.
from random import randint

seed_text = seq[randint(0, len(seq) - 1)]
print(seed_text)
We have 6 words in our sequence, as expected. Let's use the input sequence “the scottish election undermine demand for” to generate another 50 words.
generated, in_txt = generate_seq(model, tokenizer, seq_length, seed_text, 50)
print(in_txt)
Our model seems to keep the context for a certain number of words. Let's try another input sequence.
seed_text = seq[randint(0, len(seq) - 1)]
print(seed_text)
Generating 50 more words from the input sequence.
generated, in_txt = generate_seq(model, tokenizer, seq_length, seed_text, 50)
print(in_txt)
Our model seems to generate meaningful words and keeps the context for a certain number of words. Further adjustments to the input/output sequence lengths, along with improvements to the model itself, could yield better contextual output.