SENTENCE CORRECTION USING DEEP-LEARNING
Introduction
The name of the project is “Sentence Correction using RNNs”. The work in this project is a pre-processing method that transforms text data to conform more closely to the distribution of standard English, in order to improve the performance of state-of-the-art NLP models. Unprocessed data from the wild may contain random corruptions that are not ideal inputs for NLP models; such inputs come from a corrupted language domain that is a superset of the target language domain. The model processes the corrupted language and translates it into the target language, with the goal of preserving properties of the text such as sentiment, named entities, etc. The new, corrected text keeps all the properties of the old text, but in standard English.
Example: corrupted language: “Lea u there?”
Standard English: “Lea are you there?”
Objective
The objective here is to build a sentence correction model using RNNs while preserving sentiment and named entities.
Business constraints
1. There are no very strict latency constraints, but the model has to convert the corrupted text to standard English within 1–2 seconds
2. Interpretability is not very important
3. Misclassification cost can be high
Metrics
1. BLEU score will be used as the metric to evaluate the predicted text
2. Log loss will be used as the loss while training the model
Motivation
In the 21st century data is the new oil, and one of the most important forms of that data is text. Text is one of the most widely used forms of communication on the internet and elsewhere; apps like WhatsApp and Instagram are used by almost everyone to text each other. However, many people do not use standard English, preferring shortcuts for sentences or short forms of words in order to text faster. For example, the sentence “How are you” in standard English becomes “hw r u” in the corrupted language. There are humongous amounts of data in this form, and state-of-the-art models trained on standard English will not perform at their best on such data.
Data
A detailed review of the related literature led to an English corpus of 2,000 texts from the National University of Singapore. From my review, it appears to be the only publicly available normalized corpus of texts.
As part of preprocessing, it was noticed that the 99.9th percentile of sentence length is under 200, while the maximum length is 220.
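A minimal sketch of how such a length check can be done (whether length is measured in characters or tokens is an assumption here, and `texts` is a tiny stand-in for the loaded corpus):

```python
import numpy as np

# texts: list of raw corrupted sentences from the corpus (stand-in sample shown here)
texts = ["Lea u there?", "hw r u", "c u tmr at d mrt stn"]

lengths = [len(t) for t in texts]  # length in characters; token counts work the same way
print("99.9th percentile:", np.percentile(lengths, 99.9))
print("maximum length:", max(lengths))
```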
Baseline model
For the baseline model, I implemented a word-level unigram attention model with log loss as the loss function, using LSTMs to learn from the data. Since the number of data points was small and many words never repeated, learning an embedding weight for each word was not a good option. Instead, FastText, a character-n-gram-based embedding, was used; it was chosen over the alternatives because its character-based nature keeps misspelled words close to the correctly spelled ones.
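The baseline can be wired up roughly as in the sketch below, assuming a Keras implementation; the vocabulary size, dimensions and the random `embedding_matrix` stand-in (which would be built from FastText word vectors in practice) are all assumptions, not the exact code used:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE, EMB_DIM, HIDDEN, MAX_LEN = 6000, 300, 128, 200
embedding_matrix = np.random.rand(VOCAB_SIZE, EMB_DIM)  # stand-in; built from FastText vectors in practice

# Encoder: FastText-initialised word embeddings feeding an LSTM.
enc_in = layers.Input(shape=(MAX_LEN,))
enc_emb = layers.Embedding(VOCAB_SIZE, EMB_DIM,
                           embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
                           trainable=False)(enc_in)
enc_out, h, c = layers.LSTM(HIDDEN, return_sequences=True, return_state=True)(enc_emb)

# Decoder: teacher-forced target words, LSTM initialised with the encoder state.
dec_in = layers.Input(shape=(MAX_LEN,))
dec_emb = layers.Embedding(VOCAB_SIZE, EMB_DIM)(dec_in)
dec_out = layers.LSTM(HIDDEN, return_sequences=True)(dec_emb, initial_state=[h, c])

# Dot-product attention over the encoder outputs, then a softmax over the vocabulary.
context = layers.Attention()([dec_out, enc_out])
probs = layers.Dense(VOCAB_SIZE, activation="softmax")(layers.Concatenate()([dec_out, context]))

model = Model([enc_in, dec_in], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")  # log loss
```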
We can observe from img-1 that even words with spelling mistakes are close to the original word.
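As a rough illustration of that property, the sketch below trains gensim's FastText on a toy token list (the corpus, hyperparameters and the misspelling are all assumptions); because the vectors are built from character n-grams, even an unseen misspelling typically lands near its correct spelling:

```python
from gensim.models import FastText

# tokenised_corpus: token lists from the training texts (tiny stand-in sample shown here)
tokenised_corpus = [["are", "you", "there"], ["how", "are", "you"], ["see", "you", "there", "tomorrow"]]

model = FastText(sentences=tokenised_corpus, vector_size=50, window=3, min_count=1, epochs=50)

# "theree" never appears in the corpus, yet its character n-grams pull it towards "there".
print(model.wv.most_similar("theree", topn=3))
```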
Code snippet for the predict function.
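The original snippet is not reproduced here; below is a minimal greedy-decoding sketch of what such a predict function can look like. The split inference models (`encoder_model`, `decoder_model`), the tokenizer and the `<start>`/`<end>` tokens are assumptions about how the trained seq2seq is exposed:

```python
import numpy as np

def predict(input_text, encoder_model, decoder_model, tokenizer, max_len=200):
    """Greedy decoding: encode the corrupted sentence, then emit one word at a
    time until <end> is produced or max_len is reached."""
    seq = np.array(tokenizer.texts_to_sequences([input_text]))  # padding omitted for brevity
    enc_out, h, c = encoder_model.predict(seq)

    target = np.array([[tokenizer.word_index["<start>"]]])
    decoded = []
    for _ in range(max_len):
        probs, h, c = decoder_model.predict([target, enc_out, h, c])
        word_id = int(np.argmax(probs[0, -1, :]))
        word = tokenizer.index_word.get(word_id, "")
        if word == "<end>":
            break
        decoded.append(word)
        target = np.array([[word_id]])  # feed the predicted word back in
    return " ".join(decoded)
```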
Code snippet for the BLEU score.
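Again the original snippet is not reproduced; a minimal sketch using NLTK's corpus-level BLEU (the toy sentences are placeholders) could look like this:

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Each prediction is scored against a list of reference sentences (one per prediction here).
references = [[["lea", "are", "you", "there"]]]
predictions = [["lea", "you", "there"]]

smooth = SmoothingFunction().method1  # avoids zero scores on short sentences
print("BLEU:", corpus_bleu(references, predictions, smoothing_function=smooth))
```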
The BLEU score for the baseline model was 0.1361.
Final model
The final model has a few changes compared to the baseline model. The first and major change is the use of a 1D-CNN to learn spatial patterns of the words. Each kernel is designed to look at a word and its surrounding word(s), and output a value that captures information about the sentence. In this way, the convolution operation extracts features and patterns in sequential word groupings that indicate the sentiment of the text.
The next change was to replace word embeddings with character embeddings for the encoder input; the decoder input remained the same, with word embeddings.
The next change was to use GRUs with 256 hidden units instead of LSTMs.
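Putting these changes together, the encoder side of the final model might look roughly like the sketch below (sizes and layer parameters are assumptions; the decoder keeps its word embeddings as in the baseline):

```python
from tensorflow.keras import layers

NUM_CHARS, CHAR_EMB, HIDDEN, MAX_CHARS = 80, 64, 256, 220

enc_in = layers.Input(shape=(MAX_CHARS,))
# Character-level embedding of the corrupted input.
x = layers.Embedding(NUM_CHARS, CHAR_EMB)(enc_in)
# 1D convolution slides over short windows and extracts local patterns.
x = layers.Conv1D(filters=128, kernel_size=5, padding="same", activation="relu")(x)
# GRU with 256 hidden units replaces the baseline LSTM.
enc_out, enc_state = layers.GRU(HIDDEN, return_sequences=True, return_state=True)(x)
```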
Apart from the changes to the model itself, a beam search algorithm was implemented for predicting the output sentence, to further improve the results.
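A generic sketch of beam search is shown below. Here `step_fn` is a hypothetical stand-in for one decoder step (it takes the token ids produced so far and returns log-probabilities over the vocabulary); length normalisation and other refinements are omitted:

```python
import numpy as np

def beam_search(step_fn, start_id, end_id, beam_width=3, max_len=200):
    """Keep the `beam_width` best partial sentences at every step instead of
    greedily committing to the single most likely word."""
    beams = [([start_id], 0.0)]  # (token ids so far, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            log_probs = step_fn(seq)
            for tok in np.argsort(log_probs)[-beam_width:]:  # top continuations of this beam
                candidates.append((seq + [int(tok)], score + float(log_probs[tok])))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            (finished if seq[-1] == end_id else beams).append((seq, score))
        if not beams:
            break
    best = max(finished + beams, key=lambda c: c[1])
    return best[0]
```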
This model significantly improved on the baseline, reaching a BLEU score of 0.32.
Another change, made to improve the results further without touching the existing model, was to take every word that appears in the test data but not in the training data, find the closest word to it using the enchant dictionary, and place that word in the final predicted sentence at the same position where it was found in the input sentence. This change improved the model further, to a BLEU score of 0.3885.
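A minimal sketch of that dictionary lookup using pyenchant (taking the first suggestion and the `train_vocab` set are assumptions; the exact rule used in the project to pick the closest word is not shown here):

```python
import enchant

spell = enchant.Dict("en_US")

def closest_known_word(word, train_vocab):
    """For a word never seen during training, fall back to enchant's closest suggestion."""
    if word in train_vocab:
        return word
    suggestions = spell.suggest(word)
    return suggestions[0] if suggestions else word

# Example: an out-of-vocabulary misspelling is mapped to a dictionary word.
print(closest_known_word("tomorow", train_vocab={"see", "you", "there"}))  # typically "tomorrow"
```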
Error Analysis
To understand why the model performed the way it did, I carried out an error analysis of how it behaves in different situations.
Features
1. Bleu 0–5: BLEU score over the first 5 words
2. Bleu -5: BLEU score over the last 5 words
3. Bleu_rem_4: BLEU score after removing words that occur more than 4 times in a sentence
4. %known: percentage of the input words that are present in the training data (a rough sketch of these features is shown after this list)
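A minimal sketch of how these features can be computed, assuming NLTK for the BLEU pieces; the exact definitions, in particular whether %known is measured over the input or the reference tokens, are my interpretation:

```python
from collections import Counter
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1
bleu = lambda ref, hyp: sentence_bleu([ref], hyp, smoothing_function=smooth)

def error_features(reference, prediction, train_vocab):
    """`reference` and `prediction` are token lists; `train_vocab` is the set of training words."""
    bleu_0_5 = bleu(reference[:5], prediction[:5])        # first 5 words only
    bleu_last_5 = bleu(reference[-5:], prediction[-5:])   # last 5 words only

    counts = Counter(prediction)
    trimmed = [w for w in prediction if counts[w] <= 4]   # drop words repeated more than 4 times
    bleu_rem_4 = bleu(reference, trimmed)

    pct_known = 100.0 * sum(w in train_vocab for w in reference) / max(len(reference), 1)
    return bleu_0_5, bleu_last_5, bleu_rem_4, pct_known
```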
In img-3 we can see that the BLEU scores are much higher when we only take the first 5 words; this is because the model starts predicting the same words in an infinite loop after a certain point.
There were many predictions where the model went into such an infinite loop. This problem could be addressed by training on more data, but since we are limited to just 2k points, that was not possible.
In image A we can observe many more sentences with unknown words than in image B, which clearly indicates that the BLEU score drops when unknown words are present.
Legend (img 5): orange points = Bleu 0–5; blue points = overall BLEU.
In img 5 we can observe that most of the orange points lie above the blue points, which suggests the model works better on short sentences, and that it might achieve better results if it did not fall into an infinite loop while predicting words.
Deployment:
The model was deployed on my local box.
Future work:
As future work, I would like to use much more data to improve the results.
If the results seem promising, I will deploy the model online.
Conclusion: This is my second case study as part of a course, and this case study is strictly restricted to deep learning. As part of the case study I learned how to read and implement deep learning research papers, and how to build custom classes from diagrams of model architectures. I learned different ways to improve a model, and also used non-machine-learning methods to improve the predicted results.
This concludes my work. Thank you for reading!
References:
https://cs224d.stanford.edu/reports/Lewis.pdf
https://arxiv.org/pdf/1709.06429.pdf
Appliedaicourse.com
LinkedIn profile: https://www.linkedin.com/in/alluri-jairam-23a624172/
GitHub profile: https://github.com/allurijairam