Abstract
Over the last decade, the popularity of social media platforms such as Facebook, Twitter and other online social networking websites has grown rapidly. Apart from offering a fast, easy and reliable way to communicate with friends and family, they have also become, for many people, the primary source of information on current events around the world. With the rise of these networks, it has become extremely easy for anyone to share text, images and videos with other people. They have also proven to be a boon for creative people, who can share their work with a large audience instantly at the press of a button. The advantages of these online social networks are manifold. Unfortunately, some characteristics of user-generated content (UGC) pose problems for natural language processing of the data that users share on these websites. UGC typically contains non-standard spellings, unusual abbreviations, informal grammatical structure and variations in vocabulary. These issues are not easily mitigated, and they render current systems ineffective at processing such information.
In recent years, a phenomenon called “code-mixing”, which is frequently observed in online social media, has attracted considerable research interest from sociolinguists. A sentence is said to be code-mixed if it consists of linguistic units, such as phrases, words or morphemes, from multiple languages. Code-mixing is often observed in the utterances of multilingual speakers, and is prevalent in countries where people speak multiple languages with native proficiency, such as India.
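The definition above can be illustrated with a minimal sketch that works at the word level only (it does not cover mixing inside a word, at the morpheme level). It assumes each token already carries a language tag; the tag names ("en", "hi", "univ") and the function name are hypothetical choices for illustration, not part of the thesis.

```python
from typing import Iterable, List, Tuple

def is_code_mixed(tagged_tokens: List[Tuple[str, str]],
                  ignore: Iterable[str] = ("univ",)) -> bool:
    """Return True if the tokens come from more than one language.

    tagged_tokens: (token, language_tag) pairs. The tags are assumed to
    come from some upstream language identifier (hypothetical here).
    ignore: tags that carry no language signal, e.g. punctuation or emoji.
    """
    skipped = set(ignore)
    languages = {lang for _, lang in tagged_tokens if lang not in skipped}
    return len(languages) > 1

# Romanized Hindi mixed with English, tagged by hand for illustration.
mixed = [("yeh", "hi"), ("movie", "en"), ("bahut", "hi"),
         ("achhi", "hi"), ("thi", "hi"), ("!", "univ")]
monolingual = [("this", "en"), ("movie", "en"), ("was", "en"),
               ("great", "en"), ("!", "univ")]
```

Under these assumptions, `is_code_mixed(mixed)` is true and `is_code_mixed(monolingual)` is false.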
The most prominent effect of code-mixing on everyday life can be seen in UGC on online social networks. Due to the informal nature of UGC, code-mixing is bound to occur, especially in text written by multilingual writers. Since this hybrid code-mixed language does not have a formally defined grammar, processing and analyzing it is a challenging task.
In this thesis, we attempt to understand the phenomenon of code-mixing more deeply by analyzing text from online social media platforms. Deep learning approaches to NLP require an abundance of appropriate training data to work on code-mixed text. Such resources are currently lacking, so we begin this work by creating a parallel corpus of 6096 pairs that map English-Hindi code-mixed text to monolingual English, which could aid deep learning methods in processing code-mixed text. We then develop a pipeline that can augment existing Machine Translation (MT) systems so that they support translating code-mixed languages into monolingual ones. We present the results of experiments in which this pipeline augmented existing statistical and neural network based machine translation systems such as Google Translate, Bing and Moses. The pipeline was also used to augment sentiment classification of code-mixed sentences by existing systems: code-mixed English-Hindi sentences are translated to English using the augmented MT systems and then passed to an available sentiment analysis system for English.
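The translate-then-classify idea can be sketched as a simple composition of two steps. This is only an illustration under stated assumptions: the function names are hypothetical, and the toy word-lookup translator and keyword-based classifier below are stand-ins for the augmented MT systems and the English sentiment analyzer, not the systems used in the thesis.

```python
from typing import Callable

def classify_code_mixed_sentiment(sentence: str,
                                  translate: Callable[[str], str],
                                  english_sentiment: Callable[[str], str]) -> str:
    """Translate a code-mixed sentence to English, then classify it."""
    english = translate(sentence)
    return english_sentiment(english)

# Toy stand-in for an augmented MT system: word-by-word lookup,
# leaving words that are already English untouched.
TOY_LEXICON = {"bahut": "very", "achhi": "good", "thi": "was"}

def toy_translate(sentence: str) -> str:
    return " ".join(TOY_LEXICON.get(word, word) for word in sentence.split())

# Toy stand-in for an off-the-shelf English sentiment system.
def toy_english_sentiment(sentence: str) -> str:
    words = set(sentence.lower().split())
    if words & {"good", "great"}:
        return "positive"
    if words & {"bad", "terrible"}:
        return "negative"
    return "neutral"

label = classify_code_mixed_sentiment(
    "movie bahut achhi thi", toy_translate, toy_english_sentiment)
```

Passing the two steps as callables keeps the sketch independent of any particular MT or sentiment system, which mirrors the idea of augmenting existing systems rather than replacing them.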
We take our work on code-mixing further by developing a hybrid architecture that combines neural networks with an attention mechanism and linguistic features to classify the sentiment of code-mixed sentences collected from online social networks such as Facebook and Twitter. We report results showing that our approach outperforms previous methods for this task.
Finally, we demonstrate that a similar hybrid architecture can be used for other NLP tasks as well, such as clickbait detection and aggression identification in online social media posts. Our approach improved on the results obtained by previous methods for the task of clickbait detection. The results of these experiments are also presented in this thesis.