Abstract
Web 2.0, through platforms such as blogs, social networks, microblogs, and forums,
allows users to freely write content on the Internet in order to provide, share, and use information.
These types of platforms are among the most visited websites, and interest in them
continues to grow. Given the increasingly important role that social media plays
in daily life, a growing body of research aims to mine
social media content, or to evaluate its linguistic aspects, in order to better understand its
dynamics (Golder et al., 2011) [22].
This user-generated content has a major drawback: the informal nature of the language used.
Non-standard abbreviations, contracted spelling variations, and casual grammatical structure are just some
of the characteristics of social media language.
Over the past few decades, sociolinguists have been interested in a phenomenon called “code-mixing”,
which has been observed in social media data. Code-mixing refers to the embedding of linguistic units
such as phrases, words, or morphemes of one language into an utterance of another language. It is
frequently seen in user-generated content on social media platforms such as Facebook and Twitter, especially
among multilingual users. Apart from its inherent linguistic complexity, the analysis of code-mixed content
poses further challenges owing to spelling variations, transliteration, and non-adherence
to a formal grammar.
Because such data is pervasive across social media, there is a need to understand it.
Any downstream Natural Language Processing task requires tools that can process and analyse code-mixed
data. The first steps towards understanding this data are language identification and word
normalisation, which together recover the standard form of a noisy social media sentence.
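As a rough illustration of what word-level language identification and normalisation produce, the following toy sketch tags each token with a language label and a normalised form. The lexicons, the label names (`hi`, `en`, `unk`), and the function `identify_and_normalise` are hypothetical and chosen only for this example; they are not the models, labels, or data used in this thesis, and a dictionary lookup is only a stand-in for a real identification or normalisation model.

```python
# Toy illustration only: these tiny lexicons are hypothetical and are NOT
# the thesis's data. Each maps a noisy spelling to an assumed standard form.
HINDI_LEXICON = {"nhi": "nahin", "nahi": "nahin", "h": "hai", "hai": "hai"}
ENGLISH_LEXICON = {"gud": "good", "good": "good", "movie": "movie", "thx": "thanks"}


def identify_and_normalise(sentence):
    """Return a list of (token, language_label, normalised_form) triples."""
    results = []
    for token in sentence.lower().split():
        if token in HINDI_LEXICON:
            results.append((token, "hi", HINDI_LEXICON[token]))
        elif token in ENGLISH_LEXICON:
            results.append((token, "en", ENGLISH_LEXICON[token]))
        else:
            # Unknown tokens are passed through unchanged.
            results.append((token, "unk", token))
    return results


for triple in identify_and_normalise("Movie gud nhi h"):
    print(triple)
```

A lookup of this kind cannot handle unseen spellings or words that are valid in both languages, which is part of why the task calls for annotated data and dedicated systems.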
In this thesis, we have developed a system for language identification and word normalisation for
Hindi-English code-mixed social media text (CMST). We have provided annotation guidelines for our
system, after analysing the complex nature of the dataset used.
Using this system, we have released a dataset of 1446 code-mixed Hindi-English sentences along
with the associated language and normalisation labels. To the best of our knowledge, this is the first attempt at creating an annotated linguistic resource for this language pair, and we have made it publicly available.
We have also performed experiments with shallow parsing, in an attempt to build a complete pipeline
from raw data to shallow-parsed data. Our pipeline consists of four modules: Language Identification,
Normalisation, POS Tagging, and Shallow Parsing. To the best of our knowledge, we are the first to attempt
shallow parsing on code-mixed social media text. This system has been released online.
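
The four-module structure described above can be pictured as a chain of stages, each adding one layer of annotation to every token. The sketch below is a minimal skeleton under stated assumptions: the stage functions are placeholders that emit dummy labels, and the field names (`lang`, `norm`, `pos`, `chunk`) are hypothetical. It shows only the order of the modules, not the released system.

```python
from typing import Callable, Dict, List

Token = Dict[str, str]
Stage = Callable[[List[Token]], List[Token]]


# Placeholder stages: each adds one field with a dummy value.
# A real system would call a trained model in each of these functions.
def language_identification(tokens: List[Token]) -> List[Token]:
    return [{**t, "lang": "unk"} for t in tokens]


def normalisation(tokens: List[Token]) -> List[Token]:
    return [{**t, "norm": t["text"]} for t in tokens]


def pos_tagging(tokens: List[Token]) -> List[Token]:
    return [{**t, "pos": "X"} for t in tokens]


def shallow_parsing(tokens: List[Token]) -> List[Token]:
    return [{**t, "chunk": "O"} for t in tokens]


# Module order follows the description in the text.
PIPELINE: List[Stage] = [
    language_identification,
    normalisation,
    pos_tagging,
    shallow_parsing,
]


def run_pipeline(raw_sentence: str) -> List[Token]:
    """Split on whitespace, then apply every stage in order."""
    tokens: List[Token] = [{"text": word} for word in raw_sentence.split()]
    for stage in PIPELINE:
        tokens = stage(tokens)
    return tokens


for token in run_pipeline("yeh movie"):
    print(token)
```

Ordering the stages this way lets each later module read the labels produced by the earlier ones, for example POS tagging the normalised form rather than the raw token.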