Abstract
As text processing systems become increasingly central to human-computer interactions—such as
voice assistants, chatbots, and search engines—it is vital that these technologies support natural, flexible, and inclusive modes of communication. Language usage is far from uniform and varies significantly
across regions, cultures, and communities. A particularly widespread phenomenon in multilingual societies is code-mixing, where speakers fluidly incorporate linguistic units from two or more languages
within a single utterance or conversational context. This form of language usage is especially common
in countries like India, where multilingualism is the norm. However, contemporary NLP systems often struggle with such input because code-mixed text is under-represented in the conventional corpora on which they are trained and evaluated. There is therefore an urgent need for robust NLP pipelines that can handle code-mixed input effectively and equitably.
This thesis addresses key challenges in processing code-mixed text by proposing comprehensive
solutions for its analysis, benchmarking, and generation. Research presented in this thesis advances the
field through several core contributions.
Syntactic Code-Mixing Metric (SyMCoM): We propose SyMCoM, a novel metric that quantifies syntactic code-mixing by analyzing the source language associated with each part-of-speech (PoS) tag in a sentence. The metric offers a linguistically grounded alternative to existing language-ID-based metrics, enabling deeper insights into the structure of code-mixed text and more nuanced analysis of corpora and systems.
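To illustrate the idea, the sketch below computes a SyMCoM-style score from language-identified, PoS-tagged tokens. It is only an illustration under assumed definitions (a per-PoS score equal to the normalized difference in token counts between the two languages, aggregated over tags as frequency-weighted absolute values); the exact formulation and normalization are given in the thesis.

from collections import Counter, defaultdict

def symcom_sketch(tokens, lang1="en", lang2="hi"):
    """Illustrative SyMCoM-style score (assumed formulation, not the thesis's exact one).

    tokens: list of (language_id, pos_tag) pairs for one sentence.
    Per-PoS score  = (n_lang1 - n_lang2) / (n_lang1 + n_lang2).
    Sentence score = sum over PoS tags of |per-PoS score|, weighted by tag frequency.
    Scores near 0 mean both languages contribute evenly to a tag;
    scores near 1 mean one language dominates it.
    """
    per_pos = defaultdict(Counter)
    for lang, pos in tokens:
        if lang in (lang1, lang2):          # skip language-neutral tokens
            per_pos[pos][lang] += 1

    total = sum(sum(c.values()) for c in per_pos.values())
    if total == 0:
        return {}, 0.0

    pos_scores, sentence_score = {}, 0.0
    for pos, counts in per_pos.items():
        n1, n2 = counts[lang1], counts[lang2]
        score = (n1 - n2) / (n1 + n2)
        pos_scores[pos] = score
        sentence_score += (n1 + n2) / total * abs(score)
    return pos_scores, sentence_score

# Example: a Hindi-matrix sentence with an embedded English noun.
tokens = [("hi", "PRON"), ("hi", "DET"), ("en", "NOUN"), ("hi", "VERB"), ("hi", "AUX")]
print(symcom_sketch(tokens))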
Acceptability of Code-Mixed Text: We introduce CLINE, a large-scale dataset of English–Hindi code-mixed sentences annotated with human judgments of their acceptability. Through empirical analysis, we show that existing code-mixing metrics often fail to distinguish acceptable from unacceptable code-mixing. We further demonstrate that multilingual language models such as XLM-RoBERTa and Llama, when fine-tuned appropriately, can effectively model these human acceptability judgments, offering a path towards more human-aligned evaluation of code-mixed text.
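As a concrete illustration of this fine-tuning setup, the following sketch trains an XLM-RoBERTa acceptability classifier with the Hugging Face Trainer. The file names, column names, and hyperparameters are placeholders, not the actual CLINE release format or the configuration used in the thesis.

# Minimal sketch: fine-tuning XLM-RoBERTa as a binary acceptability classifier.
# File names, column names ("sentence", "label"), and hyperparameters are
# illustrative placeholders, not the actual CLINE setup.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2)        # acceptable vs. unacceptable

data = load_dataset("csv", data_files={"train": "cline_train.csv",
                                       "dev": "cline_dev.csv"})
data = data.map(lambda ex: tokenizer(ex["sentence"], truncation=True), batched=True)

args = TrainingArguments(output_dir="xlmr-acceptability",
                         per_device_train_batch_size=16,
                         num_train_epochs=3)
Trainer(model=model, args=args,
        train_dataset=data["train"], eval_dataset=data["dev"],
        tokenizer=tokenizer).train()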
A Cross-Lingual Task-Oriented Dialogue Dataset for Hindi and English-Hindi: We develop and release Hindi and English–Hindi versions of a multi-domain, task-oriented dialogue dataset. This dataset
supports both natural language understanding (NLU) and generation (NLG) tasks and provides a valuable benchmark for evaluating multilingual and code-mixed capabilities of large language models in
realistic, task-based scenarios.
Model Merging Strategies for Code-Mixed Scenarios: To improve performance on code-mixed tasks, we explore novel strategies for adapting pre-trained models, and evaluate whether monolingual and code-mixed data can be combined through model merging. Our results indicate that model merging is an effective approach to adapting pre-trained models for code-mixed tasks, frequently outperforming the traditional pipeline of continued pre-training followed by fine-tuning.
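The thesis evaluates several merging strategies; the sketch below shows only the simplest variant, uniform parameter averaging of a monolingually adapted checkpoint and a code-mixed fine-tuned checkpoint, with placeholder model paths.

# Minimal sketch of one simple merging strategy: uniform weight averaging of a
# monolingual and a code-mixed checkpoint. The paths are placeholders; the
# thesis compares more merging strategies than this uniform average.
from transformers import AutoModelForSequenceClassification

mono = AutoModelForSequenceClassification.from_pretrained("path/to/monolingual-checkpoint")
mixed = AutoModelForSequenceClassification.from_pretrained("path/to/codemixed-checkpoint")

mixed_state = mixed.state_dict()
merged_state = {}
for name, p_mono in mono.state_dict().items():
    if p_mono.dtype.is_floating_point:
        merged_state[name] = 0.5 * p_mono + 0.5 * mixed_state[name]   # average the weights
    else:
        merged_state[name] = p_mono                                   # e.g. integer buffers

mono.load_state_dict(merged_state)
mono.save_pretrained("merged-model")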
CodeMixToolkit: Finally, we present CodeMixToolkit, a modular and extensible framework that standardizes the pipeline for working with code-mixed data. The toolkit provides utilities for data access and preprocessing, model training, and evaluation, and supports multiple NLP tasks, with a focus on English–Hindi but extensible to other language pairs. This resource aims to accelerate research and development in code-mixing by standardizing workflows, lowering entry barriers, and promoting reproducibility.
We conclude by discussing the limitations of our approaches, including challenges related to data
availability, generalization across domains, and evaluation frameworks. We also outline future directions
for advancing code-mixed NLP, such as improved representation learning, transfer learning techniques,
and deeper linguistic analysis. Overall, this thesis presents a comprehensive framework for processing code-mixed language, enabling the development of NLP systems that handle code-mixed text effectively and are better suited to the needs of multilingual users.