Abstract
Transformer-based models have rapidly advanced natural language processing (NLP), yet their application to low-resource Indic languages remains underexplored, limiting equitable access to digital knowledge. This thesis investigates the encoding capabilities and robustness of multilingual Transformer models for Indic languages, culminating in practical NLP applications that enhance
knowledge accessibility. By addressing linguistic diversity and digital inclusion, the study aligns with
the urgent need to bridge information gaps for millions of Indic language speakers, particularly in India,
where regional languages like Telugu face significant resource constraints.
Our work begins by introducing IndicSentEval, a novel dataset of approximately 47,000 sentences across six Indic languages: Hindi, Telugu, Marathi, Kannada, Urdu, and Malayalam. This
dataset enables a comprehensive probing analysis of nine multilingual Transformer models (seven universal, two Indic-specific) across eight linguistic properties, including surface, syntactic, and semantic
features. The findings reveal that while models encode English properties consistently, performance for
Indic languages varies, with Indic-specific models outperforming universal ones. Extending this analysis, our study evaluates model robustness against 13 perturbations, such as word dropping and shuffling,
finding that universal models exhibit greater resilience, particularly under noun and verb perturbations.
These insights, detailed in Chapters 3 and 4, highlight model-specific strengths and weaknesses, informing the design of robust NLP systems. Moreover, the development of IndicSentEval and the resulting insights into model behavior underscore the critical role of digitized linguistic data in enabling effective downstream NLP tasks for low-resource languages.
Building on these findings, the thesis develops a practical application to enhance Telugu Wikipedia,
addressing the knowledge access gap for 95.7 million Telugu speakers. Leveraging NLP techniques
like translation, transliteration, and template generation, the study generates 8,929 high-quality movie-domain articles, adhering to Wikipedia's standards for word count, images, and infoboxes. This contribution, detailed in Chapter 5, demonstrates the real-world impact of robust NLP systems, fostering digital inclusion for low-resource language communities.