IIIT

Learning-Based Variable Impedance Control for Coordinated Dual-Arm Manipulation

Author(s): Bollimuntha Shreya
Advisor(s): K Madhava Krishna

Masters

June '26
Report no: IIIT/TH//
Center of RRC

Abstract

This thesis develops and analyzes an adaptive framework for coordinated dual-arm manipulation that
integrates reinforcement learning with optimization-based variable impedance control. Building on the
DA-VIL approach, the work addresses the core challenge of dual-arm tasks: maintaining stable, precise
coordination under changing object properties and strongly coupled arm-object dynamics.
The proposed pipeline uses a learned policy (trained with PPO) to predict time-varying Cartesian
stiffness parameters, and an optimization (QP)-based controller that converts those impedance settings
into joint accelerations and torques while enforcing physical constraints (joint limits, velocity/torque
bounds, and end-effector geometry).
The thesis presents design choices for the policy and reward structure (including an EMA regularizer
for smooth stiffness trajectories), the QP formulation for simultaneous impedance and postural tasks,
and extensive MuJoCo evaluations on a diverse set of large objects with varying masses and geometries.
Quantitative results show consistent reductions in object trajectory-tracking error and improved robustness compared to baseline controllers (fixed impedance, optimization-only, and straightforward
RL+IC). Ablations highlight the importance of smooth stiffness predictions and the combined RL+QP
architecture. The thesis concludes with a discussion of limitations, potential extensions to assembly and
obstacle-aware planning, and plans for sim-to-real transfer and hardware validation.
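
The abstract does not give the form of the EMA regularizer, so the following is only a minimal sketch of one plausible version: each predicted stiffness vector is penalised by its squared distance from an exponential moving average of earlier predictions. The function name `ema_smoothness_penalty` and the decay factor `alpha` are assumptions made for illustration, not the thesis's definitions.

```python
def ema_smoothness_penalty(stiffness_seq, alpha=0.9):
    """Sum of squared deviations of each stiffness vector from a running EMA.

    stiffness_seq: list of per-step stiffness vectors (lists of floats).
    alpha: EMA decay factor (hypothetical value, not taken from the thesis).
    """
    ema = list(stiffness_seq[0])  # initialise the EMA at the first prediction
    penalty = 0.0
    for k_t in stiffness_seq[1:]:
        # Penalise the jump away from the smoothed history.
        penalty += sum((k - e) ** 2 for k, e in zip(k_t, ema))
        # Fold the new prediction into the running average.
        ema = [alpha * e + (1.0 - alpha) * k for k, e in zip(k_t, ema)]
    return penalty
```

In a reward of this kind, the penalty would be subtracted (with some weight) from the tracking reward, so a policy that oscillates between stiffness values scores worse than one that changes them gradually.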

Intelligibility Estimation and Articulatory Subsystem Analysis for Telugu Post-Stroke Dysarthric Speech

Author(s): Gouravarapu Shanmukha Sai Keerthi
Advisor(s): Chiranjeevi Yarra

Masters

June '26
Report no: IIIT/TH//
Center of LTRC

Abstract

Stroke often results in motor speech disorders such as dysarthria, leading to reduced speech intelligibility and impaired communication. Accurate assessment of speech intelligibility and identification of
underlying articulatory deficits are essential for effective rehabilitation. However, existing approaches
largely rely on subjective evaluation or global scoring, offering limited support for targeted therapy
planning, particularly in under-resourced languages such as Telugu.
This work presents an objective and interpretable framework for assessing speech intelligibility and
articulatory impairments in Telugu-speaking post-stroke individuals. A phonemically structured speech
corpus was developed, comprising recordings from 13 stroke participants along with perceptual intelligibility ratings obtained from human evaluators. In addition, speech data from 25 age-matched healthy
participants were collected as part of a pilot study to establish a reference dataset. Word-level intelligibility was estimated using Dynamic Time Warping (DTW) applied to embeddings derived from the
Wav2Vec2-Large-XLSR-53-Telugu model, and validated against listener judgments.
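
As a rough illustration of the DTW step, the sketch below aligns two sequences of frame embeddings and returns a length-normalised alignment cost. It assumes the embeddings have already been extracted as lists of non-zero float vectors; the cosine local distance and the normalisation by `n + m` are illustrative choices, not details confirmed by the abstract.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def dtw_distance(seq_a, seq_b, dist=cosine_distance):
    """Classic dynamic time warping cost between two embedding sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # skip a frame of seq_b
                                 cost[i][j - 1],      # skip a frame of seq_a
                                 cost[i - 1][j - 1])  # match both frames
    return cost[n][m] / (n + m)  # normalise so long words are not penalised
```

A lower cost between a patient's word and a reference production of the same word would then be read as higher estimated intelligibility.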
To enable detailed analysis beyond global scoring, a minimal-pair-based articulatory mapping framework was proposed. By analyzing phonological contrasts such as place of articulation and voicing, the
framework identifies consistent phoneme-level errors and maps them to probable articulatory subsystems. This approach provides a meaningful link between acoustic speech deviations and underlying
physiological mechanisms.
In addition, a combined analysis of word-level and passage-level speech was performed to capture
both articulatory precision and performance in conversational speech. The results highlight the heterogeneous nature of post-stroke speech impairment, with deficits often localized to specific phoneme
groups or arising from coordination challenges in continuous speech. The findings also emphasize the
importance of considering dialectal variation in speech assessment.
Beyond overall intelligibility assessment, the proposed framework enables phoneme-level analysis
of speech errors. By identifying consistently misarticulated phonemes and mapping them to underlying
articulatory subsystems, it provides interpretable insights that can support targeted therapy planning. In
summary, the proposed approach employs phonemic-level assessment to identify speech impairments and
provides articulator-specific insights, enabling therapists to design targeted interventions using phonetic-level measures. This approach contributes toward the development of data-driven and personalized
rehabilitation strategies for dysarthric speech in Telugu.

Modelling Polycrisis in the Western Himalayas: Interdependent Risks and Systemic Vulnerabilities

Author(s): Aishani Pandey
Advisor(s): Aniket Alam

Masters

June '26
Report no: IIIT/TH//
Center of CEH

Abstract

This thesis examines polycrisis as a condition of interdependent and mutually reinforcing crises, focusing on the Western Himalayan region encompassing Himachal Pradesh and Uttarakhand. In recent
years, the term polycrisis has gained prominence to describe situations in which environmental, economic, political, and social stresses overlap in ways that blur causal boundaries and complicate policy
responses. While existing studies have extensively analysed individual crises in the Western Himalayas
such as climate change impacts, agricultural stress, migration, and disaster risk, these processes are often
examined in isolation. This thesis argues that such siloed approaches are insufficient for understanding
the region's systemic vulnerability.
The central contribution of this work is the development of a structured modeling framework that
represents polycrisis as an interconnected system rather than a collection of parallel challenges. Our
approach begins by identifying four main domains of polycrisis: environmental, economic, political, and
social. Within these, we select key variables (nodes) such as climate change, migration,
agricultural productivity, and disaster preparedness and governance, each broken down into concomitant
sub-factors (subnodes). Using an extensive review of peer-reviewed literature, government reports, and
region-specific case studies, the thesis systematically maps both first-order relationships between nodes
and their subnodes and second-order relationships among the subnodes themselves. This two-level mapping
reveals feedback loops and cascading pathways that are often obscured in domain-specific analyses.
Across the two mapping iterations, covering first-order (node ↔ subnode) relationships in the first and second-order (subnode ↔ subnode) relationships in the second, the thesis identifies and documents approximately 120 relationships, which are categorised using a five-point ordinal
scale ranging from Very Strong to Negligible. This categorisation is grounded in empirical evidence
and conceptual studies centred on the Western Himalayas or comparable mountain ranges. The resulting network highlights the centrality of climatic variables, the mediating role of governance, and the
reinforcing interactions between agricultural stress and migration.
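
To make the idea of a strength-weighted relationship network concrete, here is a small sketch that scores nodes by weighted degree. Everything specific in it is an assumption for illustration: the abstract names only the endpoints of the scale (Very Strong, Negligible), so the three middle labels and the numeric encoding are hypothetical, and the toy edges and their strengths are invented and do not come from the thesis.

```python
from collections import defaultdict

# Hypothetical numeric encoding of a five-point ordinal scale.
STRENGTH = {"Very Strong": 4, "Strong": 3, "Moderate": 2, "Weak": 1, "Negligible": 0}

def weighted_degree(edges):
    """Return {node: sum of strengths of all relationships touching it}."""
    score = defaultdict(int)
    for source, target, label in edges:
        weight = STRENGTH[label]
        score[source] += weight
        score[target] += weight
    return dict(score)

# Toy edges invented for illustration only; not results from the thesis.
toy_edges = [
    ("climate change", "agricultural productivity", "Very Strong"),
    ("climate change", "disaster preparedness", "Strong"),
    ("climate change", "migration", "Moderate"),
    ("agricultural productivity", "migration", "Strong"),
    ("migration", "agricultural productivity", "Weak"),
]

# Nodes ordered from most to least connected by total relationship strength.
ranking = sorted(weighted_degree(toy_edges).items(), key=lambda kv: -kv[1])
```

Weighted degree is only the simplest centrality measure; a fuller analysis of feedback loops would need to follow directed paths through the graph rather than counting adjacent edges.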
The framework addresses how multiple, co-occurring crises interact across environmental, economic,
political, and social systems to constitute a regional polycrisis, and proposes a systematic model to
represent and quantify these linkages. It serves as a proof of concept for simulating cascading effects
across socio-ecological systems and for informing anticipatory policy responses in mountainous regions
under stress.

Voices of Dissent: A Multimodal Computational Framework for the Analysis of Protest Music

Author(s): Utsav Shekhar
Advisor(s): Radhika Mamidi

Masters

June '26
Report no: IIIT/TH//
Center of LTRC

Abstract

Music has historically functioned as a primary vehicle for political expression, transforming personal struggles into collective narratives of resistance. While the cultural and historical significance of
protest music is well documented, its specific linguistic and acoustic signatures, the quantifiable features that differentiate it from general music, remain largely underexplored. This thesis presents Voices
of Dissent, a comprehensive multimodal computational framework designed to analyze and characterize protest music through the dual lenses of natural language processing and audio analysis. To address
the scarcity of specialized resources, we first curate a novel, balanced multimodal dataset comprising
458 protest songs (sourced via Wikidata) and 370 non-protest songs (matched by time period and genre
diversity using GPT-4 inference). Our methodology follows a multi-step approach. First, we extract
linguistic and audio features, including lexical diversity, rhyme density, sentiment polarity, and spectral flux,
to isolate the stylistic dimensions of dissent in lyrics and audio. Second, we evaluate a suite of transformer-based models to assess the performance of deep embeddings in capturing protest intent across
text and audio modalities. A key technical contribution of this work is the application of source separation to decompose audio tracks into vocal and accompaniment stems. By conducting controlled mixing
experiments, we quantify the relative influence of the "message" (vocal delivery) versus the "medium"
(instrumental arrangement) in the perception of protest. Our results indicate that protest lyrics exhibit
significantly higher repetition and more negative valence compared to non-protest counterparts. Furthermore, text-based models (91.10% F1-score) slightly outperform audio-based models (90.62% F1-score),
though the high performance of both suggests that protest signals are deeply embedded in both modalities. To validate these computational findings on the audio side, we conduct a human annotation study
involving 50 participants, revealing a strong correlation between automated feature extraction and human perception of musical energy, vocal roughness, and melodic disjunctness. This research provides
empirical grounding for the study of protest music, offering a scalable framework for researchers in
musicology, sociolinguistics, and digital activism to analyze the evolving language of resistance in the
digital age.
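
Two of the features named above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the thesis's feature extractor: lexical diversity is computed here as a plain type-token ratio over whitespace tokens, and spectral flux as the Euclidean norm of the change between consecutive magnitude spectra, which are assumed to be precomputed and passed in as lists of floats.

```python
def type_token_ratio(lyrics):
    """Unique tokens divided by total tokens; lower values mean more repetition."""
    tokens = lyrics.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def spectral_flux(magnitude_frames):
    """Per-frame flux: size of the change between consecutive magnitude spectra."""
    flux = []
    for prev, curr in zip(magnitude_frames, magnitude_frames[1:]):
        flux.append(sum((c - p) ** 2 for p, c in zip(prev, curr)) ** 0.5)
    return flux
```

Under this definition, a chant-like lyric that repeats a short refrain yields a low type-token ratio, which is one simple way the higher repetition reported for protest lyrics could show up numerically.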

Wearable Plantar Pressure Sensing for Surface Characterization and Hardness Mapping

Author(s): Jampana Koundinya Varma
Advisor(s): Aftab M. Hussain

Masters

June '26
Report no: IIIT/TH//
Center of CVEST

Abstract

Wearable plantar pressure measurement systems have become important in biomechanics, rehabilitation, and gait analysis. By capturing foot-ground interaction data, these systems enable researchers and
clinicians to evaluate balance, diagnose pathological gait, design personalised footwear, and monitor patient recovery. Existing research in this domain has predominantly focused on extracting insights about
the human body, with applications centred on musculoskeletal health, motor control, and performance
assessment. However, despite the rich information embedded in plantar pressure signals, their potential
to characterise properties of the external environment, particularly the surfaces on which humans walk,
has received little to no attention.
This thesis addresses this unexplored dimension by investigating whether plantar pressure data can
be leveraged to infer surface characteristics, explicitly focusing on hardness. Unlike conventional approaches to surface analysis, which rely on specialised equipment or destructive testing, wearable plantar pressure systems provide a low-cost, portable, and unobtrusive means of measurement. The central
research question guiding this work is whether pressure patterns generated by the foot during locomotion can be systematically mapped to the mechanical properties of the underlying surface.
To explore this question, we developed a low-cost in-shoe sensor array capable of capturing plantar
pressure distributions in real time. Experiments were conducted by collecting walking trials across
surfaces of varying hardness, ranging from soft foam and natural ground to rigid concrete and tiled
flooring. The recorded plantar pressure profiles were then analysed to identify distinctive signatures
associated with different surface types. Statistical and machine learning techniques were employed to
classify surfaces based on the extracted features, enabling quantitative assessment of surface hardness
from wearable data.
The results demonstrate that surface hardness exerts a measurable influence on plantar pressure distribution, and that these differences can be reliably captured through the proposed in-shoe system. Our
findings establish, for the first time, that wearable plantar pressure measurements can be repurposed for
surface characterisation. This represents a conceptual shift in how wearable sensing technologies are
perceived, broadening their utility from human biomechanics to context-aware environmental interaction.
The implications of this work extend across several domains. In rehabilitation and sports science,
knowledge of surface properties can inform safer training and recovery protocols. In robotics, surface
characterisation through human-robot shared sensing could improve terrain adaptation strategies. From an urban perspective, large-scale deployment of such systems could support real-time monitoring of
ground conditions in public spaces. Overall, this thesis contributes a methodological innovation and a
new application space for wearable plantar pressure systems, bridging the gap between human-centered
sensing and surface characterisation.

Word Memorability in Cued Recall: Cue-Target Differences and Semantic Structure

Author(s): Anuska Maity
Advisor(s): Vishnu Sreekumar

Masters

June '26
Report no: IIIT/TH//
Center of CSL

Abstract

Memorability, the tendency for an item to be consistently remembered across individuals, is often
treated as an intrinsic property of stimuli, reflecting fixed characteristics of items themselves. However, models of cue-based retrieval predict that memorability should also depend on the functional role
an item plays during retrieval. Using a large cued-recall dataset in which each word served both as a
cue and as a target, we tested this prediction directly. Cue memorability showed substantially greater
cross-participant reliability than target memorability, indicating that the cognitive processes initiating retrieval are more consistent across individuals than those involved in recovering specific memory traces.
To identify the sources of this dissociation, we modeled cue and target memorability using a broad set
of psycholinguistic and semantic features. Number of senses, concreteness, and word frequency selectively supported cue effectiveness, whereas number of incoming associates and morphemic complexity
predicted target retrievability. However, at the level of semantic categories, cue and target memorability converged almost perfectly, consistent with prior work (1). This result suggests that category-level
structure captures stable conceptual regularities that shape memorability across retrieval roles. Together,
these findings suggest that memorability reflects an interaction between semantic structure and retrieval
processes, with distinct lexical properties supporting the effectiveness of items as cues versus their retrievability as targets.

Extraction, Representation and Analysis of Spatial Knowledge in Domain Specific Texts

Author(s): Nitin Ramrakhiyani
Advisor(s): Vasudeva Varma Kalidindi

PhD

May '26
Report no: IIIT/TH//
Center of LTRC

Abstract

Domain-specific narrative texts are seen in education, governance, healthcare, legal and multiple
other domains. These texts encompass different kinds of general and domain-specific knowledge. To
make these niche narrative texts understandable and usable both by humans and machines, they need to
be processed through Natural Language Processing (NLP) mechanisms and made conducive for use in
desired domain specific applications. A specific kind of knowledge is spatial knowledge. It is useful for
enabling the fundamental cognitive skill of spatial understanding that underpins essential human interactions with the environment. Spatial knowledge enables one to perform basic navigation and orientation,
such as planning routes, understanding and explaining locations of objects and people, and describing
objects by their spatial features - shape, size, and appearance. Pre-trained Language Models (PLMs)
have fundamentally transformed NLP by serving as powerful, generalized language processors trained
on massive datasets. Their primary importance lies in achieving state-of-the-art performance across
diverse NLP tasks, often without the need for extensive task-specific retraining.
The work done as part of the thesis combines the above three aspects and comprises automatic
extraction, representation and analysis of spatial knowledge in domain-specific narratives with the help
of PLMs. Further, the work includes a demonstration of the utility of the extracted knowledge in niche
use-cases. More specifically, the work is divided into two major aspects: PLMs as reasoners of spatial
knowledge and PLMs as containers of spatial knowledge. As part of the first aspect, we consider
an existing knowledge representation, the Message Sequence Chart (MSC), and enrich it with spatial
knowledge through a novel task of motion reasoning. Through this task, we
check how PLMs function as reasoners of dynamic spatial knowledge. As part of the second aspect, we
gauge PLMs for their familiarity with a specific kind of spatial knowledge: facts in world geography. We
construct a moderately sized dataset of facts related to 24 different geographical relations and check
multiple encoder and decoder PLMs through a mask filling exercise. Additionally, we enrich these
PLMs through fine-tuning to enhance their understanding of this particular kind of spatial knowledge.
The major research contributions of the thesis include (i) proposition of the MSC as a narrative
representation formalism, (ii) enrichment of the MSC with spatial knowledge for enabling it as a spatiotemporal knowledge representation, (iii) proposition of the novel task of motion reasoning under the
sub-area of spatial reasoning, (iv) the masked sentences dataset for the effort on gauging geography
knowledge in PLMs and its enrichment, and (v) proposition of novel applications, namely crime report
gap finding, enhancing spatial fidelity in the filmstrip representation, niche tourism search, and Wikidata
augmentation. For several of these contributions, the developed datasets and code have been made publicly available for reproducibility and to encourage further research. Furthermore, a subset of the
work in the thesis was carried out as part of a collaboration project between my organization, TCS
Research, and IIIT Hyderabad. The code/data artifacts and learning obtained as part of the PhD have
been employed in several Proof-of-Concepts (PoCs) developed and showcased for clients and internal
teams at TCS.

Stochastic Gravitational Waves From Graviton Bremsstrahlung in Inflaton Decay into Massive Spin 3/2 Particles

Author(s): Mihika Sanghi
Advisor(s): Diganta Das

Masters

May '26
Report no: IIIT/TH//
Center of CCNSB

Abstract

The detection of a stochastic gravitational wave (GW) background would provide a direct probe of
the early universe, offering unique insights into the high-energy physics of the inflationary era. This
thesis investigates the production of primordial gravitational waves during the post-inflationary reheating period, specifically focusing on the graviton bremsstrahlung accompanying the perturbative decay
of the inflaton into pairs of massive spin-3/2 particles. Such spin-3/2 fields, described by the Rarita-Schwinger formalism, are physically motivated candidates for dark matter and arise naturally in supergravity theories as gravitinos.
Modeling the inflaton as a classical field coherently oscillating in a polynomial potential of the form
$V(\phi) \sim \phi^{k}$ (with $k \geq 2$), we derive the decay rates for both the standard two-body decay ($\phi \to \psi_\mu \psi_\mu$)
and the three-body gravitational bremsstrahlung ($\phi \to \psi_\mu \psi_\mu h_{\mu\nu}$). We perform a fully numerical analysis of the reheating dynamics by solving the coupled Boltzmann equations for the inflaton, radiation,
and gravitational wave energy densities.
Our results demonstrate that the resulting GW spectrum exhibits a characteristic multi-peak structure,
which arises from the distinct contributions of the inflaton's harmonic oscillation modes. We identify
a hierarchy in the spectral contributions of these harmonics that is preserved across different potential
shapes. Although the predicted signal amplitude currently lies below the sensitivity curves of proposed
detectors such as the Einstein Telescope (ET) and Big Bang Observer (BBO), the spectral features offer
a theoretical framework for constraining the microscopic physics of inflation and the properties of the
hidden sector.

Rethinking Scale in Low-Resource ASR

Author(s): Srihari Bandarupalli
Advisor(s): Kavita Vemuri

Masters

May '26
Report no: IIIT/TH//
Center of

Abstract

Automatic Speech Recognition (ASR) has witnessed rapid progress with the advent of self-supervised
learning and large-scale multilingual models. However, ASR for morphologically complex and low-resource languages remains challenging due to limited labeled data, orthographic inconsistencies, tokenization difficulties, and the computational demands of large-scale systems. Perso-Arabic languages
such as Persian, Arabic, and Urdu exemplify these difficulties, where script variations, rich morphology, complex word formation, and inconsistent text normalization complicate both preprocessing and
subword modeling for ASR systems.
Current state-of-the-art approaches largely rely on scaling model size and data volume, often overlooking the importance of linguistic structure, script-aware tokenization, and domain relevance. For
low-resource settings, scale-driven strategies are computationally prohibitive and frequently suboptimal.
This thesis investigates whether efficient and competitive ASR performance can instead be achieved
through linguistically informed preprocessing, morphologically-aware tokenization, and targeted cross-lingual continual pretraining, without dependence on excessively large models.
We first introduce a unified script-aware preprocessing pipeline tailored to Perso-Arabic languages,
addressing normalization, tokenization, phonetic parsing, and text cleaning to reduce orthographic noise
and improve data consistency. Building upon this foundation, we construct a scalable multilingual unlabeled speech corpus of approximately 3,000 hours across Persian, Arabic, and Urdu. Using this corpus,
we systematically evaluate continual pretraining strategies under different initialization conditions and
integrate SentencePiece-based subword modeling within a CTC-based Wav2Vec framework, carefully
adapting token segmentation to handle word boundaries and repeated character collapse. All downstream fine-tuning and evaluation are conducted in a monolingual setting to isolate transfer effects.
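
The interaction between subword tokens and repeated-character collapse comes from the standard CTC decoding rule: consecutive identical frame labels are merged, then blank labels are dropped. The sketch below shows that generic rule only; it is not the thesis's adapted segmentation, and the choice of `0` as the blank id is an assumption.

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Merge consecutive repeats, then drop blanks (standard CTC decoding rule)."""
    out = []
    prev = None
    for tok in frame_ids:
        # Emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if tok != prev and tok != blank_id:
            out.append(tok)
        prev = tok
    return out
```

For example, `ctc_greedy_collapse([1, 1, 0, 1, 2, 2, 0])` keeps the second `1` only because a blank separates it from the first run. A genuinely doubled token in the target text therefore needs an intervening blank, which is one reason token segmentation has to be handled with care in a CTC setup.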
Our results demonstrate that a 300M parameter model, when adapted through targeted continual
pretraining and morphologically-aware tokenization, achieves performance competitive with systems
over five times larger. Notably, the proposed model outperforms Whisper Large v3 on Persian and yields
strong results on Arabic and Urdu despite using substantially fewer parameters and labeled resources.
These findings challenge the prevailing assumption that ASR quality scales primarily with model
size. Instead, we show that data relevance, adaptation strategy, and linguistically informed tokenization
play a more critical role in low-resource scenarios. This thesis provides a practical and resource-efficient
pathway toward competitive ASR for Perso-Arabic languages and offers a reproducible experimental
template that may inform future multilingual extensions.

Textual Attribute Recognition for Documents

Author(s): Jinka Jyothi Swaroopa
Advisor(s): Ravi Kiran Sarvadevabhatla

Masters

May '26
Report no: IIIT/TH//
Center of CVIT

Abstract

Understanding document content requires not only recognizing textual characters but also capturing
textual attributes such as bold, italic, underline, and strikeout, which convey emphasis, hierarchy, and
semantic intent. These attributes play a critical role in real-world documents, including legal records,
administrative forms, educational materials, and historical manuscripts, where formatting often encodes
meaning that cannot be inferred from text alone. Despite their importance, Textual Attribute Recognition
(TAR) has received limited attention compared to traditional Optical Character Recognition (OCR), and
existing approaches remain inadequate for handling the complexity of real-world, multilingual, and
layout-rich documents.
A key limitation of prior TAR methods lies in their treatment of words as isolated visual units. Attribute recognition, however, is inherently context-dependent, as distinctions such as bold versus regular
text or underline versus structural lines often require comparison with neighboring words or surrounding layout elements. Methods that ignore this context suffer from ambiguity and reduced accuracy.
Conversely, approaches that attempt to incorporate full-document context using transformer-based architectures face significant computational challenges, due to the quadratic complexity of attention over
large numbers of word instances. These issues are further exacerbated in multilingual settings, where
script diversity, font variations, and document noise introduce additional variability.
This thesis addresses these limitations through TexTAR, a context-aware, multi-task Transformer
framework for scalable and robust Textual Attribute Recognition. The central idea is to achieve an
effective balance between contextual modeling and computational efficiency by operating on compact
local contexts rather than entire documents. To this end, we introduce a Context Window (CW) based
data selection pipeline, in which each target word is processed together with a small set of its spatially
nearest neighbors, selected using a weighted Chebyshev distance. This formulation enables the model to
capture essential contextual cues required for attribute discrimination, while avoiding the computational
overhead associated with global attention.
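
The neighbour selection described above can be sketched as follows. This is a minimal illustration assuming each word is represented by the (x, y) centre of its bounding box; the weights `wx` and `wy`, the window size `k`, and the function names are hypothetical and are not values or identifiers from the thesis.

```python
def weighted_chebyshev(p, q, wx=1.0, wy=1.0):
    """Weighted Chebyshev distance: the larger of the scaled x and y gaps."""
    return max(wx * abs(p[0] - q[0]), wy * abs(p[1] - q[1]))

def context_window(centers, target_idx, k=4, wx=1.0, wy=1.0):
    """Indices of the k words nearest to the target under the weighted distance."""
    target = centers[target_idx]
    others = [i for i in range(len(centers)) if i != target_idx]
    others.sort(key=lambda i: weighted_chebyshev(target, centers[i], wx, wy))
    return others[:k]
```

Raising `wy` relative to `wx` makes vertical offsets more expensive, so the window tends to fill with words on the same text line as the target. That is a plausible reason for weighting the two axes differently, although the abstract does not state the motivation.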
Building on this representation, TexTAR employs a transformer architecture that processes sequences
of word images within each context window. To effectively encode spatial relationships and document
layout, the model incorporates a two-dimensional Rotary Positional Embedding (2D RoPE)-style mechanism, allowing it to model both horizontal and vertical positional dependencies. This design enables
the network to reason about relative positioning and alignment patterns that are crucial for distinguishing visually similar attributes. Furthermore, TexTAR is trained in a multi-task learning setting, jointly predicting multiple groups of attributes, which improves generalization and robustness across diverse
document types and formatting styles.
In addition to the model architecture, this thesis introduces MMTAD (Multilingual Multi-domain
Text Attribute Dataset), a large-scale dataset comprising over a million word-level annotations across
multiple languages and document domains. The dataset captures a wide range of real-world variability, including different scripts, layouts, font styles, and noise conditions. To address the inherent class
imbalance in attribute distributions, targeted data augmentation strategies are employed to simulate
attribute-specific variations, enabling the model to learn more balanced and discriminative representations.
Extensive experimental evaluation demonstrates that TexTAR consistently outperforms existing approaches across multiple benchmarks and settings. The results highlight the importance of context-aware modeling for accurate attribute recognition, particularly in challenging scenarios involving ambiguous visual cues or complex layouts. At the same time, the proposed context window formulation
ensures that the model remains computationally efficient and scalable to large document collections.
More broadly, this thesis argues that document understanding systems must move beyond text recognition toward richer representations that incorporate visual attributes as first-class signals. By integrating
efficient context modeling, spatially-aware transformer architectures, and large-scale multilingual supervision, TexTAR advances the state of the art in Textual Attribute Recognition and provides a practical
foundation for downstream applications such as document digitization, structured information extraction, and accessibility. In doing so, it contributes toward more comprehensive and semantically aware
document intelligence systems.