Abstract
For about fifty years, hardware designers relied on semiconductor scaling laws, such as Moore's Law and Dennard scaling, to achieve gains in performance. The industry grew accustomed to processor performance per watt doubling approximately every 18 months. Over the past decade, however, these scaling predictions have broken down. With the old certainties of scaling silicon geometries gone, the industry is already changing. The number of cores on a single die has increased, and SoCs such as mobile phone processors combine application-specific co-processors, GPUs, and DSPs in different configurations to maintain the performance scaling trend. However, in a post-Dennard, post-Moore world, further processor specialization is needed to achieve performance improvements. Emerging applications such as artificial intelligence and vision demand computational performance that conventional architectures cannot deliver. This inevitably leads to the creation of special-purpose, domain-specific accelerators. A domain-specific accelerator (DSA) is a processor, or set of processors, optimized to perform a narrow range of computations, tailored to the needs of the algorithms in its domain. For example, an AI accelerator might contain an array of processing elements with multiply-accumulate functionality to execute matrix operations efficiently. Google's Tensor Processing Unit (TPU), the Neural Engine in Apple's M1 processor, and the Xilinx Vitis-AI Engine are popular ASIC-based DSAs. ASIC-based DSAs provide significant gains in performance and power efficiency. However, owing to their long design cycles and high engineering costs, they may not keep pace with the ever-evolving computation landscape. Field-Programmable Gate Arrays (FPGAs) offer advantages over Application-Specific Integrated Circuits (ASICs) in certain scenarios due to their flexibility and shorter development times.
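The multiply-accumulate (MAC) operation mentioned above is the core primitive that such accelerators replicate in hardware. A minimal sketch (not from this work; the function name is illustrative) of how an array of MAC elements evaluates a matrix product:

```python
# Hypothetical sketch: each inner-loop step below corresponds to one
# multiply-accumulate (MAC) operation that a processing element of an
# AI accelerator performs in hardware.
def mac_matmul(a, b):
    """Multiply an (m x k) matrix a by a (k x n) matrix b using
    explicit multiply-accumulate steps."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0  # the accumulator register of one MAC element
            for p in range(k):
                acc += a[i][p] * b[p][j]  # one MAC operation
            c[i][j] = acc
    return c
```

In a hardware DSA, the i/j/p loops are unrolled spatially across the array of MAC elements, so many of these accumulations proceed in parallel rather than sequentially.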
FPGAs are programmable hardware: users can configure their functionality after manufacturing, whereas ASICs are hardwired for specific tasks. This flexibility makes FPGAs suitable for prototyping, testing, and adapting to changing requirements without costly chip redesigns. Developing an ASIC is a complex and time-consuming process, often taking months to years; FPGAs can be programmed and deployed much faster, which is advantageous when speed to market is crucial. ASICs are also expensive to design and manufacture, especially for low-volume or rapidly changing applications, while FPGAs have a lower initial cost because they do not require custom chip fabrication. This cost-effectiveness is notable for small production runs and research projects. FPGAs are therefore rapidly emerging as an alternative to custom ASICs for designing DSAs, owing to their low power consumption and high degree of parallelism. Designing a DSA on an FPGA requires carefully calibrating the device's compute and memory resources to achieve optimal throughput. Hardware Description Languages (HDLs) such as Verilog have traditionally been used to design FPGA hardware. HDLs are generic and not geared towards any particular domain, and the user must expend considerable effort to describe the hardware at the register-transfer level. A recent trend is to use existing HDLs to create carefully handwritten templates suited to a specific domain; a compiler framework then weaves these templates together to generate a DSA that accelerates the domain's computations. This approach requires expensive design synthesis and FPGA re-flashing to accelerate different algorithms from the domain, which may not be feasible in many edge and deeply embedded applications. Furthermore, cloud companies now offer FPGA-based acceleration as a service, supported at the backend by large clusters of custom accelerators.
In contrast to this fixed-function hardware approach, where the DSA is tied to a specific function, an alternative design approach based on overlay accelerators is gaining prominence. An overlay is a processor-like DSA that is synthesized and flashed onto the FPGA once, yet is flexible enough to process a broad class of computations through soft reconfiguration. Over the last few years, several design approaches for overlay accelerators have emerged. Some overlay designs resemble a processor controlled through an instruction set; the Xilinx Vitis-AI Engine is a prime example of such a design. However, this homogeneous approach often leads to inefficiencies arising from the fetch-decode-execute model of processor design: instruction-based overlays spend significant energy on instruction overhead rather than on actual computation. To address this shortcoming of homogeneous overlay accelerators, a heterogeneous design methodology is proposed, in which a heterogeneous overlay accelerator contains multiple small homogeneous units.