Abstract
Recent advances in large language models (LLMs), vision-language models (VLMs), and text-to-video (T2V) systems have sparked growing excitement about the future of education. Imagine a learning experience where complex ideas can be explained through synchronized text, visuals, narration,
and video; where educational content can be generated on demand; and where learning becomes more
interactive, accessible, and personalized. While these possibilities are increasingly within reach, an important question remains: do current multimodal AI systems truly understand and connect information
across different modalities in ways that meaningfully support learning? This thesis explores that question through the development of benchmarks and evaluation frameworks for multimodal educational
AI.
A familiar challenge for learners is following presentation slides while simultaneously listening to a
speaker. Slides are often dense, explanations move quickly, and learners can easily lose track of where
to focus their attention. To address this, we explore a method that automatically highlights slide regions
relevant to the ongoing narration, helping guide learners toward the most important visual information
in real time. Supporting this effort, we introduce a dataset designed to evaluate how effectively models
can align spoken language with slide content. The dataset consists of 14 presentation videos and 150
slides, where each slide is paired with its corresponding audio segment and transcript. It captures a wide
range of visual and textual elements--including figures, equations, tables, and diverse layouts--making
it a challenging and realistic benchmark for multimodal alignment.
We then turn to physics education, where visual intuition is often essential for understanding abstract
concepts, yet creating high-quality educational animations remains time-consuming and labor-intensive.
Motivated by recent advances in T2V generation, we investigate whether these models can create videos
that are not only visually realistic, but also pedagogically meaningful. To this end, we introduce PhyEduVideo, a benchmark designed to evaluate the ability of T2V models to generate simple, accurate, and
educational visualizations for physics concepts. Each concept is decomposed into fine-grained teaching
points and paired with carefully crafted prompts intended to elicit clear visual explanations. Beyond
visual quality and motion smoothness, we evaluate whether the generated videos actually convey the
underlying physics correctly. Our findings reveal a striking gap: while current models often produce
visually appealing outputs, their conceptual understanding of physics remains inconsistent.
Finally, we address a broader challenge in scientific communication: knowledge is often scattered
across research papers, slides, presentation videos, and explanatory videos. Although these modalities
frequently convey the same core ideas, they provide complementary perspectives and rarely contain explicit links between one another. To bridge this gap, we introduce the Multimodal Conference Dataset
(MCD), the first benchmark to integrate all of these modalities from the same set of research works. Using MCD, we evaluate both embedding-based approaches and vision-language models on their ability
to discover fine-grained correspondences across formats. Our experiments show that while VLMs are
generally robust, they still struggle with precise alignment, whereas embedding-based methods effectively capture text-visual relationships but tend to isolate equations and symbolic content into separate
regions of the embedding space. These findings expose important limitations in current multimodal
systems while also highlighting promising directions for future research in multimodal scientific understanding.
Together, these contributions move toward educational AI systems that are not only visually impressive, but also accurate, aligned, and pedagogically meaningful--ultimately enabling richer, more
connected, and more effective learning experiences.