Abstract
Document question answering has witnessed substantial progress with the development of large vision-language models. Despite these advances, most existing systems remain limited in their interpretability,
as they typically generate answers without explicitly and precisely identifying the visual evidence on
which those answers are based. This deficiency is particularly pronounced in real-world documents,
where supporting evidence may appear in curved text, be fragmented across multiple spatially separated
regions, be embedded within tables or charts, or exist at different semantic levels such as phrases, lines,
and blocks. Under such conditions, conventional grounding approaches based on coarse bounding boxes
are often insufficient, as rectangular regions fail to capture the irregular and fine-grained structure of
document evidence.
This thesis addresses these limitations through M3Grounder, a novel mask-based framework for
grounded document question answering. In contrast to prior approaches that represent evidence using
rectangular coordinates, the proposed framework formulates grounding as a segmentation problem
and predicts pixel-level evidence masks that are tightly aligned with the answer generation process.
The model autoregressively generates answer text interleaved with special grounding tokens, enabling
explicit correspondence between individual answer spans and their associated supporting regions. This
formulation naturally supports multi-span grounding, in which a single answer may depend on multiple
disjoint regions, and multi-granular grounding, in which evidence may be localized at the phrase, line,
or block level depending on the semantic scope required by the question.
To realize this formulation, the thesis introduces a set of architectural and training mechanisms
designed to improve the precision and faithfulness of document grounding. In particular, M3Grounder
integrates document understanding with segmentation-based evidence prediction, employs a
bleed-suppression objective to reduce spillover into irrelevant neighboring regions, and incorporates
a hierarchical enclosure mechanism that encourages spatial consistency across phrase-, line-, and
block-level masks.
By enforcing containment relationships across these granularities, the framework more faithfully reflects
the hierarchical structure of document layouts and supports more interpretable evidence attribution. Collectively, these components enable finer and more reliable localization than prior box-based grounding
methods, especially in documents containing irregular text geometry or densely structured layouts.
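As an illustration of the containment relationship described above, the following minimal sketch measures how far a set of phrase-, line-, and block-level masks departs from strict nesting. It is not the objective used in M3Grounder; the function names `containment_violation` and `hierarchy_violation`, the plain nested-list mask representation, and the unweighted sum are all assumptions made for this example.

```python
from typing import List

# Assumed representation: a binary mask is a list of rows of 0/1 values.
Mask = List[List[int]]


def containment_violation(inner: Mask, outer: Mask) -> float:
    """Fraction of inner-mask pixels that fall outside the outer mask."""
    inner_area = 0
    outside = 0
    for inner_row, outer_row in zip(inner, outer):
        for i, o in zip(inner_row, outer_row):
            if i:
                inner_area += 1
                if not o:
                    outside += 1
    # An empty inner mask cannot violate containment.
    return outside / inner_area if inner_area else 0.0


def hierarchy_violation(phrase: Mask, line: Mask, block: Mask) -> float:
    """Sum of phrase-in-line and line-in-block violations (hypothetical weighting)."""
    return containment_violation(phrase, line) + containment_violation(line, block)


# Toy 2x4 masks: the phrase lies inside the line, and the line inside the block.
phrase = [[0, 1, 1, 0], [0, 0, 0, 0]]
line = [[1, 1, 1, 1], [0, 0, 0, 0]]
block = [[1, 1, 1, 1], [1, 1, 1, 1]]

# A phrase mask that spills one pixel below its line.
leaky_phrase = [[0, 1, 1, 0], [0, 1, 0, 0]]
```

A value of zero indicates that the masks are properly nested; the leaky phrase mask yields a positive phrase-in-line violation because one of its three pixels lies outside the line mask.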
Beyond model design, this thesis also introduces GroundingDocQA, a large-scale data generation
pipeline for grounded document question answering, and GroundingDocQA-Bench, a benchmark for
the joint evaluation of answer correctness and grounding fidelity. The dataset construction pipeline is designed to support a broad range of document types, including text-rich pages, curved-text documents,
and graphics-heavy documents such as charts and infographics. This enables supervision for a wide
variety of grounded reasoning scenarios that extend beyond conventional extractive settings. The
benchmark complements standard answer-centric evaluation by explicitly assessing whether predicted
evidence is spatially accurate, semantically appropriate, and faithful to the generated answer. In this way,
the thesis promotes a more comprehensive evaluation paradigm for document question answering, one
that treats evidence quality as a first-class criterion alongside answer accuracy.
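To make the notion of spatial accuracy concrete, the sketch below computes intersection-over-union between a predicted and a reference evidence mask. Mask IoU is a standard segmentation measure and is used here only as an assumed example; the abstract does not state which metrics GroundingDocQA-Bench reports, and the name `mask_iou` and the nested-list mask format are hypothetical.

```python
from typing import List

# Assumed representation: a binary mask is a list of rows of 0/1 values.
Mask = List[List[int]]


def mask_iou(pred: Mask, ref: Mask) -> float:
    """Intersection-over-union of two binary masks of equal shape."""
    intersection = 0
    union = 0
    for pred_row, ref_row in zip(pred, ref):
        for p, r in zip(pred_row, ref_row):
            if p and r:
                intersection += 1
            if p or r:
                union += 1
    # Convention assumed here: two empty masks agree perfectly.
    return intersection / union if union else 1.0


# Toy example: the masks overlap in one pixel and cover three pixels in total.
pred_mask = [[1, 1, 0], [0, 0, 0]]
ref_mask = [[0, 1, 1], [0, 0, 0]]
iou = mask_iou(pred_mask, ref_mask)
```

A score of this kind captures spatial overlap only; whether the evidence is semantically appropriate and faithful to the generated answer would require separate checks.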
Extensive empirical results demonstrate that mask-based grounding constitutes a more expressive
and effective alternative to coarse region attribution for document question answering. The proposed
framework yields improvements in evidence faithfulness, better captures irregular and spatially distributed
support, and provides stronger grounding performance across challenging document layouts. More
broadly, this thesis argues that the next generation of document question answering systems should not be
evaluated solely by whether they answer correctly, but also by whether they can provide precise, verifiable,
and semantically appropriate evidence for their predictions. In this respect, the thesis advances document
question answering toward more interpretable, trustworthy, and practically deployable systems.