Abstract
Understanding document content requires not only recognizing textual characters but also capturing
textual attributes such as bold, italic, underline, and strikeout, which convey emphasis, hierarchy, and
semantic intent. These attributes play a critical role in real-world documents, including legal records,
administrative forms, educational materials, and historical manuscripts, where formatting often encodes
meaning that cannot be inferred from text alone. Despite this importance, Textual Attribute Recognition
(TAR) has received limited attention compared to traditional Optical Character Recognition (OCR), and
existing approaches remain inadequate for handling the complexity of real-world, multilingual, and
layout-rich documents.
A key limitation of prior TAR methods lies in their treatment of words as isolated visual units. Attribute recognition, however, is inherently context-dependent, as distinctions such as bold versus regular
text or underline versus structural lines often require comparison with neighboring words or surrounding layout elements. Methods that ignore this context suffer from ambiguity and reduced accuracy.
Conversely, approaches that attempt to incorporate full-document context using transformer-based architectures face significant computational challenges because the cost of attention grows quadratically with the
number of word instances. These issues are further exacerbated in multilingual settings, where
script diversity, font variations, and document noise introduce additional variability.
This thesis addresses these limitations through TexTAR, a context-aware, multi-task Transformer
framework for scalable and robust Textual Attribute Recognition. The central idea is to achieve an
effective balance between contextual modeling and computational efficiency by operating on compact
local contexts rather than entire documents. To this end, we introduce a Context Window (CW) based
data selection pipeline, in which each target word is processed together with a small set of its spatially
nearest neighbors, selected using a weighted Chebyshev distance. This formulation enables the model to
capture essential contextual cues required for attribute discrimination, while avoiding the computational
overhead associated with global attention.
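As an illustration of the context window idea, the following is a minimal sketch of neighbor selection under a weighted Chebyshev distance. It assumes each word is represented by the center of its bounding box; the function names, the window size `k`, and the axis weights `wx` and `wy` are hypothetical choices made for this example and are not values stated in this abstract.

```python
def weighted_chebyshev(p, q, wx=1.0, wy=1.0):
    """Weighted Chebyshev distance: the larger of the two weighted axis offsets."""
    return max(wx * abs(p[0] - q[0]), wy * abs(p[1] - q[1]))


def context_window(centers, target_idx, k=4, wx=1.0, wy=2.0):
    """Return indices of the k words spatially nearest to the target word.

    centers: list of (x, y) word-box centers (an assumed representation).
    wx, wy:  illustrative axis weights; wy > wx penalizes vertical offsets,
             which favors neighbors on the same text line.
    """
    target = centers[target_idx]
    # Every word except the target is a candidate neighbor.
    candidates = [i for i in range(len(centers)) if i != target_idx]
    # Python's sort is stable, so ties keep their original reading order.
    candidates.sort(key=lambda i: weighted_chebyshev(centers[i], target, wx, wy))
    return candidates[:k]


# Toy layout: the target word sits at the origin.
centers = [(0, 0), (10, 0), (0, 10), (30, 0), (0, 3)]
neighbors = context_window(centers, target_idx=0, k=2)
```

With a fixed window of k neighbors per word, the attention cost per window is bounded by the window size rather than by the total number of words on the page.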
Building on this representation, TexTAR employs a transformer architecture that processes sequences
of word images within each context window. To effectively encode spatial relationships and document
layout, the model incorporates a positional mechanism in the style of two-dimensional Rotary Positional Embedding (2D RoPE), allowing it to model both horizontal and vertical positional dependencies. This design enables
the network to reason about relative positioning and alignment patterns that are crucial for distinguishing visually similar attributes. Furthermore, TexTAR is trained in a multi-task learning setting, jointly predicting multiple groups of attributes, which improves generalization and robustness across diverse
document types and formatting styles.
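To make the positional mechanism concrete, the sketch below shows one common way to extend rotary embeddings to two dimensions: the first half of a feature vector is rotated according to the x coordinate and the second half according to the y coordinate. This is an assumed, generic formulation for illustration only; the abstract does not specify the exact variant used in TexTAR, and the function names and the frequency base are hypothetical.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive feature pairs of vec by position-dependent angles."""
    d = len(vec)  # must be even
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out


def rope_2d(vec, x, y):
    """Encode x in the first half of vec and y in the second half.

    len(vec) must be divisible by 4 so that each half splits into pairs.
    """
    half = len(vec) // 2
    return rope_1d(vec[:half], x) + rope_1d(vec[half:], y)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


q = [0.5, -1.0, 2.0, 0.3, 1.1, -0.7, 0.2, 0.9]
k = [1.5, 0.4, -0.6, 0.8, -1.2, 0.1, 0.7, -0.3]

# Both pairs of positions share the same offset (dx=2, dy=3).
score_a = dot(rope_2d(q, 3, 5), rope_2d(k, 1, 2))
score_b = dot(rope_2d(q, 12, 9), rope_2d(k, 10, 6))
```

Because each pair is rotated by an angle proportional to position, the query-key dot product depends only on the horizontal and vertical offsets between two words, not on their absolute coordinates, so `score_a` and `score_b` agree up to floating-point error.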
In addition to the model architecture, this thesis introduces MMTAD (Multilingual Multi-domain
Text Attribute Dataset), a large-scale dataset comprising over a million word-level annotations across
multiple languages and document domains. The dataset captures a wide range of real-world variability, including different scripts, layouts, font styles, and noise conditions. To address the inherent class
imbalance in attribute distributions, targeted data augmentation strategies are employed to simulate
attribute-specific variations, enabling the model to learn more balanced and discriminative representations.
Extensive experimental evaluation demonstrates that TexTAR consistently outperforms existing approaches across multiple benchmarks and settings. The results highlight the importance of context-aware modeling for accurate attribute recognition, particularly in challenging scenarios involving ambiguous visual cues or complex layouts. At the same time, the proposed context window formulation
ensures that the model remains computationally efficient and scalable to large document collections.
More broadly, this thesis argues that document understanding systems must move beyond text recognition toward richer representations that incorporate visual attributes as first-class signals. By integrating
efficient context modeling, spatially aware transformer architectures, and large-scale multilingual supervision, TexTAR advances the state of the art in Textual Attribute Recognition and provides a practical
foundation for downstream applications such as document digitization, structured information extraction, and accessibility. In doing so, it contributes toward more comprehensive and semantically aware
document intelligence systems.