Abstract
The rapid growth in the digitization of books and manuscripts demands an immediate
solution for accessing them electronically. Such a solution would make valuable archived
materials searchable and usable by users in pursuit of their objectives. This requires
research in the area of document image understanding, specifically in
document image recognition and document image retrieval. In the last three
decades, significant advances have been made in the recognition of documents written in
Latin-based scripts. There have been many excellent attempts at building robust document
analysis systems in industry, academia and research labs. Intelligent recognition
systems are commercially available for certain scripts. However, there has been
only limited research effort on the recognition of indigenous scripts of African and
Indian languages. In addition, the diversity of archived printed documents poses many
challenges to document analysis and understanding. Hence, in this work, we explore
novel approaches for understanding and accessing the content of document image
collections that vary in quality and printing.
Around 2,500 languages are spoken in Africa. Some of these languages have their
own indigenous scripts, in which a large body of printed documents is available in
various institutions. Digitizing these documents enables us to harness already
available language technologies for local information needs and development. We
present an OCR system for converting digitized documents in the Amharic language. Amharic is
the official language of Ethiopia. An extensive literature survey suggests that this is the
first attempt that reports the challenges in the recognition of indigenous African
scripts and a possible solution for Amharic script. Research in the recognition of
Amharic script faces major challenges due to (i) the use of a large number of characters
in the writing and (ii) the existence of a large set of visually similar characters. Here
we extract a set of optimal discriminant features to train the classifier. Recognition
results are presented on real-life degraded documents such as books, magazines and
newspapers to demonstrate the performance of the recognizer.
Present OCRs are typically designed to work on a single page at a time. We argue
that the recognition scheme for a collection (like a book) could be considerably
different from that designed for isolated pages. The motivation here is therefore to
exploit all the information available during the recognition process, which earlier
systems have not used effectively, to enhance the performance of the recognizer. To this end,
we propose a self-adaptable OCR framework for the recognition of document image
collections. This approach enables the recognizer to learn incrementally and adapt
to document image collections for performance improvement. We employ learning
procedures to capture the relevant information available online, and feed it back to
update the knowledge of the system. Experimental results show the effectiveness
of our design for improving the performance of the recognizer on-the-fly, thereby
adapting to a specific collection.
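The feedback loop described above can be illustrated with a minimal sketch. The abstract does not state which classifier or update rule is used, so everything here is an assumption made for illustration: the class name AdaptiveNearestMean, the nearest-mean classifier, the distance margin used as a confidence measure, and the min_margin threshold are all hypothetical.

```python
import numpy as np


class AdaptiveNearestMean:
    """Hypothetical nearest-mean classifier that adapts from its own confident outputs."""

    def __init__(self, means):
        # means: dict mapping a class label to its initial mean feature vector
        self.means = {k: np.asarray(v, dtype=float) for k, v in means.items()}
        # Each initial mean counts as one observation (an assumption of this sketch).
        self.counts = {k: 1 for k in means}

    def predict(self, x):
        """Return (label, margin): margin is the distance gap to the runner-up class."""
        x = np.asarray(x, dtype=float)
        dists = sorted((float(np.linalg.norm(x - m)), k) for k, m in self.means.items())
        best_dist, best_label = dists[0]
        margin = dists[1][0] - best_dist if len(dists) > 1 else float("inf")
        return best_label, margin

    def adapt(self, samples, min_margin=1.0):
        """Classify samples in order; feed confident ones back into the class means."""
        labels = []
        for x in samples:
            label, margin = self.predict(x)
            if margin >= min_margin:
                # Running-mean update using the sample the classifier just labelled.
                n = self.counts[label] + 1
                self.means[label] += (np.asarray(x, dtype=float) - self.means[label]) / n
                self.counts[label] = n
            labels.append(label)
        return labels
```

In this sketch, samples that fall below the margin threshold are still labelled but are not used to update the model, which limits the risk of reinforcing uncertain decisions.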
No robust OCR is available for the indigenous scripts of African and Indian languages,
and designing such a system is a long-term process that delays access to the archived
document images. Hence we explore the application of a word spotting approach for
retrieval of printed document images without explicit recognition. To this end, we
propose an effective word image matching scheme that achieves high performance in
the presence of script variability, printing variations, degradations and word-form
variations. A novel partial matching algorithm is designed for morphological matching of
word form variants in a language. We employ a feature extraction scheme that
extracts local features by scanning vertical strips of the word image. These features are
then combined based on their discriminatory potential. We present detailed
experimental results of the proposed approach on English, Amharic and Hindi documents.
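A minimal sketch of scanning a word image in vertical strips is given below. The abstract does not list the actual features, so the three measurements used here (ink density, upper profile, lower profile), the function name strip_features, and the strip_width parameter are illustrative assumptions only.

```python
import numpy as np


def strip_features(word_img, strip_width=1):
    """Scan a binary word image left to right in vertical strips.

    word_img: 2D array in which nonzero pixels are ink.
    Returns an array of shape (num_strips, 3) holding, per strip,
    (ink density, upper profile, lower profile). Profiles are the first
    and last inked row divided by the image height. A strip with no ink
    yields (0, 0, 0), a convention chosen for this sketch.
    """
    img = np.asarray(word_img) > 0
    height, width = img.shape
    feats = []
    for x0 in range(0, width, strip_width):
        strip = img[:, x0:x0 + strip_width]
        inked_rows = np.flatnonzero(strip.any(axis=1))
        if inked_rows.size == 0:
            feats.append((0.0, 0.0, 0.0))
            continue
        density = float(strip.mean())
        upper = inked_rows[0] / height
        lower = inked_rows[-1] / height
        feats.append((density, upper, lower))
    return np.array(feats, dtype=float)
```

Each word image becomes a sequence of per-strip feature vectors, so two word images of different widths can be compared by aligning their sequences; the partial matching algorithm mentioned above is not reproduced here.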
Searching and retrieval from document image collections is challenging because of
the scalability issues and computational time. We design an efficient indexing scheme
to speed up searching for relevant document images. We organize the word set by
clustering word images into different groups based on their similarity. Each of these clusters
corresponds to variations in printing, morphology, and quality. This is achieved by
mapping IR principles (which are widely used in text processing) for relevance ranking.
Once we cluster the list of index terms that define the content of the document, they
are indexed using an inverted data structure. This data structure also provides scope
for incremental clustering in a dynamic environment. The indexing scheme enables
effective search and retrieval in the image domain that is comparable with text search
engines. We demonstrate the application of th