Abstract
With the rapid growth of camera-based mobile devices, applications that answer questions such as
“What does this sign say?” are becoming increasingly popular. This is related to the problem of optical
character recognition (OCR), where the task is to recognize text occurring in images. The OCR problem
has a long history in the computer vision community. However, the success of OCR systems is largely
restricted to text from scanned documents. Scene text, such as text occurring in images captured with a
mobile device, exhibits large variability in appearance. Recognizing scene text remains challenging,
even for state-of-the-art OCR methods. Many scene understanding methods successfully recognize objects and
regions such as roads, trees, and sky in an image, but tend to ignore the text on sign boards.
Towards filling this gap, we devise robust techniques for scene text recognition and retrieval in this
thesis.
This thesis presents three approaches to address the scene text recognition problem. First, we propose
a robust text segmentation (binarization) technique and use it to improve recognition performance.
We pose binarization as a pixel labeling problem and define a novel energy function whose minimization
yields the binary segmentation. This makes it possible to use standard OCR systems for recognizing scene text.
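As an illustrative sketch, such a pixel labeling energy typically takes the standard unary-plus-smoothness form below; the specific potentials shown are a common instantiation, not necessarily the novel terms defined in the thesis:
\[
E(\mathbf{x}) \;=\; \sum_{p} \theta_p(x_p) \;+\; \lambda \sum_{(p,q) \in \mathcal{N}} [\,x_p \neq x_q\,], \qquad x_p \in \{\text{text}, \text{background}\},
\]
where the unary term \(\theta_p\) measures how well pixel \(p\) fits the text or background appearance model, and the Potts-style pairwise term encourages smooth segmentations over a pixel neighborhood system \(\mathcal{N}\).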
Second, we present an energy minimization framework that exploits both bottom-up and top-down cues for recognizing words extracted from street
images. The bottom-up cues are derived from detections of individual text characters in an image.
We build a conditional random field (CRF) model on these detections to jointly model the strength of the
detections and the interactions between them; the interactions encode top-down cues obtained from a
lexicon-based prior, i.e., language statistics. The optimal word represented by the text image is obtained
by minimizing the energy function corresponding to the random field model. The proposed method
significantly improves scene text recognition performance.
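For concreteness, the energy of such a CRF over \(n\) character detections can be sketched in the usual unary-plus-pairwise form; the exact potentials are defined in the thesis, and this generic form is only indicative:
\[
E(\mathbf{x}) \;=\; \sum_{i=1}^{n} \theta_i(x_i) \;+\; \sum_{(i,j) \in \mathcal{E}} \theta_{ij}(x_i, x_j),
\]
where \(x_i\) is the character label of the \(i\)-th detection, the unary term \(\theta_i\) reflects the detection strength (bottom-up cue), and the pairwise term \(\theta_{ij}\) encodes the lexicon-based prior between neighboring detections (top-down cue).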
Third, we present a holistic word recognition framework, which leverages scene text images and synthetic images generated from lexicon
words. We then recognize the text in an image by matching the scene and synthetic image features with our novel weighted dynamic time warping (DTW) approach. This approach does not require any language
statistics or language-specific character-level annotations.
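To make the matching step concrete, below is a minimal sketch of weighted dynamic time warping between two feature sequences; the per-column features and the per-dimension weighting scheme are illustrative assumptions, not the exact formulation used in the thesis.

```python
import numpy as np

def weighted_dtw(scene_feats, synth_feats, weights):
    """Weighted DTW distance between two feature sequences.

    scene_feats: (n, d) array of per-column features from the scene word image.
    synth_feats: (m, d) array of per-column features from a synthetic lexicon-word image.
    weights:     (d,) array of per-dimension feature weights (hypothetical scheme).
    """
    n, m = len(scene_feats), len(synth_feats)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Weighted Euclidean distance between the two feature vectors.
            diff = scene_feats[i - 1] - synth_feats[j - 1]
            cost = np.sqrt(np.sum(weights * diff * diff))
            # Standard DTW recurrence: match, insertion, or deletion.
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

def recognize(scene_feats, lexicon_feats, weights):
    # The lexicon word whose synthetic rendering warps to the scene image
    # with the smallest distance is returned as the recognition result.
    return min(lexicon_feats, key=lambda w: weighted_dtw(scene_feats, lexicon_feats[w], weights))
```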
Finally, we address the problem of image retrieval using textual cues, and demonstrate large-scale
text-to-image retrieval. Given the recent developments in understanding text in images, an appealing
approach to address this problem is to localize and recognize the text, and then query the database, as in
a text retrieval problem. We show that this approach, despite being based on state-of-the-art methods, is
insufficient, and propose an approach that does not rely on an exact localization and recognition pipeline.
We take a query-driven search approach, where we find approximate locations of characters in the text
query, and then impose spatial constraints to generate a ranked list of images in the database.
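As a rough illustration of this query-driven ranking, the sketch below scores an image by greedily chaining per-character detections from left to right; the scoring function, the data layout, and the greedy spatial constraint are simplifications assumed here for clarity.

```python
def score_image(char_detections, query):
    """Score one database image for a text query.

    char_detections: dict mapping a character to a list of (x, score) pairs,
                     i.e., approximate horizontal locations of that character
                     and their detection scores (hypothetical precomputed index).
    query:           the query string, e.g., "pizza".
    """
    total, prev_x = 0.0, float("-inf")
    for ch in query:
        # Spatial constraint: each query character must appear to the
        # right of the previously matched character.
        candidates = [(x, s) for (x, s) in char_detections.get(ch, []) if x > prev_x]
        if not candidates:
            return 0.0  # some query character has no plausible detection
        x, s = max(candidates, key=lambda c: c[1])
        total += s
        prev_x = x
    return total

def retrieve(database, query):
    # Rank images by descending score to produce the retrieval list.
    return sorted(database, key=lambda img: -score_image(img["chars"], query))
```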
We evaluate the proposed methods extensively on a number of scene text benchmark datasets,
namely Street View Text, ICDAR 2003, 2011, and 2013, and IIIT 5K-word, a new dataset we introduce,
and show better performance than all comparable methods. The retrieval performance is
evaluated on public scene text datasets as well as on three large datasets we introduce, namely IIIT scene text retrieval,
Sports-10K and TV series-1M.