Abstract
This thesis addresses the classification, detection and segmentation of objects in images.
We focus on object classes that undergo considerable deformation, taking the categories of cats and
dogs as a case study. Many state-of-the-art methods that perform well at detecting rigid object
categories, such as buses, airplanes and boats, perform poorly on these deformable animal
categories. The well-known difficulty of automatically distinguishing between cats and dogs in images
has been exploited in web security systems: Asirra, a system developed by Microsoft Research,
requires users to correctly select all cat images from the 12 images of cats and dogs shown to them in
order to gain access to a web service. Beyond this, classifying these animals into their breeds is
challenging even for humans. Developing machine learning methods to solve these challenging
problems requires reliable training data. Here, the popularity of cats and dogs as pets
provides an opportunity to collect such data from various sources on the internet, where they frequently
appear in images and videos (often together with people). As part of this work, we propose a novel method
for detecting cats and dogs in images, and a model for classifying images of cats and dogs according
to species. We also introduce a dataset for fine-grained classification of pet breeds, and develop
models for classifying pet images according to breed. In the process we also
segment these objects.
For detecting animals in an image, we propose a mechanism that combines a template-based object
detector with a segmentation algorithm. The template-based detector of Felzenszwalb et
al. [43] is first used to detect a distinctive part of the object, and an iterative segmentation process
then extracts the animal by minimizing an energy function defined over a conditional random field using
GraphCuts. We show quantitatively that our method works well and substantially outperforms whole-body
template-based detectors for these highly deformable object categories, and indeed achieves accuracy
comparable to the state of the art on the PASCAL VOC competition, which includes other approaches
such as bag-of-words models.
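As a concrete illustration, a GraphCut-based segmentation of this kind typically minimizes a standard energy of the form below; this is the usual formulation, given here for orientation rather than as the exact energy derived later in the thesis, and the symbols \(\lambda\), \(\beta\), \(\phi_i\) and \(\mathcal{N}\) are illustrative:
\[
E(\mathbf{x}) \;=\; \sum_{i} \phi_i(x_i) \;+\; \lambda \sum_{(i,j)\in\mathcal{N}} [\,x_i \neq x_j\,]\, e^{-\beta \lVert I_i - I_j \rVert^2},
\]
where \(x_i \in \{0,1\}\) labels pixel \(i\) as foreground or background, \(\phi_i\) is a unary appearance term (for example, the negative log-likelihood of the pixel colour \(I_i\) under foreground and background colour models), and the second term is a contrast-sensitive smoothness prior over neighbouring pixel pairs \(\mathcal{N}\). Because the pairwise term is submodular, the energy can be minimized exactly by a single graph cut; re-estimating the appearance models from the current segmentation and cutting again yields an iterative process of the kind described above.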
For the task of subcategory classification, a novel dataset for pet breed discrimination, the IIIT-OXFORD
PET dataset, is introduced. The dataset contains 7,349 annotated images of cats and dogs of
37 different breeds, collected from various sources on the internet. In addition to the
pet breed, annotations include a pixel-level segmentation of the body of each animal and a bounding
box marking its head. This dataset is the first of its kind for pet breed classification and should provide
an important benchmark for researchers working on fine-grained classification.
For the classification task, we propose a model that estimates the breed of a pet automatically from an image.
The model combines shape, captured by a deformable part model detecting the pet face, and appearance,
captured by a bag-of-words model that describes the pet fur. Two classification approaches are
discussed: a hierarchical approach, in which a pet is first classified into the dog or cat species and then
classified into its corresponding breed; and a flat approach, in which the breed is obtained directly. For the task
of breed classification on our 37-class dataset, an average accuracy of about 60% is achieved, a very
encouraging result considering the difficulty of the problem. These models are also shown to improve the
probability of breaking the challenging Asirra test by more than 30%, beating all previously published
results.
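To make the distinction between the two approaches concrete, the following self-contained sketch contrasts the flat and hierarchical decision rules over per-breed scores; the breed lists, the random scores and the 12/25 split between cat and dog breeds are illustrative placeholders, not the classifiers developed in the thesis.

    import numpy as np

    rng = np.random.default_rng(0)
    cat_breeds = [f"cat_breed_{i}" for i in range(12)]   # illustrative cat breeds
    dog_breeds = [f"dog_breed_{i}" for i in range(25)]   # illustrative dog breeds
    all_breeds = cat_breeds + dog_breeds                 # 37 classes in total

    def flat_predict(breed_scores):
        """Flat approach: pick the highest-scoring breed over all 37 classes."""
        return all_breeds[int(np.argmax(breed_scores))]

    def hierarchical_predict(species_scores, breed_scores):
        """Hierarchical approach: decide cat vs. dog first, then the breed within that species."""
        if species_scores["cat"] >= species_scores["dog"]:
            candidates = range(len(cat_breeds))                    # restrict to cat breeds
        else:
            candidates = range(len(cat_breeds), len(all_breeds))   # restrict to dog breeds
        best = max(candidates, key=lambda i: breed_scores[i])
        return all_breeds[best]

    # Dummy classifier outputs for a single test image.
    breed_scores = rng.normal(size=len(all_breeds))
    species_scores = {"cat": 0.7, "dog": 0.3}

    print(flat_predict(breed_scores))
    print(hierarchical_predict(species_scores, breed_scores))

In the hierarchical rule an error at the species level cannot be recovered at the breed level, whereas the flat rule scores all breeds jointly; the thesis compares these two regimes empirically.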