Abstract
Over the last few years, the amount of image and video data on the internet and in personal collections has been increasing rapidly. Therefore, the need to organize and search these vast collections efficiently has also grown. This has led to research in content-based retrieval and in the recognition of scenes and objects in visual data. Despite a lot of research in these areas over the last few years, the resulting techniques are still not at a deployable stage for real-world usage. To be deployable, these solutions must be not only accurate, but also efficient and scalable. All of these visual recognition problems share two major phases: a feature extraction stage and a learning stage. The feature extraction stage builds a representation of the image/video data, and the learning stage learns a function that can distinguish between the classes. In this thesis, we focus on building efficient methods for visual content recognition and detection in images and videos, proposing new ideas mainly for the learning stage. For this purpose, we start from state-of-the-art techniques and then show how our proposed ideas influence computational time and performance.
Firstly, we show the utility of state-of-the-art image representations and classification methods for large-scale semantic video retrieval. We demonstrate this on the TRECVID 2008 and TRECVID 2009 datasets, which contain videos for the retrieval of various scenes, objects and actions. We use Support Vector Machines (SVMs) as classifiers, which have been a popular choice for classification tasks in many fields, mainly because of their good generalization capability. For obtaining non-linear decision boundaries, SVMs use a kernel function. The kernel function allows finding a linear classifier in some high-dimensional feature space without actually computing the high-dimensional vectors. In many situations, we need computationally expensive non-linear functions as kernels; the linear kernel, on the other hand, is computationally inexpensive but gives poorer results in most cases.
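In equations, the kernel trick can be stated as follows (standard formulas, included here only for illustration): a kernel $K$ implicitly evaluates an inner product under a feature map $\phi$, and the SVM decision function only ever needs kernel values,

\[
K(\mathbf{x}, \mathbf{y}) = \langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle,
\qquad
f(\mathbf{x}) = \sum_{i} \alpha_i \, y_i \, K(\mathbf{x}_i, \mathbf{x}) + b .
\]

For histogram features, a widely used non-linear choice is the additive $\chi^2$ kernel, $K_{\chi^2}(\mathbf{x}, \mathbf{y}) = \sum_j 2 x_j y_j / (x_j + y_j)$.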
Another contribution of this thesis is a method for improving the performance of computationally inexpensive classifiers such as the linear SVM. For this purpose, we explore the utility of sub-categories, the sub-groupings present in the feature space of each semantic class of data. We model these sub-categories using the Structural SVM framework. We also analyze how the choice of the groupings affects the results, and present a method to learn the optimal groupings. We investigate our methods on various synthetic two-dimensional datasets and on real-world datasets, namely VOC 2007 and TRECVID 2008.
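As a simplified illustration of the sub-category idea, the sketch below trains one linear SVM per sub-grouping and scores a class by the maximum response over its sub-group classifiers. The groupings here come from plain k-means and the function names are hypothetical; this is not the Structural SVM formulation or the grouping-learning method developed in the thesis.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def train_subcategory_svms(X, y, n_sub=3):
    # One linear SVM per (class, subcategory) pair. Subcategories are
    # discovered here by k-means inside each class -- one possible choice
    # of grouping, not the learned groupings of the thesis.
    models = {}
    for c in np.unique(y):
        Xc = X[y == c]
        neg = X[y != c]
        sub = KMeans(n_clusters=n_sub, n_init=10, random_state=0).fit_predict(Xc)
        for s in range(n_sub):
            pos = Xc[sub == s]
            if len(pos) == 0:
                continue  # skip empty subcategories
            Xs = np.vstack([pos, neg])
            ys = np.concatenate([np.ones(len(pos)), -np.ones(len(neg))])
            models[(c, s)] = LinearSVC(C=1.0).fit(Xs, ys)
    return models

def predict_one(models, x):
    # Score each class by the maximum margin over its subcategory SVMs.
    scores = {}
    for (c, s), m in models.items():
        v = m.decision_function(x.reshape(1, -1))[0]
        scores[c] = max(scores.get(c, -np.inf), v)
    return max(scores, key=scores.get)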
Non-linear kernel methods yield state-of-the-art performance for image classification and object detection. However, large-scale problems require machine learning techniques of at most linear complexity, and these are usually limited to linear kernels. This unfortunately rules out gold-standard kernels such as the generalized RBF kernels (e.g. the exponential-χ² kernel). Non-linear kernels compute an inner product in a high-dimensional space different from the input space, thereby avoiding the explicit computation of the high-dimensional vectors. The function that maps an input to this high-dimensional feature vector is called the feature map; it is typically hard to compute and very high dimensional. In the literature, explicit finite-dimensional feature maps have been proposed to approximate the additive kernels (intersection, χ²) by linear ones, thus enabling the use of fast machine learning techniques in a non-linear context. An analogous technique has also been proposed for the translation-invariant RBF kernels. As part of this thesis, we complete the construction and combine the two techniques to obtain explicit feature maps for the generalized RBF kernels. Furthermore, we investigate a learning method using ℓ1 regularization to encourage sparsity in the final vector representation, and thus reduce its dimension. We evaluate this technique on the VOC 2007 detection challenge, showing when it can improve on fast additive kernels, and the trade-offs in complexity and accuracy.
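A minimal sketch of this kind of pipeline can be assembled from scikit-learn's off-the-shelf kernel approximations: chaining an explicit additive-χ² map with random Fourier features approximates the exponential-χ² (generalized RBF) kernel, after which a fast linear SVM can be trained. The parameters (gamma, n_components) and the toy data are illustrative; this uses library approximations rather than the exact construction developed in the thesis.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.kernel_approximation import AdditiveChi2Sampler, RBFSampler
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
# Toy L1-normalized histograms standing in for bag-of-features image descriptors.
X = rng.rand(200, 64)
X /= X.sum(axis=1, keepdims=True)
y = rng.randint(0, 2, size=200)

clf = make_pipeline(
    AdditiveChi2Sampler(sample_steps=2),      # explicit map for the additive chi-squared kernel
    RBFSampler(gamma=0.5, n_components=1024,  # random Fourier features for the RBF part
               random_state=0),
    LinearSVC(C=1.0),                         # fast linear learning on the explicit map
)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))

For L1-normalized histograms, the squared Euclidean distance between the additive-χ² mapped vectors approximates the χ² distance, so the RBF of that distance approximates the exponential-χ² kernel; this is what makes the composition of the two maps meaningful.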