Abstract
Building bilateral semantic associations between images and texts is among the fundamental problems in computer vision. In this paper, we study two complementary cross-modal prediction tasks: (i) predicting text(s) given an image ("Im2Text"), and (ii) predicting image(s) given a piece of text ("Text2Im"). We make no assumptions about the specific form of the text; it could be a set of labels, phrases, or even captions. We pose both tasks in a retrieval framework. For Im2Text, given a query image, our goal is to retrieve a ranked list of semantically relevant texts from an independent text corpus (i.e., texts with no corresponding images). Similarly, for Text2Im, given a query text, we aim to retrieve a ranked list of semantically relevant images from a collection of unannotated images (i.e., images without any associated textual meta-data).

We propose a novel Structural SVM based unified formulation for these two tasks. For both visual and textual data, we investigate two types of representations, based on: (1) unimodal probability distributions over topics learned using latent Dirichlet allocation, and (2) multi-modal correlations learned explicitly using canonical correlation analysis. Extensive experiments on three popular datasets (two medium-scale and one web-scale) demonstrate that our framework gives promising results compared to existing models under various settings, confirming its efficacy for both tasks.
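As a rough illustration of the retrieval setting only (not the paper's Structural SVM formulation), the sketch below uses scikit-learn's CCA to project image and text features into a shared correlated space and rank an unannotated image collection against a text query, as in Text2Im. All feature dimensions, array shapes, and data here are hypothetical placeholders.

```python
# Illustrative sketch: cross-modal retrieval through a shared CCA space.
# This is NOT the Structural SVM formulation proposed in the paper; it only
# demonstrates the Text2Im retrieval setting with representation (2) above.
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)

# Paired training data: image features and text features (e.g., LDA topic
# distributions). Shapes are arbitrary placeholders for illustration.
X_img = rng.standard_normal((500, 128))   # 500 images, 128-d visual features
X_txt = rng.standard_normal((500, 64))    # 500 texts, 64-d textual features

# Learn a shared latent space that maximizes cross-modal correlation.
cca = CCA(n_components=32)
cca.fit(X_img, X_txt)

# Project an unannotated image collection and a text query into that space.
corpus_img = rng.standard_normal((1000, 128))
query_txt = rng.standard_normal((1, 64))

corpus_proj = cca.transform(corpus_img)               # image-side projection
# sklearn projects the Y (text) side only alongside some X, so pass a dummy
# image row and keep the text projection.
_, query_proj = cca.transform(X_img[:1], query_txt)

# Text2Im: rank images by cosine similarity to the query text.
scores = cosine_similarity(query_proj, corpus_proj).ravel()
ranked = np.argsort(-scores)              # most relevant image indices first
print(ranked[:10])
```

Replacing the nearest-neighbour ranking above with a learned ranking function is where a Structural SVM style formulation would come in; the same skeleton with the roles of images and texts swapped gives the Im2Text direction.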