Abstract
The Oxford/IIIT team participated in the high-level feature extraction and interactive search tasks. A vision-only approach was used for both tasks, with no use of the text or audio information.

For the high-level feature extraction task, we used two different approaches, both based on a combination of visual features. One used an SVM classifier with a linear combination of kernels; the other used a random forest classifier. For both methods, we trained all high-level features using publicly available annotations [3]. The advantage of the random forest classifier is its speed of training and testing.

In addition, for the people feature, we took a more targeted approach. We used a real-time face detector and an upper body detector, both run on every frame.

Our best performing submission, C_OXVGG_1_1, which used a rank fusion of our random forest and SVM approaches, achieved an mAP of 0.101 and was above the median for all but one feature.

In the interactive search task, our team came third overall with an mAP of 0.158. The system used was identical to last year's, the only change being a source of accurate upper body detections.
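The rank fusion used to combine the random forest and SVM outputs can be illustrated with a minimal sketch. The abstract does not state the exact fusion rule, so the code below assumes a simple sum-of-ranks (Borda-style) scheme; the function name `rank_fusion`, the shot identifiers, and the scores are hypothetical and serve only as illustration.

```python
def rank_fusion(score_lists):
    """Fuse several {shot_id: score} dicts into one ranked list of shot ids.

    Assumption: sum-of-ranks fusion, where each classifier ranks the shots
    by descending score and the per-classifier ranks are added together.
    The actual rule used for the submission is not given in the abstract.
    """
    shots = set(score_lists[0])
    for scores in score_lists[1:]:
        if set(scores) != shots:
            raise ValueError("all classifiers must score the same shots")

    total_rank = {shot: 0 for shot in shots}
    for scores in score_lists:
        # Rank 1 is the highest-scoring shot; ties are broken by shot id.
        ordered = sorted(scores, key=lambda s: (-scores[s], s))
        for rank, shot in enumerate(ordered, start=1):
            total_rank[shot] += rank

    # A lower summed rank means a higher position in the fused list.
    return sorted(shots, key=lambda s: (total_rank[s], s))


# Hypothetical per-shot confidences from the two classifiers.
svm_scores = {"shot_a": 0.9, "shot_b": 0.2, "shot_c": 0.5, "shot_d": 0.7}
forest_scores = {"shot_a": 0.8, "shot_b": 0.3, "shot_c": 0.6, "shot_d": 0.1}

fused = rank_fusion([svm_scores, forest_scores])
print(fused)
```

Fusing on ranks rather than raw scores avoids having to calibrate the SVM and random forest outputs against each other, since the two classifiers produce confidences on different scales.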