Abstract
Video highlights are a selection of the most interesting parts of a video. The problem of highlight detection has been explored for video domains such as egocentric, sports, movie, and surveillance videos. Existing methods are limited to finding visually important parts of the video but do not necessarily learn semantics. Moreover, the available benchmark datasets consist of short, single-activity videos with muted audio, which lack any context beyond a few keyframes that can be used to understand them. In this work, we explore highlight detection in the TV series domain, which features complex interactions with the surroundings; existing methods would fare poorly at capturing the semantics of such videos. To incorporate the importance of dialogues/audio, we propose using the descriptions of shots of the video as cues for learning visual importance. Note that while the audio information is used to determine visual importance during training, highlight detection still operates using only the visual information from the videos. We use publicly available text ranking algorithms to rank the descriptions. The ranking scores are then used to train a visual pairwise shot ranking model (VPSR) to find the highlights of the video. Results are reported on TV series videos of the Video Set dataset and a season of the Buffy the Vampire Slayer TV series.
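The supervision scheme outlined above — text-derived ranking scores used to train a pairwise ranker over visual shot features — can be sketched in miniature as follows. This is an illustrative toy, not the paper's VPSR model: the linear scorer, the hinge-style update, and the hand-made features are all assumptions introduced for the example.

```python
# Minimal sketch: train a linear pairwise shot ranker so that shots whose
# descriptions rank higher (per a text ranking algorithm) also score higher
# on visual features alone. All names/values here are illustrative.
import random


def train_pairwise_ranker(features, text_scores, dim, epochs=200, lr=0.1, margin=1.0):
    """Learn weights w such that w.x_i > w.x_j whenever text_scores[i] > text_scores[j]."""
    w = [0.0] * dim
    # Supervision pairs come from the text-derived scores, not from visual labels.
    pairs = [(i, j)
             for i in range(len(features))
             for j in range(len(features))
             if text_scores[i] > text_scores[j]]
    for _ in range(epochs):
        random.shuffle(pairs)
        for i, j in pairs:
            diff = [a - b for a, b in zip(features[i], features[j])]
            s = sum(wk * dk for wk, dk in zip(w, diff))
            if s < margin:  # pair violated (or within margin): hinge-style update
                w = [wk + lr * dk for wk, dk in zip(w, diff)]
    return w


def score(w, x):
    """Visual-only importance score; at test time no text is needed."""
    return sum(wk * xk for wk, xk in zip(w, x))


if __name__ == "__main__":
    random.seed(0)
    # Toy 2-D "visual features" for three shots and their text-ranking scores.
    feats = [[1.0, 0.0], [0.5, 0.2], [0.1, 0.9]]
    text_scores = [3.0, 2.0, 1.0]
    w = train_pairwise_ranker(feats, text_scores, dim=2)
    ranked = sorted(range(len(feats)), key=lambda i: score(w, feats[i]), reverse=True)
    print(ranked)
```

At inference, only `score` is applied to visual features, mirroring the abstract's point that the text/audio side is used solely during training.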