Abstract
Parametric models that represent layout in terms of scene attributes are an attractive avenue for road scene understanding in autonomous navigation. Prior works that rely only on ground imagery are limited by the camera's narrow field of view, occlusions, and perspective foreshortening. In this paper, we demonstrate the effectiveness of aerial imagery as an additional modality to overcome these challenges. We propose a novel architecture, Unified, that combines features from both aerial and ground imagery to infer scene attributes. We evaluate quantitatively on the KITTI dataset and show that our Unified model outperforms prior works. Since this dataset is limited to road scenes close to the vehicle, we supplement the publicly available Argoverse dataset with scene attribute annotations and evaluate on far-away scenes. We show, both quantitatively and qualitatively, the importance of aerial imagery in understanding road scenes, especially in regions farther from the ego-vehicle. All code, models, and data, including scene attribute annotations on the Argoverse dataset along with the collected and processed aerial imagery, are available.