Abstract
Parsing natural scenes into semantically meaningful entities is one of the open problems in computer
vision. The problem is challenging due to the complexity present at multiple levels, ranging from
scene-object and object-object to object-layout relationships. Restricting the problem to indoor scenes
makes it somewhat more tractable. Holistic understanding of an indoor scene involves detecting objects,
recovering their 3D geometry, estimating the spatial layout of the scene, and classifying the scene type.
This understanding supports high-level tasks such as navigation, free-space estimation, object placement,
and manipulation. In this thesis, we integrate information at these various levels for efficient semantic
segmentation of cluttered indoor environments. We use appearance and geometric properties of different
entities to estimate free space and localise objects in a given indoor scene. We believe that this work
can enable a variety of applications that require semantic understanding of indoor scenes, such as
mobility assessment, robot navigation, path planning, surveillance, object manipulation, grasping,
learning object support order, visual search, and 3D reconstruction.
In this thesis, we first attempt the problem of learning and estimating free space, i.e., floor regions,
in indoor scenes from a single image. Estimating free space is challenging due to the high appearance
variability within floor and non-floor regions. Segmenting floor regions is even harder when clutter,
specular reflections, shadows, and textured floors are present in the scene. We propose a framework
that combines a generic classifier over appearance cues with floor density estimates, both trained on
a variety of indoor images. The classifier output is then adapted to each specific test image by
iteratively integrating appearance, position, and geometric cues, and a Markov Random Field (MRF) is
used to fuse these cues into the final floor segmentation. The proposed approach also remains applicable
when the scene violates common assumptions such as the Manhattan-world model or clutter confined
to wall-floor boundaries.
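The cue-integration step described above can be sketched in miniature as follows. This is an illustrative toy, not the thesis' exact formulation: the cue weights, the unary-cost construction, and the use of Iterated Conditional Modes (ICM) as the MRF solver are all assumptions made for the sketch. Each pixel receives a unary cost for the labels {non-floor, floor} from appearance, position, and geometric probability maps, plus a Potts smoothness penalty between 4-neighbours.

```python
import numpy as np

def unary_cost(appearance, position, geometry, w=(1.0, 0.5, 0.5)):
    """Per-pixel label costs from three cue maps (each a probability of 'floor').
    The weights `w` are illustrative, not the values used in the thesis."""
    wa, wp, wg = w
    p_floor = np.clip((wa * appearance + wp * position + wg * geometry)
                      / (wa + wp + wg), 1e-6, 1 - 1e-6)
    cost_floor = -np.log(p_floor)        # low cost where cues agree on 'floor'
    cost_bg = -np.log(1.0 - p_floor)
    return np.stack([cost_bg, cost_floor], axis=-1)   # shape (H, W, 2)

def icm_segment(costs, smooth=0.8, iters=10):
    """Iterated Conditional Modes: greedy minimisation of unary + Potts energy."""
    labels = costs.argmin(axis=-1)
    H, W, _ = costs.shape
    for _ in range(iters):
        for y in range(H):
            for x in range(W):
                total = costs[y, x].copy()
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < H and 0 <= nx < W:
                        # Potts penalty for disagreeing with each neighbour
                        total += smooth * (np.arange(2) != labels[ny, nx])
                labels[y, x] = total.argmin()
    return labels
```

On a toy image where the cue maps favour 'floor' in the lower half, the smoothed labelling recovers a clean floor/non-floor split; in the thesis this fusion operates over the full set of adapted per-image cues.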
Moving beyond the detection of free space, we use appearance and geometric properties to estimate
more general entities in cluttered scenes. These entities, or objects, are the basic units of any indoor
scene, free space being one of them; they appear in many configurations, from cluttered tabletops to
offices, homes, corridors, and classrooms. Owing to the high variability among indoor scenes themselves
and the intra-class variability within objects, we restrict our attention to tabletop scenarios with known
objects. Estimating the layout of tabletop scenes is challenging due to the presence of clutter, objects
whose appearance is similar to that of the table surface, object-object occlusions, and objects of
irregular shapes and sizes. We train an ensemble of classifiers over appearance cues extracted from
images of the known objects in different poses. We learn the meta-data (pose, shape) associated with
each object and estimate its pose and shape in a given cluttered scene. The approach predicts a detailed
layout of the objects present, as well as the free space on the tabletop where objects can be placed.
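The recognition-plus-meta-data idea above can be sketched as follows. This is a minimal stand-in, assuming nearest-centroid classifiers on random feature subsets with majority voting; the actual feature extraction, classifier family, and object names in the thesis differ and are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

class CentroidClassifier:
    """One ensemble member: nearest class centroid on a subset of features."""
    def __init__(self, feat_idx):
        self.feat_idx = feat_idx          # the feature subset this member sees
        self.centroids = {}               # label -> mean feature vector

    def fit(self, X, y):
        for label in np.unique(y):
            self.centroids[label] = X[y == label][:, self.feat_idx].mean(axis=0)

    def predict(self, x):
        d = {l: np.linalg.norm(x[self.feat_idx] - c)
             for l, c in self.centroids.items()}
        return min(d, key=d.get)

def train_ensemble(X, y, n_members=5, subset=4):
    """Train members on random feature subsets (illustrative diversification)."""
    members = []
    for _ in range(n_members):
        idx = rng.choice(X.shape[1], size=subset, replace=False)
        clf = CentroidClassifier(idx)
        clf.fit(X, y)
        members.append(clf)
    return members

def predict_with_metadata(members, x, metadata):
    """Majority vote over members, then look up the stored (pose, shape)."""
    votes = [m.predict(x) for m in members]
    label = max(set(votes), key=votes.count)
    return label, metadata[label]
```

Here `metadata` is a hypothetical table mapping each known object label to its learned (pose, shape) entry; in the thesis this lookup feeds the detailed tabletop layout prediction.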
We created two datasets for the above-mentioned work. The first is an RGB dataset, the “CVIT
Indoor Scene dataset”, of 110 images captured in various buildings on our campus, including cluttered
floor regions. The images cover a wide variety of indoor scenes, including classrooms, living rooms,
libraries, corridors, and views with two or three visible walls. The dataset also contains images with
varied floor textures, specular highlights, shadows, and floors cluttered by furniture or other obstacles,
where the clutter is not confined to the image boundaries. The second is an RGBD dataset, “3DMOS”,
of 50 objects of different shapes and sizes, with 15 images of each object captured from different poses
and viewpoints. It also includes 10 tabletop scenes ranging from sparsely to densely cluttered. We have
made both datasets publicly available to the research community. The proposed approaches demonstrate
robustness and efficiency across the complex indoor situations described above.
We believe that this work can play a significant role in the true understanding of an indoor scene and its
semantics. It could also enable richer interaction of robotic agents, and of humans, with their surrounding
environment when this information is made available to them.
Keywords: Semantic Segmentation, Indoor Scene, Space Estimation, Object Manipulation, Cognitive
Vision.