Abstract
Supervised deep learning methods hinge on the assumption that the training and test data are sampled from the same distribution. However, this assumption rarely holds in realistic scenarios, leading to poor performance when models are deployed in domains whose data distribution differs from that of the training set. Unsupervised Domain Adaptation (UDA) tackles the problem of aligning the data distributions of a labelled source domain and an unlabelled target domain. In contrast, Semi-supervised Domain Adaptation (SSDA) assumes a partially labelled target domain, a more realistic scenario for many computer vision tasks. Domain randomization is a popular approach wherein models are trained on synthetically generated data. With complete control over the synthetic data generation process, domain randomization introduces randomness into various properties of both the objects and the scene. Similar to data augmentation in deep learning, the goal is to induce invariance to non-causal features of the data and to nudge the model towards learning the causal correlations relevant to the task at hand.
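As a rough illustration, domain randomization can be realized by sampling non-causal scene properties at random during data generation. The sketch below is a minimal example; `render_scene` and the specific randomized properties are hypothetical placeholders, not the generation pipeline used in this thesis.

```python
# Minimal sketch of domain randomization during synthetic data generation.
# The randomized properties and ranges below are illustrative examples.
import random

def sample_randomized_scene(object_models, textures, backgrounds):
    """Sample non-causal scene properties at random so the model
    cannot rely on them and must learn causal object features."""
    params = {
        "object": random.choice(object_models),
        "texture": random.choice(textures),           # randomize appearance
        "background": random.choice(backgrounds),     # randomize context
        "light_intensity": random.uniform(0.2, 2.0),  # randomize lighting
        "camera_azimuth": random.uniform(0.0, 360.0),
        "camera_elevation": random.uniform(-10.0, 60.0),
    }
    return params  # fed to a renderer, e.g. render_scene(**params)
```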
In this thesis, we explore and study various approaches to domain adaptation. First, we present image-level domain adaptation methods, which use image-level manipulations or transformations to achieve domain invariance. We begin by analyzing the domain randomization approach in an object detection setting, training a Faster R-CNN model on synthetically generated data. Domain randomization boosts the performance of object detection models: a model trained entirely on synthetic data outperforms one trained on real data, and with fine-tuning, the performance of the synthetically trained model increases drastically.
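A minimal sketch of this two-stage recipe, assuming torchvision's Faster R-CNN implementation and hypothetical data loaders (`synthetic_loader`, `real_loader`), might look as follows; the class count and hyperparameters are placeholders.

```python
# Hedged sketch: train on domain-randomized synthetic data, then fine-tune.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

NUM_CLASSES = 5  # placeholder: task-specific classes + background

model = fasterrcnn_resnet50_fpn(num_classes=NUM_CLASSES)
optimizer = torch.optim.SGD(model.parameters(), lr=5e-3, momentum=0.9)

def train(model, loader, epochs):
    model.train()
    for _ in range(epochs):
        for images, targets in loader:   # targets: boxes + labels per image
            losses = model(images, targets)  # dict of detection losses
            loss = sum(losses.values())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# Stage 1: train entirely on domain-randomized synthetic data.
# train(model, synthetic_loader, epochs=10)
# Stage 2: fine-tune on real data.
# train(model, real_loader, epochs=3)
```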
Next, we extend work on domain adaptation in the frequency domain, wherein image-level adaptation is performed on the frequency components of the images. To this end, we propose new strategies for combining the frequency components, including masking techniques that take the frequency of the components into account during the combination process. Fourier domain adaptation techniques have seen some success in image segmentation tasks from synthetic domains such as GTA5 [1] and SYNTHIA [2] to realistic domains such as Cityscapes [3]; however, these domains contain semantically similar images. For the synthetic dataset that we use, we find that these frequency-domain-based stylization methods do not improve performance over the domain randomization approach.
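For context, the baseline spectral combination underlying these methods can be sketched as follows. This is a minimal NumPy illustration of the FDA-style low-frequency amplitude swap of Yang and Soatto, not the combination and masking strategies proposed in this thesis; the band parameter `beta` is illustrative.

```python
# FDA-style stylization: replace the low-frequency amplitudes of a source
# image with those of a target image while keeping the source phase.
import numpy as np

def fda_stylize(src, trg, beta=0.05):
    """src, trg: float arrays of shape (H, W, C); returns stylized src."""
    fft_src = np.fft.fftshift(np.fft.fft2(src, axes=(0, 1)), axes=(0, 1))
    fft_trg = np.fft.fftshift(np.fft.fft2(trg, axes=(0, 1)), axes=(0, 1))

    amp_src, pha_src = np.abs(fft_src), np.angle(fft_src)
    amp_trg = np.abs(fft_trg)

    # Binary low-frequency mask, centred after fftshift.
    h, w = src.shape[:2]
    b = int(min(h, w) * beta)
    cy, cx = h // 2, w // 2
    amp_src[cy - b:cy + b, cx - b:cx + b] = amp_trg[cy - b:cy + b, cx - b:cx + b]

    fft_out = amp_src * np.exp(1j * pha_src)
    out = np.fft.ifft2(np.fft.ifftshift(fft_out, axes=(0, 1)), axes=(0, 1))
    return np.real(out)
```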
Finally, we present two novel methods for domain adaptation based on feature-level alignment. One of the primary challenges in SSDA is the skewed ratio between the number of labelled source and target samples, which biases the model towards the source domain. Recent works in SSDA show that aligning only the labelled target samples with the source samples can lead to incomplete alignment of the target domain with the source domain. In our first approach, we train the source and target feature spaces separately. To ensure that the target feature space generalizes well, we employ semi-supervised methods that leverage both the labelled and unlabelled samples. Domain Adapters, which are parametric functions, are then trained to learn the feature-level transformation from the target domain to the source domain. During inference, we extract features with the target domain's feature extractor and pass them to the Domain Adapter for that target-source pair; the transformed feature representation in the source space is then fed to the source classifier. We show that keeping the feature extractors separate is advantageous when the domain gap between the source and target domains is significant.
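A minimal sketch of this inference path is given below; the MLP form of the adapter, the feature dimension, and the stand-in backbone are assumptions for illustration, not the exact architecture used in the thesis.

```python
# Hedged sketch of the Domain Adapter inference path: target features are
# extracted by the target-domain backbone, mapped into the source feature
# space by a small parametric adapter, and classified by the source head.
import torch
import torch.nn as nn

feat_dim, num_classes = 512, 10  # placeholders

target_extractor = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim))
source_classifier = nn.Linear(feat_dim, num_classes)

domain_adapter = nn.Sequential(   # parametric target -> source feature map
    nn.Linear(feat_dim, feat_dim),
    nn.ReLU(),
    nn.Linear(feat_dim, feat_dim),
)

@torch.no_grad()
def predict_target(x):
    f_t = target_extractor(x)      # target-domain features
    f_s = domain_adapter(f_t)      # transform into the source feature space
    return source_classifier(f_s)  # classify with the source head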
In our second approach, we present SPI, which leverages contrastive losses to learn a semantically meaningful and domain-agnostic feature space using the labelled samples from both domains. To mitigate the challenges caused by the skewed label ratio, we pseudo-label the unlabelled target samples by comparing their feature representations to those of the labelled samples from both the source and target domains. Furthermore, to increase the support of the target domain, these potentially noisy pseudo-labels are gradually injected into the labelled target dataset over the course of training. Specifically, we use a temperature-scaled cosine similarity measure to assign a soft pseudo-label to each unlabelled target sample, and we maintain an exponential moving average of the soft pseudo-labels for each unlabelled sample. These pseudo-labels are progressively injected into (or removed from) the labelled target dataset based on a confidence threshold to supplement the domain alignment. Finally, we use a supervised contrastive loss on the labelled and pseudo-labelled datasets to align the source and target distributions.
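The pseudo-labelling step can be sketched as follows; the temperature, EMA momentum, and confidence threshold values are illustrative, and `prototypes` is assumed to hold per-class means of the labelled source and target features.

```python
# Hedged sketch of the SPI pseudo-labelling step: compare an unlabelled
# target feature to class prototypes via temperature-scaled cosine
# similarity, smooth the soft label with an EMA, and inject it into the
# labelled target set once its confidence crosses a threshold.
import torch
import torch.nn.functional as F

def soft_pseudo_label(feat, prototypes, tau=0.05):
    """feat: (D,); prototypes: (C, D) class means of labelled features."""
    sims = F.cosine_similarity(feat.unsqueeze(0), prototypes, dim=1)  # (C,)
    return F.softmax(sims / tau, dim=0)  # temperature-scaled soft label

def update_ema(ema_label, new_label, momentum=0.9):
    """Exponential moving average of the soft pseudo-label."""
    return momentum * ema_label + (1.0 - momentum) * new_label

def maybe_inject(ema_label, threshold=0.8):
    """Return the hard label if confident enough, else None (remove/skip)."""
    conf, cls = ema_label.max(dim=0)
    return int(cls) if conf >= threshold else None
```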