Abstract
DNNs (Deep Neural Networks) have found use in a wide variety of applications in recent years, and have grown ever larger and more resource-hungry over time. CNNs (Convolutional Neural Networks), which apply the convolution operation in successive layers, are widely used in computer vision tasks, and this stacking of convolutions makes them computationally expensive to run. Due to the large size and computation required to run modern models, it is difficult to use them in resource-constrained scenarios such as mobile and embedded devices. Moreover, the amount of data created and collected is growing at a rapid rate, and annotating large amounts of data for training DNNs is expensive. Approaches such as Active Learning (AL) seek to intelligently query samples to train on in an iterative fashion, but AL setups are themselves very costly to run, since large models must be fully trained many times in the process.
In this thesis, we explore methods to achieve extremely high speedups in both CNN model inference and training time for AL setups.
Various paradigms to achieve fast CNN inference have been explored in the past, two major ones
being binarization and pruning. Binarization involves quantizing the weights and/or inputs of the network from 32-bit full-precision floats into a {-1, +1} space, with the aim of achieving both compression (since a single bit occupies one thirty-second of the space of a 32-bit float) and speedups (since bitwise operations can be carried out much faster). Network pruning, on the other hand, tries to identify and remove redundant parts of the network in an unstructured (individual weights) or structured (channels/layers) manner to create sparse and efficient networks. While both of these paradigms have demonstrated great efficacy in achieving speedups and compression for CNNs, little work has been done on combining the two.
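To make the binarization idea concrete, the following minimal sketch (in NumPy) binarizes a convolutional weight tensor using the sign function together with a per-filter scaling factor, in the style of XNOR-Net; the function name and the choice of scaling are illustrative, not the exact scheme used later in the thesis.

import numpy as np

def binarize_filters(w):
    # w has shape (out_channels, in_channels, kH, kW).
    # Each weight is mapped to {-1, +1} via its sign; a per-filter
    # scaling factor alpha (mean absolute value) keeps the binarized
    # filter close to the original: w_i ~= alpha_i * sign(w_i).
    signs = np.where(w >= 0, 1.0, -1.0)
    alpha = np.abs(w).mean(axis=(1, 2, 3), keepdims=True)
    return signs, alpha

w = np.random.randn(64, 3, 3, 3).astype(np.float32)
signs, alpha = binarize_filters(w)
print("mean reconstruction error:", np.abs(w - alpha * signs).mean())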
We argue that these paradigms are complementary and can be combined to offer high levels of compression and speedup without any significant accuracy loss. Intuitively, weights/activations closer to zero have higher binarization error, making them good candidates for pruning. We propose a novel Structured Ternary-Quantized Network that retains the speedups of binary convolution algorithms while employing structured pruning, enabling the pruned parts of the network to be removed entirely post-training. Our approach outperforms previous works that attempt the same combination by a significant margin. Overall, our method brings up to 89x layer-wise compression over the corresponding full-precision networks, with only a 0.33% accuracy loss on CIFAR-10 with ResNet-18 at a 40% PFR (Prune Factor Ratio for filters), and 0.3% on ImageNet with ResNet-18 at a 19% PFR.
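As an illustration of the intuition above, the sketch below (a simplified re-implementation in NumPy, not the thesis's exact training procedure) ranks filters by L1 norm, zeroes out the fraction given by the PFR, and binarizes the survivors, yielding a ternary {-alpha, 0, +alpha} layer whose pruned filters can be dropped entirely after training.

import numpy as np

def structured_ternarize(w, pfr=0.4):
    # Per-filter L1 norms: filters closest to zero have the highest
    # binarization error and are pruned first.
    norms = np.abs(w).reshape(w.shape[0], -1).sum(axis=1)
    n_prune = int(round(pfr * w.shape[0]))
    pruned = np.argsort(norms)[:n_prune]

    # Binarize every filter to +/- alpha, then zero out the pruned ones,
    # giving a ternary {-alpha, 0, +alpha} weight tensor.
    signs = np.where(w >= 0, 1.0, -1.0)
    alpha = np.abs(w).mean(axis=(1, 2, 3), keepdims=True)
    w_q = alpha * signs
    w_q[pruned] = 0.0

    keep_mask = np.ones(w.shape[0], dtype=bool)
    keep_mask[pruned] = False
    return w_q, keep_mask

w = np.random.randn(64, 64, 3, 3).astype(np.float32)
w_q, keep = structured_ternarize(w, pfr=0.4)
print(keep.sum(), "of", w.shape[0], "filters kept")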
We also explore Active Learning, which is used in scenarios where unlabelled data is abundant but annotation costs make it infeasible to utilize all of the data for supervised training. AL methods initially train a model (referred to as the selection model) on a small pool of annotated data, and use its predictions on the rest of the unlabelled data to form a query for annotating more samples. This train-query process repeats until a sufficient amount of data is labelled. In practice, however, sample selection in AL setups takes a lot of time, as the selection model is fully re-trained on the labelled pool every iteration. We offer two improvements to the standard AL setup that bring down the overall training time required for sample selection significantly: first, the introduction of a "memory" that enables us to train the selection model on only a fraction of the samples every round, as opposed to the entire labelled dataset accumulated so far; and second, the use of fast-convergence techniques to reduce the number of epochs the selection model trains for.
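The skeleton below sketches a standard AL loop augmented with such a memory; annotate, train, and query stand in for the labelling oracle, the selection-model training routine, and the acquisition function, and all names and sizes here are illustrative rather than the thesis's exact configuration.

import random

def active_learning_loop(pool_size, annotate, train, query,
                         init_size=1000, query_size=1000,
                         rounds=10, memory_size=2000):
    unlabelled = set(range(pool_size))            # indices of unlabelled samples
    seed = random.sample(sorted(unlabelled), init_size)
    unlabelled -= set(seed)

    labelled = annotate(seed)                     # initial annotated pool
    model = train(None, labelled)                 # train selection model from scratch
    memory = list(labelled)

    for _ in range(rounds):
        # Ask the selection model which unlabelled samples to annotate next.
        queried = query(model, sorted(unlabelled), query_size)
        unlabelled -= set(queried)
        new_data = annotate(queried)

        # Train on the newly labelled samples plus a bounded replay "memory",
        # instead of fully re-training on the entire labelled pool.
        replay = random.sample(memory, min(memory_size, len(memory)))
        model = train(model, new_data + replay)
        memory.extend(new_data)

    return model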
Our proposed improvements can work in tandem with previous techniques such as the use of proxy models for selection, and the combined improvements bring more than 75x speedups in overall sample selection time to standard AL setups, making them feasible and easy to run in real-life scenarios where procuring large amounts of data is easy but labelling it is difficult.