Abstract
Speaker recognition (SR) is the automatic identification of individual speakers from their voices, typically representing a speaker's acoustic traits as a fixed-dimensional vector known as a speaker embedding. A standard speaker recognition system (SRS) consists of three key stages: training, enrollment, and recognition. In each stage, an acoustic feature extraction module derives essential acoustic characteristics from the raw speech signal. Commonly used acoustic features include the speech spectrogram, filter-bank energies, and Mel-frequency cepstral coefficients (MFCCs).
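For illustration, the following minimal sketch extracts these three feature types from a raw waveform using the librosa library; the file path and parameter values are hypothetical choices, not those used in this thesis.

    # Sketch: extracting common acoustic features from a raw speech signal.
    # Assumes librosa is installed; the file path is a placeholder.
    import numpy as np
    import librosa

    y, sr = librosa.load("speech.wav", sr=16000)   # raw waveform at 16 kHz

    # Magnitude spectrogram via the short-time Fourier transform (STFT).
    spec = np.abs(librosa.stft(y, n_fft=512, hop_length=160))

    # Mel filter-bank energies (40 bands is a common choice).
    fbank = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512,
                                           hop_length=160, n_mels=40)

    # Mel-frequency cepstral coefficients (MFCCs).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)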
During the training stage, a background model is trained to establish a mapping from training voices to embeddings. The traditional background model employs a Gaussian Mixture Model (GMM) to generate identity-vector (i-vector) embeddings, whereas more recent and promising background models leverage deep neural networks (DNNs) to generate deep embeddings such as the x-vector. In the enrollment stage, a voice spoken by the individual being enrolled is mapped to an enrollment embedding by the trained background model. In the recognition stage, the testing embedding of a given voice is first retrieved from the background model. The scoring module then measures the similarity between the enrollment and testing embeddings, and the decision module compares the resulting score against a decision threshold to determine whether the claimed identity of the speaker is accepted or rejected.
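To make the scoring and decision step concrete, the sketch below computes a cosine similarity between the enrollment and testing embeddings and applies a decision threshold; the embeddings are assumed to come from the trained background model, and the threshold value of 0.7 is purely illustrative.

    # Sketch of the scoring and decision step. The embeddings are assumed to
    # be produced by the trained background model; the threshold is illustrative.
    import numpy as np

    def cosine_score(enroll_emb: np.ndarray, test_emb: np.ndarray) -> float:
        # Cosine similarity between enrollment and testing embeddings.
        return float(np.dot(enroll_emb, test_emb) /
                     (np.linalg.norm(enroll_emb) * np.linalg.norm(test_emb)))

    def decide(enroll_emb: np.ndarray, test_emb: np.ndarray,
               threshold: float = 0.7) -> bool:
        # Accept the claimed identity if the similarity exceeds the threshold.
        return cosine_score(enroll_emb, test_emb) >= threshold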
The concept of the voiceprint is rapidly gaining prominence as an emerging biometric, primarily owing to its seamless integration with natural, human-centered Voice User Interfaces (VUIs). The fast progress of SRSs is intricately linked to the evolution of neural networks (NNs), with a particular emphasis on DNNs. With the strides made in deep learning, SR has also benefited and found extensive applications across hardware and software platforms.
However, NNs have been shown to be vulnerable to adversarial attacks, a challenge that needs to be addressed. Thus, even though users enjoy the convenience of authentication with SR services, these solutions can be deceived by adversarial examples. This vulnerability shows that SR faces genuine security threats and raises significant concerns about user privacy.
Adversarial attacks were first demonstrated on images, where an image classification model was successfully deceived by adversarial examples. Drawing inspiration from this progress in the image domain, there is growing interest in extending these techniques to the audio field. Convolutional neural networks (CNNs) in particular have proved unstable under artificially crafted perturbations that remain imperceptible to the human eye. Virtually every type of model, from CNNs to graph neural networks (GNNs), has shown vulnerability to adversarial examples, particularly in image classification.
Deep learning models typically receive audio input by converting the audio into a spectrogram for further processing. A spectrogram serves as a condensed representation of an audio input. Given its image-like nature, the audio spectrogram is frequently used as input to deep learning models, especially CNNs adapted for audio tasks, since CNN-based architectures were initially designed for image processing.
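A minimal sketch of this input pipeline follows, assuming PyTorch and librosa; the tiny CNN and the 40-way speaker classification head are illustrative, not the architecture evaluated in this thesis.

    # Sketch: feeding a log-Mel spectrogram to a CNN as a one-channel image.
    import numpy as np
    import librosa
    import torch
    import torch.nn as nn

    y, sr = librosa.load("speech.wav", sr=16000)            # placeholder path
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40)
    log_mel = librosa.power_to_db(mel)                      # image-like 2-D array

    x = torch.from_numpy(log_mel).float()[None, None]       # (batch, ch, H, W)

    cnn = nn.Sequential(
        nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(16, 40),                                  # 40 speaker classes
    )
    logits = cnn(x)                                         # speaker scores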
This thesis contributes to the assessment of the resilience of CNNs against adversarial attacks, a domain that has yet to be extensively investigated for end-to-end trained CNNs in speaker recognition. This examination is essential for sustaining the integrity and security of speaker recognition systems. Our study fills this gap by exploring variants of the iterative Fast Gradient Sign Method (FGSM) to carry out adversarial attacks. We note that even a vanilla iterative FGSM attack can alter the identity of each speaker sample to any other speaker within the LibriSpeech dataset.
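For reference, a minimal PyTorch sketch of the vanilla iterative FGSM is given below; model, the clean spectrogram x, and the label y are assumed given, and the L-infinity budget eps, step size alpha, and step count are illustrative. For a targeted attack that maps a sample to a chosen speaker, y would be replaced by the target label and the loss descended rather than ascended.

    # Sketch of the vanilla iterative FGSM (basic iterative method).
    # eps, alpha, and steps are illustrative hyperparameters.
    import torch
    import torch.nn.functional as F

    def iterative_fgsm(model, x, y, eps=0.01, alpha=0.002, steps=10):
        x_adv = x.clone().detach()
        for _ in range(steps):
            x_adv.requires_grad_(True)
            loss = F.cross_entropy(model(x_adv), y)
            grad, = torch.autograd.grad(loss, x_adv)
            # Ascend the loss, then project back into the eps-ball around x.
            x_adv = x_adv.detach() + alpha * grad.sign()
            x_adv = x + torch.clamp(x_adv - x, -eps, eps)
        return x_adv.detach()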
Additionally, we introduce adversarial attacks specific to Mel spectrogram features by (a) constraining the number of manipulated pixels, (b) confining alterations to certain frequency bands, (c) limiting changes to particular time segments, and (d) employing a substitute model to generate the adversarial sample. Through comprehensive qualitative and quantitative analyses, we illustrate the vulnerability and counterintuitive behavior of existing CNN-based speaker recognition systems, wherein the predicted speaker identity can be changed to that of any other speaker through imperceptible perturbations.
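One way to realize the frequency-band and time-segment constraints in (b) and (c) is to mask the gradient sign before each FGSM update, as in the sketch below; the mask construction and index arguments are illustrative, not the exact implementation used in this thesis.

    # Sketch: restricting the perturbation to chosen frequency bands or time
    # segments by zeroing the update outside a mask. Indices are illustrative.
    import torch

    def make_mask(shape, freq_band=None, time_span=None):
        # shape = (batch, channel, n_mels, n_frames) of the Mel spectrogram.
        mask = torch.zeros(shape)
        f0, f1 = freq_band if freq_band is not None else (0, shape[2])
        t0, t1 = time_span if time_span is not None else (0, shape[3])
        mask[:, :, f0:f1, t0:t1] = 1.0
        return mask

    # Inside the iterative FGSM loop, the masked update becomes:
    #   x_adv = x_adv.detach() + alpha * mask * grad.sign()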