Abstract
The major goal of this thesis is to study dialectal variations and to improve the performance of speech recognition using embeddings derived from an improved dialect classification system. Initial studies focused
on improving the dialect classification system for three major dialects of English (AU: Australian, UK: British,
and US: American).
To improve the performance of the dialect classification system, and based on an analysis of dialectal
variations, advanced signal processing approaches were investigated for dialect classification
with a traditional i-vector system. Features that provide high spectral resolution help to capture
subtle differences between dialects. Therefore, this thesis proposes single frequency filtering (SFF) and
zero-time windowing (ZTW) based features, which provide high spectral resolution without compromising
temporal resolution. Along with frame-level spectral resolution, longer temporal context also contributes
to dialect classification. Therefore, approaches that enhance the temporal context of the proposed features
(SFF and ZTW), such as delta and double-delta coefficients (∆+∆∆) and shifted delta coefficients (SDCs),
were evaluated. It is observed that the dialect classification system gives promising performance with
the proposed features with temporal context provided by ∆+∆∆ and SDCs. Further, signal processing
approaches that can provide long temporal summarization, such as frequency domain linear prediction
(FDLP), are proposed for dialect classification. Experiments with FDLP-based features show
that long temporal summarization is advantageous for discriminating
dialects. Thus, both the signal processing approaches that provide high spectral resolution (SFF and ZTW) and
those that provide long temporal summarization (FDLP) are shown to give promising performance in dialect
classification when compared to commonly used STFT-based features.
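As an illustration of how temporal context can be appended to frame-level features, the following is a minimal sketch of ∆+∆∆ and SDC computation. It is not the thesis implementation: the function names, the simple two-point difference used for the delta, the edge padding, and the default d=1, P=3, k=7 configuration are all assumptions made for the example.

```python
import numpy as np

def delta(feats, d=1):
    """Two-point difference c[t+d] - c[t-d] per frame (assumed delta form).
    feats has shape (frames, coefficients); edges are padded by repetition."""
    padded = np.pad(feats, ((d, d), (0, 0)), mode="edge")
    return padded[2 * d:] - padded[:-2 * d]

def add_deltas(feats, d=1):
    """Stack static, delta, and double-delta coefficients frame by frame."""
    d1 = delta(feats, d)
    d2 = delta(d1, d)
    return np.hstack([feats, d1, d2])

def shifted_delta(feats, d=1, p=3, k=7):
    """Shifted delta coefficients: for each frame t, concatenate the k delta
    vectors taken at frames t, t+p, ..., t+(k-1)p."""
    num_frames = feats.shape[0]
    dl = delta(feats, d)
    # Repeat the last delta frame so that every shifted index stays in range.
    padded = np.pad(dl, ((0, (k - 1) * p), (0, 0)), mode="edge")
    blocks = [padded[i * p : i * p + num_frames] for i in range(k)]
    return np.hstack(blocks)
```

With these assumed defaults, each frame of an n-dimensional feature becomes a 3n-dimensional vector under ∆+∆∆ and a 7n-dimensional vector under SDC, so SDC covers a much longer span of frames than ∆+∆∆ does.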
Further, given the promising performance of deep neural networks in classification tasks and their ability
to model longer temporal context, architectures ranging from simple (CNN) to advanced (TCN, TDNN, and
ECAPA-TDNN), each providing a different temporal context, are investigated. It is observed that the advanced
neural network architectures improve the performance of dialect classification. Further, on
evaluating the best of both stages (features and classifier), it is observed that ECAPA-TDNN performs better
with the proposed SFF features.
Dialectal variations in speech degrade the performance of multi-dialectal automatic speech
recognition (ASR) systems. The embeddings derived from the best dialect classification system are applied
to multi-dialect ASR (with AU, UK, and US dialects) and are found to improve the performance of the ASR
system.
In most studies, Indian English is treated as a single dialect even though its speakers have different native
languages. Accordingly, foreign dialectal embeddings were included and found to improve the performance of the
ASR system. The observations made in dialect classification with the major dialects of English are extended to
foreign dialect classification (i.e., native language (L1) identification), and the embeddings extracted from the
improved dialect classification system are incorporated into the Indian English ASR system to improve
its performance.