Abstract
Automatic syllable stress detection is a key module in computer-assisted language learning systems. Numerous studies in the literature address this task using different knowledge-based prosodic features, and various statistical machine learning and deep learning models have been explored on top of such features. However, the acoustic parameters used to compute knowledge-based features may not always capture the stress phenomenon; hence, these features do not always generalize or scale well. Recently, rapidly emerging self-supervised learning based representations have been outperforming state-of-the-art knowledge-based features across many speech applications, and they also allow models to be built in an end-to-end fashion. In this work, we explore the use of self-supervised representations (Wav2Vec-2.0) for syllable stress detection and compare their performance with state-of-the-art knowledge-based features. Further, we use our recently proposed explicit representation learning framework, modeled by jointly optimizing a variational autoencoder (VAE) and a deep neural network (DNN), for stress detection. We analyze the performance of the representation learning framework with two state-of-the-art classifiers: support vector machines (SVMs) and a simple DNN. We conduct experiments on two non-native English speakers' datasets from the ISLE corpus, i.e., German (GER) and Italian (ITA). From the analysis, we observe that the classification accuracy for syllable stress detection using self-supervised representations improves significantly over knowledge-based features, by 3.2% and 2.7% on GER and ITA, respectively. From the t-SNE plots, we observe that the representations learned by the explicit representation learning framework with the VAE show better discrimination between stressed and unstressed syllables than the representations learned implicitly by a simple DNN.
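As a concrete illustration of the feature-extraction step described above, the following minimal sketch pools frame-level Wav2Vec-2.0 representations over syllable spans to obtain one vector per syllable. The paper does not specify its exact pipeline; the checkpoint name, the input file, and the syllable boundaries here are illustrative assumptions, and the HuggingFace Transformers library stands in for whichever toolkit the authors used.

```python
# Minimal sketch (not the authors' exact pipeline): extract Wav2Vec-2.0
# frame-level representations and mean-pool them over each syllable's
# time span to get one fixed-size feature vector per syllable.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Assumed checkpoint; the paper does not name the exact pretrained model.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

waveform, sr = torchaudio.load("utterance.wav")  # hypothetical input file
if sr != 16000:  # Wav2Vec-2.0 expects 16 kHz audio
    waveform = torchaudio.functional.resample(waveform, sr, 16000)

inputs = extractor(waveform.squeeze(0).numpy(), sampling_rate=16000,
                   return_tensors="pt")
with torch.no_grad():
    # last_hidden_state: (1, T, 768) -> (T, 768) frame-level representations
    frames = model(**inputs).last_hidden_state.squeeze(0)

# Hypothetical syllable boundaries in seconds (e.g., from a forced aligner).
# Wav2Vec-2.0 emits roughly one frame every 20 ms (~50 frames/s).
syllable_spans = [(0.12, 0.31), (0.31, 0.55)]
frame_rate = 50.0
syllable_feats = []
for start, end in syllable_spans:
    lo, hi = int(start * frame_rate), max(int(end * frame_rate), int(start * frame_rate) + 1)
    syllable_feats.append(frames[lo:hi].mean(dim=0))  # one 768-dim vector per syllable
```

The resulting per-syllable vectors could then feed either classifier compared in the paper (an SVM or a simple DNN), or serve as the input to the jointly optimized VAE-DNN representation learning framework.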