Abstract
Nonverbal speech sounds produced by human beings, such as emotional speech, paralinguistic sounds and expressive voices, differ from normal speech. Whereas normal speech conveys a linguistic message and has a clear articulatory description, these nonverbal sounds carry nonlinguistic information without any clear description of articulation. Moreover, these sounds are mostly unusual, irregular, spontaneous and nonsustainable. Examples of emotional speech are shouts and happy, angry or sad speech; examples of paralinguistic sounds are laughter, cry and cough. Expressive voices, such as the Noh voice or operatic singing, are trained voices used to convey intense emotions. Emotional speech, paralinguistic sounds and expressive voices differ in the degree of pitch changes. Another categorisation, based on voluntary control versus involuntary changes in the speech production mechanism, is also possible.
Production of nonverbal sounds occurs in short bursts of time and involves significant changes in the glottal source of excitation. Hence, the production characteristics of these sounds differ from those of normal speech, mostly in the vibration characteristics of the vocal folds. Associated changes in the characteristics of the vocal tract system are also possible. In some cases, such as trills in normal speech or shouts in emotional speech, the glottal vibration characteristics are further affected by the acoustic loading of the vocal tract system and by source-system coupling. Hence, the characteristics of these nonverbal sounds need to be studied from the speech production and perception points of view, to understand better how they differ from normal speech.
The representation of the excitation source component of the speech signal as a sequence of impulses has been of considerable interest in speech research over the past three decades. The presence of secondary impulses within a pitch period was also observed in some studies. This impulse-sequence representation was aimed mainly at achieving low bit rates in speech coding and higher voice quality in synthesized speech. However, its advantages and its role in the analysis of nonverbal speech sounds have not been explored much. Differences in the locations of these impulse-like pulses in the sequence, and in their relative amplitudes, possibly cause the differences among various categories of acoustic sounds. In nonverbal speech sounds, these impulse-like pulses also occur at rapidly changing or nearly random intervals, along with rapid or sudden changes in their amplitudes. Aperiodicity in the excitation component may be considered an important feature of expressive voices such as the 'Noh voice'. Characterising changes in pitch perception, which can be rapid in expressive voices, and extracting F0, especially in regions of aperiodicity, are major challenges that need to be investigated in detail.
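The source-filter view underlying this impulse-sequence representation can be illustrated with a small sketch. The following is not a method proposed in this work; it is a generic, simplified simulation in which an impulse train excites a single second-order resonance (a crude one-formant vocal tract). The jitter and shimmer parameters, the 120 Hz excitation rate and the 500 Hz resonance are all illustrative choices: with jitter and shimmer set to zero the excitation is periodic, as in modal voicing, while raising them mimics the randomly varying impulse locations and amplitudes described above.

```python
import numpy as np

def impulse_train(n, period, jitter=0.0, shimmer=0.0, rng=None):
    """Excitation as a sequence of impulses.

    jitter: relative randomness of the inter-impulse intervals;
    shimmer: relative randomness of the impulse amplitudes.
    Zero for both gives periodic (modal-voicing-like) excitation.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    e = np.zeros(n)
    t = 0.0
    while t < n:
        e[int(t)] += 1.0 + shimmer * rng.standard_normal()
        # Advance by a (possibly perturbed) pitch period; never step backwards.
        t += max(1.0, period * (1.0 + jitter * rng.standard_normal()))
    return e

def resonator(x, f, bw, fs):
    """Second-order all-pole filter: a crude single-formant 'vocal tract'."""
    r = np.exp(-np.pi * bw / fs)
    a1, a2 = -2.0 * r * np.cos(2.0 * np.pi * f / fs), r * r
    y = np.zeros_like(x)
    for i in range(len(x)):
        y[i] = x[i]
        if i >= 1:
            y[i] -= a1 * y[i - 1]
        if i >= 2:
            y[i] -= a2 * y[i - 2]
    return y

fs = 8000
# Periodic excitation (modal voicing) vs. aperiodic excitation.
modal = resonator(impulse_train(fs // 2, fs / 120.0), 500.0, 100.0, fs)
rough = resonator(impulse_train(fs // 2, fs / 120.0, jitter=0.2, shimmer=0.5),
                  500.0, 100.0, fs)
```

Comparing the inter-impulse intervals of the two excitations shows the intended contrast: the jittered train has a much larger spread of intervals than the periodic one.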
In this research work, the production characteristics of nonverbal speech sounds are examined using both the electroglottograph and acoustic signals. These sounds are examined in four categories, which differ in the periodicity (or aperiodicity) of the glottal excitation and in the rapidity of changes in pitch perception. The categories are: (a) normal speech in modal voicing, which includes the study of trill, lateral, fricative and nasal sounds; (b) emotional speech, which includes four loudness-level variations, namely soft, normal, loud and shouted speech; (c) paralinguistic sounds, such as laughter in speech; and (d) expressive voices, such as the Noh singing voice. The effects of source-system coupling and of the acoustic loading of the vocal tract system on the glottal excitation are also examined.
Signal processing methods such as zero-frequency filtering, zero-time liftering, the Hilbert transform and the group delay function are used for feature extraction. Existing methods such as linear prediction (LP) coefficients, Mel-frequency cepstral coefficients and the short-time Fourier spectrum are also used. New signal processing methods are proposed in this work, namely, modified zero-frequency filtering (modZFF), computation of dominant frequencies (FD) using the LP spectrum or the group delay function, and saliency computation (a measure of pitch perception). A time-domain impulse-sequence representation of the excitation source is proposed, which also takes into account pitch perception and the aperiodicity in expressive voices. Using this representation, a method is proposed for extracting F0 even in regions of subharmonics and aperiodicity, which is otherwise a challenging task. The results are validated using spectrograms, the saliency measure, perceptual studies and synthesis.
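For the zero-frequency filtering mentioned above, a minimal sketch of the published ZFF idea (successive integration through zero-frequency resonators, followed by local-mean trend removal) is given below, together with F0 estimation from the intervals between the detected epochs. This is not the modZFF method proposed in this work, whose modifications are not detailed here; the window length and the number of trend-removal passes are illustrative choices.

```python
import numpy as np

def zff_epochs(s, fs, win_ms=10.0, trend_passes=3):
    """Epoch (glottal closure instant) estimates via zero-frequency filtering."""
    # Difference the signal to remove any DC offset before integrating.
    x = np.diff(s, prepend=s[:1]).astype(np.float64)
    # A cascade of two zero-frequency resonators, i.e. four integrations.
    y = x
    for _ in range(4):
        y = np.cumsum(y)
    # Remove the resulting polynomial trend by repeatedly subtracting a
    # local mean taken over roughly one average pitch period.
    w = int(fs * win_ms / 1000) | 1          # odd window length in samples
    kernel = np.ones(w) / w
    for _ in range(trend_passes):
        y = y - np.convolve(y, kernel, mode="same")
    # Epochs are the negative-to-positive zero crossings of the ZFF output.
    epochs = np.where((y[:-1] < 0) & (y[1:] >= 0))[0] + 1
    # Discard crossings near the edges, where trend removal is unreliable.
    return epochs[(epochs > 2 * w) & (epochs < len(y) - 2 * w)]

def f0_from_epochs(epochs, fs):
    """Instantaneous F0: reciprocal of the interval between successive epochs."""
    return fs / np.diff(epochs)
```

On a synthetic impulse train with an 80-sample period at fs = 8000 Hz, the detected epoch intervals give F0 estimates close to 100 Hz; real speech, and especially the aperiodic excitation of expressive voices, is what motivates the modifications proposed in this work.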
The efficacy of the signal processing methods proposed in this work, and of the features and parameters derived from them, is also demonstrated through some applications. Three prototype systems are developed for the automatic detection of trills, shouts and laughter in continuous speech. These systems use