Abstract
Speech often carries expressive and nonverbal sounds in addition to the message to be conveyed. The objective of this thesis is to analyze expressive voices to determine the features needed for representing, describing and discriminating different types of expressions. Here, expression refers to the information in the signal produced by the speech production mechanism that characterizes different phonations, emotions, singing (normal and artistic) and nonverbal sounds. Most of these expressive features result from variations in the excitation component of the speech production process, although it is difficult to isolate the excitation component from the dynamic vocal tract system component. Expressiveness is also reflected in the dynamics of the excitation component. A few existing methods for extracting the excitation source features rely on inverse filtering, which is known to have limitations for high-pitched voices such as emotional speech and singing voice. In this work, we propose a representation of the excitation source in terms of an impulse-like sequence to capture variations in the source. The significance of the excitation source is studied for three types of expressive voices, namely, phonation types (in speech and singing), emotional speech and artistic (Noh) singing voice.
The primary mode of excitation is the vibration of the vocal folds at the glottis. Even though excitation information is present throughout the glottal cycle, it is considered significant only when there is a large change within a short time interval, i.e., when it is impulse-like. This impulse-like characteristic is usually exhibited around the instant of glottal closure within each glottal cycle. Apart from the primary impulse-like excitation at the instant of glottal closure, other major and minor excitation impulses may occur at glottal opening and in creaky voice. These other excitation impulses become prominent in aperiodic voices, such as artistic (Noh) singing voice. An approach for extracting the excitation source information in terms of an impulse-like sequence is proposed using the single frequency filtering (SFF) method. To expose the impulse-like discontinuities in the signal, the properties of an impulse in the magnitude and phase components are explored. An impulse-like characteristic is derived using the spectral variance computed at each instant of time, and the instantaneous frequency (IF) of the filtered signal at each frequency. Both the slope of the variance plot and the sum of the IFs computed over the SFF output signals show discontinuities at the locations of the impulse-like excitations. Extraction of the impulse-like sequence is assessed using synthetic and natural speech signals. The SFF-based method also brings out the minor impulse-like excitations within each glottal cycle.
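The spectral variance idea described above can be illustrated with a minimal sketch. It assumes the commonly published form of SFF: the differenced signal is shifted in frequency so that the frequency of interest lands at half the sampling rate, then passed through a single-pole filter with its pole at z = -r, and the magnitude of the output is taken as the envelope. The function names, the value r = 0.995 and the chosen frequency grid are illustrative assumptions, not details taken from the thesis, which may also normalize or weight the envelopes before computing the variance.

```python
import numpy as np

def sff_envelopes(x, fs, freqs, r=0.995):
    """Envelopes e_k[n] of a single-pole SFF filter bank (illustrative sketch).

    x     : 1-D signal
    fs    : sampling rate in Hz
    freqs : frequencies of interest in Hz
    r     : pole radius (assumed value; close to, but inside, the unit circle)
    """
    x = np.asarray(x, dtype=float)
    d = np.diff(x, prepend=x[0])              # first difference of the signal
    n = np.arange(d.size)
    # Shift each frequency f_k to fs/2: w_k = pi - 2*pi*f_k/fs
    w = np.pi - 2.0 * np.pi * np.asarray(freqs, dtype=float) / fs
    shifted = d[None, :] * np.exp(1j * w[:, None] * n[None, :])
    y = np.zeros_like(shifted)
    prev = np.zeros(w.size, dtype=complex)
    for t in range(d.size):                   # y_k[n] = -r * y_k[n-1] + x_k[n]
        prev = -r * prev + shifted[:, t]
        y[:, t] = prev
    return np.abs(y)                          # shape: (len(freqs), len(x))

def spectral_variance(env):
    """Variance across frequency at each time instant."""
    return env.var(axis=0)
```

For a signal containing a single impulse, the variance returned by `spectral_variance` is zero before the impulse, jumps just after it, and then decays, so its slope changes abruptly at the impulse location.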
Two important features of excitation, namely, the glottal closure instant (GCI) and the glottal open region (GOR), are determined by exploiting the changes in the spectral characteristics of the SFF spectra and of the HNGD spectra of the zero time windowing (ZTW) method. A spectral flatness parameter derived from the SFF spectra highlights the impulse-like characteristics at the GCI, while the spectral flatness parameter derived from the HNGD spectra highlights the GOR. A new approach is proposed for detecting GCIs in emotional speech. The approach uses the recently proposed modified zero frequency filtering (mZFF) method and the impulse-like sequence representation obtained using the SFF method. Using the features derived from three signal processing methods, namely, zero frequency filtering (ZFF), ZTW and SFF, expressive voices such as phonation types, emotional speech and artistic (Noh) singing voice are analyzed. Features such as the locations of GCIs, the strength of the glottal closure, and the different phases (closed and open) of glottal vibration and their spectral correlates are used for the analysis and detection of different phonation types in speech and singing. Cepstral coefficients derived from the ZTW (ZTWCC) and SFF (SFFCC) methods are used for discriminating different phonation types. Excitation features derived around the GCIs using the ZFF method and linear prediction (LP) analysis are used to study the role of excitation information in emotional speech. One of the main features of the artistic (Noh) singing voice is the aperiodicity in the vocal fold vibration. The aperiodicity characteristics are studied in detail using the spectral characteristics of the impulse-like sequence, which show harmonics and subharmonics of the fundamental frequency corresponding to pitch. The significance of the impulse-like sequence representation is also studied through a saliency measure, which captures the pitch perception information in the aperiodic regions.
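As an illustration of the spectral flatness parameter mentioned above, the following sketch computes flatness at each time instant as the ratio of the geometric mean to the arithmetic mean of the power spectrum. This is the standard textbook definition and is only assumed here; the thesis may define its parameter differently. The function name and the small constant `eps` are illustrative choices.

```python
import numpy as np

def spectral_flatness(spectra, eps=1e-12):
    """Geometric-mean / arithmetic-mean flatness per time instant.

    spectra : array of shape (num_freqs, num_frames) holding magnitude
              spectra, e.g. SFF envelopes or HNGD spectra (assumed layout)
    eps     : small constant to avoid log(0) (assumed value)
    """
    p = np.asarray(spectra, dtype=float) ** 2 + eps   # power at each frequency
    geometric_mean = np.exp(np.mean(np.log(p), axis=0))
    arithmetic_mean = np.mean(p, axis=0)
    return geometric_mean / arithmetic_mean           # near 1 = flat, near 0 = peaky
```

A spectrum that is uniform across frequency, as produced by an ideal impulse, gives a flatness close to 1, while a spectrum dominated by a few peaks gives a value much smaller than 1.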
The key contributions of the thesis are:
• Analysis of expressive voices using the recently proposed signal processing meth