Abstract
Humans perceive speech as a sequence of discrete sounds. However, representing speech as a sequence of phones with well-defined, non-overlapping intervals is a challenging problem. This is because the vocal tract system (i.e., the vocal tract configuration and vocal fold vibration) varies continuously, leading to the co-articulation effect, in which the current phone is considerably influenced by its adjacent phones. Despite this co-articulation effect, there exist events called landmarks, which occur due to abrupt variations in the vocal tract system. Detecting these landmarks makes it possible to represent speech as a sequence of phones, which may help in building systems that process speech in a manner similar to humans.
Landmarks are time instants in the acoustic signal that are consistently correlated with major articulatory movements, such as the transition of the vocal tract from a more open to a more closed configuration and vice versa (i.e., a change in manner of articulation), and the transition from free vibration to complete cessation of vocal fold vibration and vice versa. These landmarks act as foci of information, so speech can be processed only around the landmarks instead of over the entire signal, thus reducing the amount of processing required. Moreover, the analysis around different landmarks can be carried out at different resolutions. Landmark detection is hierarchical, and hence more than one source of evidence can be obtained for making decisions. In this thesis, the hierarchy that exists among landmarks is exploited for vowel landmark detection (VLD): sonorant segmentation is performed first, and vowel landmarks are then detected within the sonorant regions.
A sonorant is a sound produced without a constriction strong enough to cause turbulent noise or a stoppage of airflow. Broad manner classes such as vowels, nasals and approximants are categorized as sonorants, whereas fricatives, stops and non-speech regions are treated as non-sonorants. Sonorant segmentation of speech signals is critical for developing automatic speech recognition (ASR) systems and audio search systems, and for the automatic segmentation of speech corpora. In this work, acoustic features based on the excitation source and vocal tract system characteristics of sonorant sounds are proposed for segmenting the sonorant regions in continuous speech. The features are based on the energy of the zero frequency resonator signal, the strength of excitation, and the dominant resonance frequency around epochs. An algorithm is developed to relate these features in a hierarchical manner using a knowledge-based approach. The performance of the proposed algorithm is studied on three different datasets, for varying levels of degradation.
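As a rough illustration of how epoch-level features of this kind can be computed, the sketch below derives epoch locations, strength of excitation, and the energy of the zero-frequency-filtered (ZFF) signal from a speech waveform using zero-frequency filtering. It is a minimal sketch, not the implementation used in this thesis: the window sizes, the number of trend-removal passes, and the function name are assumptions made for illustration.

```python
import numpy as np
from scipy.signal import lfilter


def zff_epoch_features(speech, fs, trend_win_ms=10.0, energy_win_ms=5.0):
    """Illustrative sketch of epoch-level excitation-source features based on
    zero-frequency filtering; window sizes are assumed typical values."""
    # Remove any DC offset by differencing the speech signal.
    x = np.diff(np.asarray(speech, dtype=np.float64), prepend=speech[0])

    # Cascade of two zero-frequency resonators: y[n] = 2*y[n-1] - y[n-2] + x[n].
    y = lfilter([1.0], [1.0, -2.0, 1.0], x)
    y = lfilter([1.0], [1.0, -2.0, 1.0], y)

    # Remove the slowly varying trend by repeated local-mean subtraction
    # (window of roughly one pitch period; assumed value).
    win = max(3, int(fs * trend_win_ms / 1000.0))
    kernel = np.ones(win) / win
    for _ in range(3):
        y = y - np.convolve(y, kernel, mode="same")

    # Epochs: positive-going zero crossings of the trend-removed (ZFF) signal.
    epochs = np.where((y[:-1] < 0) & (y[1:] >= 0))[0]

    # Strength of excitation: slope of the ZFF signal at each epoch.
    soe = y[epochs + 1] - y[epochs]

    # Energy of the ZFF signal in a short window around each epoch.
    half = max(1, int(fs * energy_win_ms / 2000.0))
    energy = np.array([np.mean(y[max(0, e - half):e + half + 1] ** 2)
                       for e in epochs])
    return epochs, soe, energy
```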
Since vowels are a subclass of sonorant sounds, they exhibit characteristics similar to those of other sonorants. Hence, the features used for sonorant segmentation are also considered for the detection of vowel landmarks. In addition, features that capture characteristics specific to vowels are considered for the VLD task. Using these features, a rule-based algorithm is developed for VLD. The performance of the proposed VLD algorithm is studied on three different databases, namely TIMIT (read speech), NTIMIT (channel-degraded speech) and the Switchboard corpus (conversational speech). The proposed algorithm is also tested on the TIMIT and NTIMIT datasets for different levels of noise degradation.
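The following toy sketch indicates what a rule-based landmark detector of this kind might look like: within frames marked as sonorant, local maxima of a feature contour are retained as vowel landmark candidates and then pruned with simple threshold and minimum-spacing rules. The specific rules, thresholds, and names here are hypothetical illustrations and do not reproduce the algorithm proposed in the thesis.

```python
import numpy as np


def detect_vowel_landmarks(feature, sonorant_mask, frame_rate,
                           min_gap_ms=60.0, rel_thresh=0.4):
    """Toy rule-based detector: `feature` is a frame-level contour (e.g. ZFF
    energy), `sonorant_mask` flags frames labelled sonorant, and `frame_rate`
    is the number of feature frames per second."""
    e = np.where(sonorant_mask, np.asarray(feature, dtype=float), 0.0)
    thresh = rel_thresh * e.max() if e.size and e.max() > 0 else 0.0
    min_gap = max(1, int(frame_rate * min_gap_ms / 1000.0))

    # Candidates: local maxima inside sonorant regions above a relative threshold.
    candidates = [i for i in range(1, len(e) - 1)
                  if e[i] > e[i - 1] and e[i] >= e[i + 1] and e[i] >= thresh]

    # Rule: keep only the strongest candidate within any `min_gap` frames.
    landmarks = []
    for i in sorted(candidates, key=lambda k: e[k], reverse=True):
        if all(abs(i - j) >= min_gap for j in landmarks):
            landmarks.append(i)
    return sorted(landmarks)
```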
Speech-laugh is a form of laughter that occurs simultaneously with speech and is common in natural conversation. Speech-laugh not only signifies the emotional state of a speaker, but also carries linguistic information. Traditional ASR systems treat both laughter and speech-laugh as paralinguistic elements, which results in a loss of information. Discriminating speech-laugh from laughter improves the accuracy of ASR systems; it also helps to identify the emotion expressed by the laughter, e.g., happiness or sarcasm. In this work, as an application of VLD, excitation source features extracted only in the vowel regions are analyzed for discriminating speech-laugh from laughter and from neutral speech.