Abstract
In the digital era, huge amounts of audio data are produced and consumed every day in a variety of languages. The increasingly widespread use of digital technology in our everyday lives has made speech data accessible to the masses, making it easy to record, preserve and reproduce digital content. The ability to search through this data thus becomes a valuable functionality for improving its accessibility. Unfortunately, the linear and non-deterministic nature of the speech signal limits our ability to retrieve information from such repositories efficiently. Providing fast and intelligent access to these large speech collections has therefore become a necessity to unlock their true potential as knowledge resources.
Spoken Term Detection (STD) is the task of detecting all occurrences of a query in spoken data. The term ``low resource'' may assume different meanings in different contexts. There are many languages for which no transcription of any kind is available for the spoken data; for such languages, no supervised acoustic or language models can be built. In other cases, sufficient resources may exist for a language in general, but may be scarce for the specific application or domain for which an STD system has to be developed. In such cases, acoustic models can probably be trained for basic acoustic units (e.g., subwords such as phonemes), but an appropriate language model might not exist to produce a word-level or sentence-level transcription. There are other issues, such as the presence of Out of Vocabulary (OOV) words. The lack of good indexing and search techniques imposes further limitations: using multiple ASR hypotheses usually inflates the index to an untenable size, and the time taken for search becomes so high that such systems are impractical for real-time use. Hence, better representations of spoken documents, efficient storage using appropriate indexing techniques, and fast search capabilities need to be explored, so that the huge quantity and variety of digital spoken data we generate every day become more accessible and resourceful.
In this thesis, two types of STD techniques are explored. In the first part, a query-by-example STD approach is presented for language-independent search. For this, four African languages are used for which no resources are available other than their spoken data; hence, no language-specific acoustic models can be developed to perform recognition. A Bag of Acoustic Words (BoAW) approach is used to efficiently represent and index these documents, allowing for highly scalable indexing and retrieval. A multi-stage search is performed on the indexed documents, and a very efficient variant of Dynamic Time Warping, called NS-DTW, is used to accurately locate query snippets within the documents.
In the second part, a text-based STD technique is presented for the English language. Here, it is assumed that phone-based acoustic models can be developed for the language, but that no language models are available for the particular data used. In the absence of good language models, subword representation and indexing techniques are explored. Subword representations of textual queries are searched through the database, which is indexed using phone, hyperphone and hybrid $N$-gram units. A multi-stage search technique is employed that successively reduces the search space: in the first stage, the classic Bag of Words technique is used, followed by Minimum Edit Distance re-ranking; in the last stage, an efficient Acoustic Keyword Spotter locates the precise position of the query within the spoken documents.
Using these techniques, we demonstrate that it is indeed feasible to perform spoken term detection in low-resource scenarios. Moreover, the results show that efficient and practical STD systems can be developed that search through several hours of spoken data in real time.