Abstract
Many real-world speech processing applications depend on the accurate estimation of epoch locations. In a real-world scenario, speech may contain multiple speakers or multiple emotions within a single utterance. In the literature, however, the performance of epoch extraction algorithms has been evaluated only on speech utterances containing a single speaker or a single emotion. In this study, speech utterances from different speakers and from different emotions are combined to simulate multi-speaker and multi-emotion scenarios, respectively. Five state-of-the-art epoch extraction methods are evaluated under the multi-speaker, multi-emotion, and voice disorder scenarios. We also analyse the performance of these methods after applying a region-wise approach to them. The CMU Arctic and Berlin emotional speech databases are used to simulate the multi-speaker and multi-emotion scenarios, respectively, and the Saarbruecken voice disorder database is used for the experiments in the voice disorder scenario. The results indicate that the performance of state-of-the-art epoch extraction methods that depend on the pitch period degrades in terms of false alarm rate (FAR) in these three scenarios. Applying the region-wise approach to these methods for epoch extraction reduces the FAR by 5% to 8% in these real-world scenarios. We also observed that the performance of dynamic-programming-based methods remains unchanged even after applying the region-wise approach.
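The abstract states that utterances from different speakers (or different emotions) are combined to simulate the multi-speaker and multi-emotion scenarios. The following is a minimal sketch of one way such a stimulus could be constructed by concatenating two CMU Arctic utterances; the file names are hypothetical placeholders, and the paper's exact construction (segment ordering, lengths, normalisation) may differ.

```python
# Sketch: simulate a multi-speaker utterance by concatenating utterances
# from two different CMU Arctic speakers. File names below are hypothetical.
import numpy as np
from scipy.io import wavfile


def load_normalised(path):
    """Read a WAV file and scale its samples to the range [-1, 1]."""
    rate, x = wavfile.read(path)
    x = x.astype(np.float64)
    peak = np.max(np.abs(x))
    return rate, x / peak if peak > 0 else x


rate_a, speaker_a = load_normalised("arctic_bdl_a0001.wav")  # hypothetical path
rate_b, speaker_b = load_normalised("arctic_slt_a0001.wav")  # hypothetical path
assert rate_a == rate_b, "utterances must share a sampling rate before concatenation"

# Concatenating the two utterances yields a single multi-speaker utterance;
# the same idea applies to utterances spoken with different emotions
# (multi-emotion scenario).
multi_speaker = np.concatenate([speaker_a, speaker_b])
wavfile.write("multi_speaker_sim.wav", rate_a,
              (multi_speaker * 32767).astype(np.int16))
```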