Track: VoiceTech |
Evaluating Speech Separation Through Pre-trained Deep Neural Network Models |
This presentation focuses on speaker separation: recovering individual speakers from a mixture of voices or from background noise, commonly known as the "cocktail party problem." The objective is to separate the two original audio signals from their mixture and to analyze which features of those signals contribute to successful separation. Features are extracted from the original recordings, and their impact on the model's ability to separate mixed audio streams is evaluated. The feature values serve as predictor variables for several models, including Logistic Regression, Decision Trees, SVM, XGBoost, and AdaBoost, with the goal of identifying the features that contribute most to better separation. The study begins by selecting 400 audio streams from the VoxCeleb dataset and combining them in pairs to form 200 mixed utterances. The pre-trained SpeechBrain model sepformer-whamr is used to separate each mixture into two outputs that closely resemble the original sources. A feature list is generated from the 400 chosen recordings, and permutation feature importance and SHAP values are used to determine which features have the greatest effect on the model's ability to distinguish between the sources in a mixed recording. The hypothesis of the study is that the features contributing most to effective separation are consistent across datasets. To test this hypothesis, 1,000 audio streams are obtained from the Mozilla Common Voice dataset, and the same experimental methodology is applied. The results indicate that the features identified on the VoxCeleb dataset are consistent across datasets and aid in separating the audio streams of the Mozilla Common Voice dataset. |
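As a rough illustration of the permutation feature importance step mentioned in the abstract, the sketch below shuffles one feature column at a time and measures how much a fitted classifier's accuracy drops. Everything in it is an assumption made for illustration: the data is synthetic, the two columns stand in for hypothetical audio features, the binary label stands in for "separated well or not", and a simple nearest-centroid classifier replaces the models named in the abstract. It is not the presenter's pipeline.

```python
import random


def fit_centroids(X, y):
    """Fit a nearest-centroid classifier: one mean feature vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, X, y):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(centroids, x) == t for x, t in zip(X, y)) / len(y)


def permutation_importance(centroids, X, y, n_repeats=10, seed=0):
    """Mean accuracy drop per feature when that feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(centroids, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [x[j] for x in X]
            rng.shuffle(column)  # break the link between feature j and the label
            X_perm = [x[:j] + [v] + x[j + 1:] for x, v in zip(X, column)]
            drops.append(baseline - accuracy(centroids, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Synthetic stand-in data (assumption): column 0 determines the label,
# column 1 is pure noise. Neither is a real audio feature.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(200)]
y = [1 if x[0] > 0 else 0 for x in X]

model = fit_centroids(X, y)
importances = permutation_importance(model, X, y)
print(importances)
```

In this toy setup the informative column should receive a much larger importance than the noise column. For brevity the sketch scores the model on the data it was fitted to; a held-out split is the more usual choice.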
|
Presentation Notes |
Prabhakar-Evaluating-Speech-Separation-Through-Pre-trained-Deep-Neural-Network-Models2.pptx |