Abstract: Speaker-conditioned target speaker extraction algorithms aim to extract a target speaker from a mixture of multiple speakers using additional information about that speaker. Previous studies have evaluated the performance of such algorithms using either instrumental measures or subjective assessments with normal-hearing or hearing-impaired listeners. Notably, one study employing a quasi-causal algorithm reported significant intelligibility improvements for both normal-hearing and hearing-impaired listeners, while another demonstrated that a fully causal algorithm could enhance speech intelligibility and reduce listening effort for normal-hearing listeners. Building on these findings, this study presents an in-depth subjective assessment of two fully causal deep neural network-based speaker-conditioned target speaker extraction algorithms with hearing-impaired listeners, both without hearing loss compensation (unaided) and with linear hearing loss compensation (aided). Three subjective performance measures were used to cover a broad range of listening conditions: paired comparisons, speech recognition thresholds, and categorically scaled perceived listening effort. Results from fifteen hearing-impaired listeners showed that one algorithm significantly reduced listening effort and improved intelligibility compared to both the unprocessed stimuli and the other algorithm. The data also suggest that hearing-impaired listeners benefit more than normal-hearing listeners in terms of listening effort (for both male and female interfering speakers) and speech recognition thresholds (especially in the presence of female interfering speakers), and that hearing loss compensation (linear amplification) is not required to obtain a benefit from the algorithms.
[Four sets of audio examples, each comparing the three processing conditions: Unprocessed, Algo-1, and Algo-2.]
The results shown here were obtained on the test set of the LibriSpeech dataset (https://www.openslr.org/12).
1. Paired comparisons: percentage of wins in the paired comparison tests for each pair of the three processing conditions (unprocessed, Algo-1, and Algo-2) for normal-hearing (NH) and hearing-impaired (HI) listeners, for stimuli with one (F/M) or two (FF/MM) interfering speakers.
2. Speech recognition thresholds: SRTs and corresponding benefits for NH and HI listeners for unprocessed stimuli and for stimuli processed with Algo-1 and Algo-2. FF and MM denote two female and two male interfering speakers, respectively.
3. Perceived listening effort: perceived listening effort ratings and corresponding benefits for NH and HI listeners for unprocessed stimuli with one (F/M) or two (FF/MM) interfering speakers and for stimuli processed with Algo-1 and Algo-2. F and M denote the gender of the interfering speaker(s): F, female; M, male.
4. Participant-specific SRT distributions for NH and HI listeners (unaided and aided) with male and female interfering speakers across the three processing conditions (Unprocessed, Algo-1, and Algo-2). Violin plots illustrate the score distribution and density, with boxes indicating the interquartile range and median. Individual data points on each violin represent individual participant scores, plotted as a swarm plot to show the spread of scores.
5. Participant-specific listening effort benefit distributions for NH and HI listeners (unaided and aided) with one interfering speaker, comparing Algo-1 and Algo-2 against the unprocessed condition at each SNR. Violin plots illustrate the score distribution and density, with boxes indicating the interquartile range and median. Individual data points are omitted for clarity due to the six SNR groupings.
6. Participant-specific listening effort benefit distributions for NH and HI listeners (unaided and aided) with two interfering speakers, comparing Algo-1 and Algo-2 against the unprocessed condition at each SNR. Violin plots illustrate the score distribution and density, with boxes indicating the interquartile range and median. Individual data points are omitted for clarity due to the six SNR groupings.
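The violin-with-swarm presentation used in the figures above can be sketched as follows. This is an illustrative example, not the authors' plotting code: the SRT values are synthetic, and the group means, spreads, and file name are hypothetical placeholders.

```python
# Illustrative sketch (not the authors' code): a violin + swarm plot in the
# style of the SRT figures, using synthetic per-participant SRTs (dB SNR).
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
conditions = ["Unprocessed", "Algo-1", "Algo-2"]

# Hypothetical SRTs for 15 listeners per condition; the means are invented.
data = pd.DataFrame({
    "condition": np.repeat(conditions, 15),
    "srt_db": np.concatenate(
        [rng.normal(loc=mu, scale=2.0, size=15) for mu in (-2.0, -6.0, -4.0)]
    ),
})

# Violin shows distribution and density; inner box marks IQR and median.
ax = sns.violinplot(data=data, x="condition", y="srt_db", inner="box", cut=0)
# Swarm overlays individual participant scores on each violin.
sns.swarmplot(data=data, x="condition", y="srt_db", color="k", size=3, ax=ax)
ax.set_xlabel("Processing condition")
ax.set_ylabel("SRT (dB SNR)")
plt.tight_layout()
plt.savefig("srt_violin.png")
```

For the listening-effort benefit figures (captions 5 and 6), the same pattern applies with an SNR grouping variable passed as `hue` and the swarm layer omitted, as those captions note.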