
Noise-robust Speech Separation with Fast Generative Correction

Anonymous submission to Interspeech 2024


Speech separation, the task of isolating multiple speech sources from a mixed audio signal, remains challenging in noisy environments. In this paper, we propose a generative correction method to enhance the output of a discriminative separator. By leveraging a generative corrector based on a diffusion model, we refine the separation process for single-channel mixture speech by removing noise and perceptually unnatural distortions. Furthermore, we optimize the generative model using a predictive loss to streamline the diffusion model’s reverse process into a single step and to rectify errors introduced by that reverse process. Our method achieves state-of-the-art performance on the in-domain Libri2Mix noisy dataset and on out-of-domain WSJ with a variety of noises, improving SI-SNR by 22-35% relative to SepFormer and demonstrating robustness and strong generalization capabilities.
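
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of the commonly used scale-invariant signal-to-noise ratio (SI-SNR) and of a relative improvement over a baseline. It is not taken from the paper's code: the function names `si_snr` and `relative_improvement`, the `eps` stabilizer, and the reading of "relative" as a percentage change in dB over the baseline are all assumptions made for illustration.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length signals.

    Illustrative sketch only; `eps` is an assumed numerical stabilizer.
    """
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    n = len(reference)

    # Remove the mean of each signal so a DC offset does not affect the score.
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]

    # Project the estimate onto the reference to get the target component.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]

    # Whatever is left after the projection is counted as error.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(v * v for v in noise) + eps
    return 10.0 * math.log10(target_energy / noise_energy + eps)


def relative_improvement(baseline_db, new_db):
    """Percentage change of `new_db` over `baseline_db` (assumed definition)."""
    if baseline_db == 0:
        raise ValueError("baseline_db must be non-zero")
    return 100.0 * (new_db - baseline_db) / baseline_db
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is why the metric suits separators whose output gain is arbitrary. With hypothetical scores of 10.0 dB for a baseline and 12.5 dB for a corrected output, `relative_improvement(10.0, 12.5)` gives a 25% relative gain.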

Model Overview

Fast generative corrector


Sample 1 from the LibriMix noisy test set.

Mixture Reference (speaker 1) Estimated (SepFormer) Estimated (GeCo) Estimated (Fast-GeCo)
Image Image Image Image Image
  Reference (speaker 2) Estimated (SepFormer) Estimated (GeCo) Estimated (Fast-GeCo)
  Image Image Image Image

Sample 2 from the LibriMix noisy test set.

Mixture Reference (speaker 1) Estimated (SepFormer) Estimated (GeCo) Estimated (Fast-GeCo)
Image Image Image Image Image
  Reference (speaker 2) Estimated (SepFormer) Estimated (GeCo) Estimated (Fast-GeCo)
  Image Image Image Image

Sample 3 from the LibriMix noisy test set.

Mixture Reference (speaker 1) Estimated (SepFormer) Estimated (GeCo) Estimated (Fast-GeCo)
Image Image Image Image Image
  Reference (speaker 2) Estimated (SepFormer) Estimated (GeCo) Estimated (Fast-GeCo)
  Image Image Image Image

Sample 4 from the LibriMix noisy test set.

Mixture Reference (speaker 1) Estimated (SepFormer) Estimated (GeCo) Estimated (Fast-GeCo)
Image Image Image Image Image
  Reference (speaker 2) Estimated (SepFormer) Estimated (GeCo) Estimated (Fast-GeCo)
  Image Image Image Image

Sample 5 from the LibriMix noisy test set.

Mixture Reference (speaker 1) Estimated (SepFormer) Estimated (GeCo) Estimated (Fast-GeCo)
Image Image Image Image Image
  Reference (speaker 2) Estimated (SepFormer) Estimated (GeCo) Estimated (Fast-GeCo)
  Image Image Image Image

Sample 6 from the LibriMix noisy test set.

Mixture Reference (speaker 1) Estimated (SepFormer) Estimated (GeCo) Estimated (Fast-GeCo)
Image Image Image Image Image
  Reference (speaker 2) Estimated (SepFormer) Estimated (GeCo) Estimated (Fast-GeCo)
  Image Image Image Image

Sample 7 from the LibriMix noisy test set.

Mixture Reference (speaker 1) Estimated (SepFormer) Estimated (GeCo) Estimated (Fast-GeCo)
Image Image Image Image Image
  Reference (speaker 2) Estimated (SepFormer) Estimated (GeCo) Estimated (Fast-GeCo)
  Image Image Image Image

Sample 8 from the LibriMix noisy test set.

Mixture Reference (speaker 1) Estimated (SepFormer) Estimated (GeCo) Estimated (Fast-GeCo)
Image Image Image Image Image
  Reference (speaker 2) Estimated (SepFormer) Estimated (GeCo) Estimated (Fast-GeCo)
  Image Image Image Image

Sample 9 from the LibriMix noisy test set.

Mixture Reference (speaker 1) Estimated (SepFormer) Estimated (GeCo) Estimated (Fast-GeCo)
Image Image Image Image Image
  Reference (speaker 2) Estimated (SepFormer) Estimated (GeCo) Estimated (Fast-GeCo)
  Image Image Image Image

Sample 10 from the LibriMix noisy test set.

Mixture Reference (speaker 1) Estimated (SepFormer) Estimated (GeCo) Estimated (Fast-GeCo)
Image Image Image Image Image
  Reference (speaker 2) Estimated (SepFormer) Estimated (GeCo) Estimated (Fast-GeCo)
  Image Image Image Image