
Towards Reliable Objective Evaluation Metrics for Generative Singing Voice Separation

[Figure: Correlation results]

This repository contains the code accompanying the WASPAA 2025 paper "Towards Reliable Objective Evaluation Metrics for Generative Singing Voice Separation" by Paul A. Bereuter, Benjamin Stahl, Mark D. Plumbley and Alois Sontacchi.

ℹ️ Further information

📈 Benchmark your own metrics

A simple Python script to benchmark objective metrics on our DMOS data using the paper's correlation analysis is included in the Zenodo dataset available at DOI.
Instructions for benchmarking your own audio quality metrics are provided in the dataset’s Readme.md file.
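
As a rough sketch of what such a benchmark boils down to, the snippet below correlates a custom metric with the DMOS ratings. The CSV file name and the column names are placeholders; the actual names are documented in the dataset's Readme.md.

```python
# Minimal sketch of benchmarking a custom metric against the DMOS data.
# NOTE: the CSV file name and the column names ("DMOS", "my_metric") are
# placeholders -- consult the dataset's Readme.md for the actual names.
import pandas as pd
from scipy import stats

df = pd.read_csv("gensvs_eval_dmos_and_metrics.csv")  # hypothetical file name

dmos = df["DMOS"]         # DMOS ratings from the listening test
metric = df["my_metric"]  # your own metric, computed per audio sample

pearson_r, _ = stats.pearsonr(metric, dmos)
spearman_rho, _ = stats.spearmanr(metric, dmos)
print(f"Pearson r: {pearson_r:.3f}, Spearman rho: {spearman_rho:.3f}")
```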

🏃🏽‍♀️‍➡️ Model Inference and Embedding-based MSE Evaluation

If you want to use the proposed generative models (SGMSVS, MelRoFo (S) + BigVGAN) to separate singing voice from musical mixtures, or if you want to compute the embedding-based MSE on MERT or Music2Latent embeddings, take a look at our PyPI package.

🚀 Getting Started:

We have released a PyPI package that enables straightforward inference of the generative models (SGMSVS & MelRoFo (S) + BigVGAN) presented in the paper. It also facilitates the computation of the embedding-based mean squared error (MSE) for the MERT and Music2Latent embeddings, which exhibited the highest correlation with the DMOS data in our study. Besides enabling easy model inference and embedding-based MSE computation (currently only for MERT and Music2Latent), this package can be used to run all the training code provided in this GitHub repository. Information on how to set up an environment to run the inference and training code can be found below (see I.). You will also find instructions on setting up conda environments to reproduce the computation of all the metrics included in our study and to carry out the correlation analysis outlined in the paper (see II., III.).
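
For intuition, the embedding-based MSE boils down to extracting embeddings for the separated and the reference vocals with the same pretrained model and averaging the squared difference. The sketch below illustrates this idea with the public MERT checkpoint on Hugging Face; it is not the gensvs API, and the exact layer selection and pre-processing used in the paper are handled inside the package.

```python
# Conceptual sketch of an embedding-based MSE (illustrative only -- the gensvs
# package provides the properly configured implementation used in the paper).
import torch
import torchaudio
from transformers import AutoModel, Wav2Vec2FeatureExtractor

MODEL_ID = "m-a-p/MERT-v1-95M"  # public MERT checkpoint (expects 24 kHz audio)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True).eval()
processor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_ID, trust_remote_code=True)

def mert_embedding(path: str) -> torch.Tensor:
    wav, sr = torchaudio.load(path)
    wav = torchaudio.functional.resample(wav.mean(0), sr, processor.sampling_rate)
    inputs = processor(wav.numpy(), sampling_rate=processor.sampling_rate, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.last_hidden_state.squeeze(0)  # (frames, feature_dim)

est = mert_embedding("separated_vocals.wav")  # model output
ref = mert_embedding("reference_vocals.wav")  # ground-truth vocals
n = min(est.shape[0], ref.shape[0])           # align frame counts
mse = torch.mean((est[:n] - ref[:n]) ** 2).item()
print(f"MERT-embedding MSE: {mse:.6f}")
```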

I. gensvs_env: Conda environment for easy model inference, computation of MSE on MERT/Music2Latent embeddings and model training

  1. Create conda environment: $ conda env create -f ./env_info/gensvs_env.yml
  2. Activate conda environment: $ conda activate gensvs_env
  3. Install the gensvs package: $ pip install gensvs
  4. [Alternative] Run the bash script that sets up the conda environment and installs gensvs from the root directory: $ ./env_info/setup_gensvs_env.sh

II. gensvs_eval_env: Conda environment for computation of all non-embedding-based intrusive and non-intrusive metrics and for the correlation analysis

Set up the following environment for evaluating all non-embedding-based intrusive and non-intrusive metrics of the paper, as well as for plot and *.csv export.

  1. Create conda environment: $ conda env create -f ./env_info/svs_eval_env.yml
  2. Activate conda environment: $ conda activate gensvs_eval_env
  3. Install additional Python dependencies: $ pip install -r ./env_info/svs_eval_env.txt
  4. Build the ViSQOL API and place it in a folder within the root directory.
  5. To compute metrics with the PEASS toolkit in ./01_evaluation_and_correlation/peass_v2.0.1, MATLAB (version < 2025a) is required to run the MATLAB Engine API for Python.

III. gensvs_fad_mse_eval_env: Conda environment for evaluating intrusive embedding-based metrics

To compute the FAD and MSE evaluation metrics in addition to the MERT/Music2Latent MSEs, you'll need an additional conda environment. Please follow the steps below to set it up:

  1. Create the conda environment: $ conda env create -f ./env_info/svs_fad_mse_eval_env.yml
  2. Install additional Python dependencies: $ pip install -r ./env_info/svs_fad_mse_eval_env.txt
  3. Test fadtk installation with: $ python -m fadtk.test

🧮 Evaluation and Correlation Analysis

The folder 01_evaluation_and_correlation collects all code to compute the objective evaluation metrics, evaluate the DCR listening test results, and carry out the correlation analysis of the paper.

All metrics, listening test results, DMOS ratings, and the audio used for both metric computation and loudness-normalized listening tests are available on Zenodo: DOI, where a single CSV file combines all computed metrics, DMOS ratings, and individual ratings for each audio sample. The dataset also includes instructions and an example Python script to help you benchmark your own audio quality metric by calculating correlation coefficients on our dataset. If you are only interested in the final data, you can ignore the individual CSV files located in the ./04_evaluation_data folder, as they reflect intermediate steps during the paper’s development.
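
If you want to explore the combined file directly, a small pandas snippet along the following lines ranks the included metrics by their correlation with the DMOS ratings. The file name and the DMOS column name are assumptions; the actual names are given in the dataset's Readme.md.

```python
# Rank the metrics in the combined CSV by their (absolute) Spearman correlation
# with the DMOS ratings. File and column names are placeholders -- see the
# dataset's Readme.md for the actual ones.
import pandas as pd

df = pd.read_csv("gensvs_eval_combined.csv")  # hypothetical file name
metric_cols = [c for c in df.select_dtypes("number").columns if c != "DMOS"]

corr = df[metric_cols].corrwith(df["DMOS"], method="spearman").abs()
print(corr.sort_values(ascending=False).head(10))
```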

Reproduce the evaluation carried out in the paper

If you want to reproduce how we calculated all metrics, please download the data from DOI and copy the data into the root directory. Then follow the steps below:

Compute Objective Evaluation Metrics

To calculate all objective metrics mentioned in the paper, three Python scripts are required. The evaluation of PAM as well as of the FAD & MSE metrics is carried out in separate scripts. Computing the FAD & MSE metrics requires the conda environment gensvs_fad_mse_eval_env; all other metrics can be computed with gensvs_eval_env.

Compute FAD & MSE metrics

To compute the FAD and MSE metrics, we modified the code of Microsoft's fadtk [5]. The modified code can be found in ./01_evaluation_and_correlation/fadtk_mod. The metrics can be computed with a Python script; to show how the evaluation script is called, we have added example bash scripts.
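
For reference, the FAD between a set of reference embeddings and a set of estimate embeddings is the Fréchet distance between Gaussians fitted to the two sets. A minimal stand-alone sketch (not the fadtk_mod code) looks like this:

```python
# Minimal sketch of the Frechet distance between two embedding sets, shown for
# reference only -- the actual evaluation uses the modified fadtk code.
import numpy as np
from scipy import linalg

def frechet_distance(emb_ref: np.ndarray, emb_est: np.ndarray) -> float:
    """emb_*: arrays of shape (num_frames, embedding_dim)."""
    mu_r, mu_e = emb_ref.mean(axis=0), emb_est.mean(axis=0)
    cov_r = np.cov(emb_ref, rowvar=False)
    cov_e = np.cov(emb_est, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_e)  # matrix square root of the product
    if np.iscomplexobj(covmean):           # drop tiny imaginary parts from numerics
        covmean = covmean.real
    diff = mu_r - mu_e
    return float(diff @ diff + np.trace(cov_r + cov_e - 2.0 * covmean))
```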

Compute PAM scores

To compute the PAM scores (https://github.com/soham97/PAM), please use the following scripts:

Compute all other non-intrusive and intrusive metrics (BSS-Eval, PEASS, SINGMOS, XLS-R-SQA, Audiobox-AES)

To compute all other metrics mentioned in the paper, please use:
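
As an illustration of what the BSS-Eval part measures, SDR can for example be computed with mir_eval as sketched below; the repository's own scripts may rely on a different BSS-Eval implementation, and the file names are placeholders.

```python
# Illustrative BSS-Eval (SDR) computation with mir_eval; the repository's own
# scripts may use a different implementation, and the file names are placeholders.
import numpy as np
import soundfile as sf
import mir_eval

def load_mono(path: str) -> np.ndarray:
    audio, _ = sf.read(path)
    return audio.mean(axis=1) if audio.ndim > 1 else audio

ref = load_mono("reference_vocals.wav")  # ground-truth vocals
est = load_mono("separated_vocals.wav")  # model output
n = min(len(ref), len(est))              # align lengths

sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(
    ref[np.newaxis, :n], est[np.newaxis, :n]
)
print(f"SDR: {sdr[0]:.2f} dB")
```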

Subjective Evaluation and Correlation Analysis

To evaluate the DMOS data, reproduce the correlation analysis results, and export all plots and *.csv files, execute the Python script located under:

🏋🏽‍♀️🏃🏽‍♀️‍➡️ Training and Inference

Note: Inference with the generative models (SGMSVS and MelRoFo (S) + BigVGAN) can be run using gensvs, either via the included command-line tool or via a Python script (see gensvs).

The folder 00_training_and_inference contains all code required to carry out training and inference for all models mentioned in the paper. To run this code, please set up the conda environment gensvs_env with gensvs installed (see I. above).

Trained models

  • HTDemucs, MelRoFo (S), SGMSVS and the finetuned BigVGAN checkpoints (generator and discriminator) can be downloaded from Hugging Face
  • The model checkpoint for MelRoFo (L) can be downloaded here.

Inference

If you want to run inference for all models using the Python code provided in this GitHub repository, you can either use the Python inference scripts directly or use the provided bash scripts, which show how to call the Python scripts:

Discriminative baseline models: HTDemucs, MelRoFo (S) & MelRoFo (L)

To run the mask-based baseline models on a folder of musical mixtures, the following scripts can be used:

Generative models: SGMSVS & MelRoFo (S) + BigVGAN

To run the generative models on new data, use the gensvs pip package (see 🚀 Getting Started above).

Training

Below, the Python training scripts and example bash scripts are listed for all models, allowing you to reproduce all trainings with the parameters mentioned in the paper:

Discriminative Baselines: HTDemucs & MelRoFo (S)

To train the discriminative mask-based baselines, use:

Generative models: SGMSVS & MelRoFo (S) + BigVGAN

To train the SGMSVS model, or to task-specifically finetune BigVGAN for singing voice separation with MelRoFo (S), you can use:

Third-party code

All third-party code is contained in separate folders, each of which is explicitly listed in this README.md file. Where LICENSE files exist for these third-party folders, they are located within the respective directories.

The third-party directories are:

References and Acknowledgements

The inference code for the SGMSVS model was built upon the code made available in:

@article{richter2023speech,
         title={Speech Enhancement and Dereverberation with Diffusion-based Generative Models},
         author={Richter, Julius and Welker, Simon and Lemercier, Jean-Marie and Lay, Bunlong and Gerkmann, Timo},
         journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
         volume={31},
         pages={2351-2364},
         year={2023},
         doi={10.1109/TASLP.2023.3285241}
        }

The inference code for MelRoFo (S) + BigVGAN was put together from the code available at:

@misc{solovyev2023benchmarks,
      title={Benchmarks and leaderboards for sound demixing tasks}, 
      author={Roman Solovyev and Alexander Stempkovskiy and Tatiana Habruseva},
      year={2023},
      eprint={2305.07489},
      archivePrefix={arXiv},
      howpublished={\url{https://github.com/ZFTurbo/Music-Source-Separation-Training}},
      primaryClass={cs.SD},
      url={https://github.com/ZFTurbo/Music-Source-Separation-Training}
      }
@misc{jensen2024melbandroformer,
      author       = {Kimberley Jensen},
      title        = {Mel-Band-Roformer-Vocal-Model},
      year         = {2024},
      howpublished = {\url{https://github.com/KimberleyJensen/Mel-Band-Roformer-Vocal-Model}},
      note         = {GitHub repository},
      url          = {https://github.com/KimberleyJensen/Mel-Band-Roformer-Vocal-Model}
    }
@inproceedings{lee2023bigvgan,
               title={BigVGAN: A Universal Neural Vocoder with Large-Scale Training},
               author={Sang-gil Lee and Wei Ping and Boris Ginsburg and Bryan Catanzaro and Sungroh Yoon},
               booktitle={Proc. ICLR},
               year={2023},
               url={https://openreview.net/forum?id=iTtGCMDEzS_}
              }

The whole evaluation code was created using Microsoft's Frechet Audio Distance Toolkit as a template:

@inproceedings{fadtk,
               title = {Adapting Frechet Audio Distance for Generative Music Evaluation},
               author = {Azalea Gui and Hannes Gamper and Sebastian Braun and Dimitra Emmanouilidou},
               booktitle = {Proc. IEEE ICASSP 2024},
               year = {2024},
               url = {https://arxiv.org/abs/2311.01616},
              }

If you use the MERT or Music2Latent MSE, please also cite the original works in which the embeddings were proposed.

For MERT:

@misc{li2023mert,
      title={MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training}, 
      author={Yizhi Li and Ruibin Yuan and Ge Zhang and Yinghao Ma and Xingran Chen and Hanzhi Yin and Chenghua Lin and Anton Ragni and Emmanouil Benetos and Norbert Gyenge and Roger Dannenberg and Ruibo Liu and Wenhu Chen and Gus Xia and Yemin Shi and Wenhao Huang and Yike Guo and Jie Fu},
      year={2023},
      eprint={2306.00107},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}

For Music2Latent:

@inproceedings{pasini2024music2latent,
  author       = {Marco Pasini and Stefan Lattner and George Fazekas},
  title        = {{Music2Latent}: Consistency Autoencoders for Latent Audio Compression},
  booktitle    = {Proc. ISMIR},
  year         = 2024,
  pages        = {111-119},
  venue        = {San Francisco, California, USA and Online},
  doi          = {10.5281/zenodo.14877289},
}

Citation

If you use any parts of our code, our data or the gensvs package in your work, please cite our paper and the work that formed the basis of this research.

Our paper can be cited with:

@misc{bereuter2025,
      title={Towards Reliable Objective Evaluation Metrics for Generative Singing Voice Separation Models}, 
      author={Paul A. Bereuter and Benjamin Stahl and Mark D. Plumbley and Alois Sontacchi},
      year={2025},
      eprint={2507.11427},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2507.11427}, 
}
