WARNING: THIS IS NOT SUPPORTED DURING DEVELOPMENT AND YOU USE IT AT YOUR OWN RISK.
Consider the main repo a development environment: things could break and be left in that state for a while. I'm a hobbyist messing about with this to see if it can speed up sound-stage duties for short videos in ComfyUI. It tested working for one-shot image results, but needs some more work.
Adaptation is underway to see if it can work with ComfyUI for image, or ideally video, ambience when applying audio to a scene clip created in ComfyUI. I would recommend not installing this version at this stage while it is being worked on. I will post updates here if it becomes something I think is going to be useful and usable - mdkberry (August 2025)
Nikhil Singh, Jeff Mentch, Jerry Ng, Matthew Beveridge, Iddo Drori
Code for the ICCV 2021 paper [arXiv]. Image2Reverb is a method for generating audio impulse responses from a 2D image of an environment, simulating that environment's acoustic reverberation.
Updated to work in a conda environment with PyTorch 2.7 and CUDA 12.8; see requirements.txt and environment.yml for more information.
- Images should be 512x512 pixels for best results
- Supported formats: JPG, PNG, BMP, TIFF
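If you want to check or resize images yourself before running the model, here is a minimal sketch using Pillow (file paths are placeholders; the script described below also resizes automatically):

```python
# Sketch: prepare an input image for Image2Reverb.
# Paths are placeholders; 512x512 and the supported formats follow the notes above.
from PIL import Image

def prepare_image(src_path, dst_path, size=(512, 512)):
    img = Image.open(src_path).convert("RGB")   # accepts JPG/PNG/BMP/TIFF
    if img.size != size:
        img = img.resize(size, Image.LANCZOS)   # resize to 512x512 for best results
    img.save(dst_path)

prepare_image("photos/cathedral.jpg", "inputs/cathedral_512.png")
```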
To generate an impulse response from a single image, you can use the provided run_single_image.py script:
python run_single_image.py --image_path path/to/your/image.jpg --output_dir ./results

This script will:
- Resize your image to 512x512 pixels if needed
- Create a temporary dataset structure
- Run the model on your image
- Save the results in the specified output directory
- Clean up temporary files
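If you need impulse responses for several images (for example, one per scene clip), one option is to call the script in a loop. Here is a minimal sketch using subprocess, assuming only the flags documented above:

```python
# Sketch: run run_single_image.py over a folder of images.
# Assumes the script accepts --image_path and --output_dir as documented above;
# folder names are placeholders.
import subprocess
from pathlib import Path

image_dir = Path("inputs")
for image_path in sorted(image_dir.glob("*.jpg")):
    out_dir = Path("results") / image_path.stem
    subprocess.run(
        ["python", "run_single_image.py",
         "--image_path", str(image_path),
         "--output_dir", str(out_dir)],
        check=True,
    )
```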
The required pre-trained models should be placed in the models folder:
- Places365 ResNet50 model: models/resnet50_places365.pth.tar
- Monodepth2 models: models/mono_640x192/ folder containing encoder.pth and depth.pth
- Image2Reverb checkpoint: models/model.ckpt
If you haven't already downloaded these models, you can get them from:
- Places365 ResNet50 model: http://places2.csail.mit.edu/models_places365/resnet50_places365.pth.tar
- Monodepth2 models: From https://github.com/nianticlabs/monodepth2
- Image2Reverb checkpoint: https://media.mit.edu/~nsingh1/image2reverb/model.ckpt
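As a convenience, the two directly linked files can be fetched with a short script; a minimal sketch is below (the monodepth2 weights must still be obtained from the monodepth2 repository and placed in models/mono_640x192/):

```python
# Sketch: download the two directly linked checkpoints into the models/ folder.
# URLs are the ones listed above; the monodepth2 weights are not downloaded here.
import urllib.request
from pathlib import Path

models = Path("models")
models.mkdir(exist_ok=True)

downloads = {
    "resnet50_places365.pth.tar":
        "http://places2.csail.mit.edu/models_places365/resnet50_places365.pth.tar",
    "model.ckpt":
        "https://media.mit.edu/~nsingh1/image2reverb/model.ckpt",
}

for name, url in downloads.items():
    target = models / name
    if not target.exists():
        print(f"Downloading {name}...")
        urllib.request.urlretrieve(url, str(target))
```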
Here's what each of the generated files contains and how to use them:
- results/test/test.wav: The main output - the impulse response (IR) audio file generated from your input image. It simulates how sound would reverberate in the environment depicted by your image. You can use this IR in audio processing software or a digital audio workstation (DAW) to apply realistic reverb to your audio.
- results/test/input.png: A visualization of your input image as processed by the model, including any preprocessing or modifications the model applies.
- results/test/depth.png: The estimated depth map of your input image, which the model uses to understand the 3D structure of the environment.
- results/test/spec.png: A spectrogram of the generated impulse response, showing how the frequency content of the reverb changes over time.
- results/t60.json: RT60 values (reverberation times) for different frequency bands. RT60 is the time it takes for sound to decay by 60 dB, an important acoustic parameter.
- results/t60.png: A box plot of the RT60 values.
- results/t60_err.npy: Numerical data for the T60 error metrics.
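To inspect the RT60 outputs programmatically, something like the following works; note that the exact structure of t60.json is an assumption here, so check it against your own output:

```python
# Sketch: inspect the RT60 outputs. The exact layout of t60.json is not
# documented here, so treat the access pattern below as an assumption.
import json
import numpy as np

with open("results/t60.json") as f:
    t60 = json.load(f)
print(t60)  # reverberation times per frequency band

t60_err = np.load("results/t60_err.npy")
print("T60 error metrics:", t60_err)
```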
To use the generated impulse response:
- Take the test.wav file from the results/test/ directory
- Import it into your audio software or DAW
- Use it as an impulse response in a convolution reverb plugin
- Process your dry audio with the reverb to simulate the acoustics of the environment in your input image
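If you want to apply the reverb in Python rather than in a DAW plugin, ordinary convolution is enough for a quick check. Here is a minimal sketch using soundfile and scipy, with placeholder file names:

```python
# Sketch: apply the generated IR to a dry recording with convolution reverb.
# File names are placeholders; a DAW convolution reverb plugin gives more control.
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

dry, sr = sf.read("dry_dialogue.wav")          # your dry audio
ir, ir_sr = sf.read("results/test/test.wav")   # generated impulse response
assert sr == ir_sr, "resample one of the files so the sample rates match"

# Convolve each channel with a mono IR and normalize to avoid clipping.
if dry.ndim == 1:
    dry = dry[:, None]
if ir.ndim > 1:
    ir = ir.mean(axis=1)
wet = np.stack([fftconvolve(dry[:, c], ir) for c in range(dry.shape[1])], axis=1)
wet /= np.max(np.abs(wet)) + 1e-9

sf.write("wet_dialogue.wav", wet, sr)
```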
The model converts your 2D image of a space into an impulse response audio file that simulates the acoustics of that environment.
Here are some examples of the Image2Reverb model's output for different environments:
Cathedral Impulse Response (WAV)
Bedroom Impulse Response (WAV)
Empty Field Impulse Response (WAV)
We borrow and adapt code snippets from GANSynth (and this PyTorch re-implementation), additional snippets from this PGGAN implementation, monodepth2, this GradCAM implementation, and more.
If you find the code, data, or models useful for your research, please consider citing our paper:
@InProceedings{Singh_2021_ICCV,
author = {Singh, Nikhil and Mentch, Jeff and Ng, Jerry and Beveridge, Matthew and Drori, Iddo},
title = {Image2Reverb: Cross-Modal Reverb Impulse Response Synthesis},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2021},
pages = {286-295}
}