- [11.06.2025] 📃 Released the technical report of VGGSounder, with a detailed discussion of how we built the first multimodal benchmark for video tagging with complete per-modality annotations for every class.
VGGSounder is a re-annotated benchmark built on top of the VGGSound dataset, designed to rigorously evaluate audio-visual foundation models and to analyse how they use each modality. VGGSounder introduces:
- 🔍 Per-label modality tags (audible / visible / both) for every class in each sample
- 🎵 Meta labels for background music, voice-over, and static images
- 📊 Multiple labels per sample
The VGGSounder dataset is now available as a Python package! Install it via pip:
```bash
pip install vggsounder
```

Or install from source using uv:

```bash
git clone https://github.com/bizilizi/vggsounder.git
cd vggsounder
uv build
pip install dist/vggsounder-*.whl
```

```python
import vggsounder
# Load the dataset
labels = vggsounder.VGGSounder()
# Access video data by ID
video_data = labels["--U7joUcTCo_000000"]
print(video_data.labels) # List of labels for this video
print(video_data.meta_labels) # Metadata (background_music, static_image, voice_over)
print(video_data.modalities) # Modality for each label (A, V, AV)
# Get dataset statistics
stats = labels.stats()
print(f"Total videos: {stats['total_videos']}")
print(f"Unique labels: {stats['unique_labels']}")
# Search functionality
piano_videos = labels.get_videos_with_labels("playing piano")
voice_over_videos = labels.get_videos_with_meta(voice_over=True)

# Dict-like interface
print(len(labels)) # Number of videos
print("video_id" in labels) # Check if video exists
for video_id in labels: # Iterate over video IDs
    video_data = labels[video_id]
# Get all unique labels
all_labels = labels.get_all_labels()
# Complex queries
static_speech_videos = labels.get_videos_with_meta(
    static_image=True, voice_over=True
)
```

VGGSounder annotations are stored in two CSV files, vggsounder/data/vggsounder.csv and vggsounder/data/vggsounder+background-music.csv. Each row corresponds to a single label for a specific video sample. The dataset supports multi-label, multi-modal classification with additional meta-information for robust evaluation.
- `video_id`: Unique identifier for a 10-second video clip.
- `label`: Human-readable label representing a sound or visual category (e.g. `male singing`, `playing timpani`).
- `modality`: The modality in which the label is perceivable: `A` = audible, `V` = visible, `AV` = both audible and visible.
- `background_music`: `True` if the video contains background music.
- `static_image`: `True` if the video consists of a static image.
- `voice_over`: `True` if the video contains voice-over narration.
| video_id | label | modality | background_music | static_image | voice_over |
|---|---|---|---|---|---|
| ---g-f_I2yQ_000001 | male singing | A | True | False | False |
| ---g-f_I2yQ_000001 | people crowd | AV | True | False | False |
| ---g-f_I2yQ_000001 | playing timpani | A | True | False | False |
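For quick inspection outside the Python package, the raw CSV can also be read directly. The following is a minimal sketch using pandas (not a dependency of the package itself), assuming the repository has been cloned and the code is run from its root:

```python
# A minimal sketch for exploring the raw annotation CSV with pandas.
import pandas as pd

df = pd.read_csv("vggsounder/data/vggsounder.csv")

# Number of labels per modality tag (A, V, AV)
print(df["modality"].value_counts())

# All labels annotated for a single 10-second clip
clip = df[df["video_id"] == "---g-f_I2yQ_000001"]
print(clip[["label", "modality"]])

# Videos that are static images with voice-over narration
# (string comparison keeps this robust to how pandas parses the True/False columns)
static_voice = df[
    (df["static_image"].astype(str) == "True") & (df["voice_over"].astype(str) == "True")
]
print(static_voice["video_id"].nunique(), "videos")
```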
VGGSounder provides a comprehensive benchmarking system to evaluate audio-visual foundation models across multiple modalities and metrics. The benchmark supports both discrete predictions and continuous logits-based evaluation.
Results are reported over the following modality subsets:
- `a`: Audio, i.e. samples with an audio component (A + AV)
- `v`: Visual, i.e. samples with a visual component (V + AV)
- `av`: Audio-visual, i.e. samples with both modalities (AV only)
- `a only`: Audio-only, i.e. pure audio samples (excludes AV samples)
- `v only`: Visual-only, i.e. pure visual samples (excludes AV samples)
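These subsets follow directly from the per-label modality tags. Below is a small sketch of that mapping for a single clip; it assumes `video_data.modalities` is aligned one-to-one with `video_data.labels` (as in the usage example above) and is not necessarily the exact subset construction used inside `vggsounder.benchmark`:

```python
# Sketch: group the labels of one clip into the evaluation subsets described above.
# Assumes `modalities` is aligned one-to-one with `labels`; not the reference implementation.
import vggsounder

labels = vggsounder.VGGSounder()
video = labels["--U7joUcTCo_000000"]

subsets = {"a": [], "v": [], "av": [], "a only": [], "v only": []}
for label, modality in zip(video.labels, video.modalities):
    if modality in ("A", "AV"):
        subsets["a"].append(label)       # audible component
    if modality in ("V", "AV"):
        subsets["v"].append(label)       # visible component
    if modality == "AV":
        subsets["av"].append(label)      # both modalities
    if modality == "A":
        subsets["a only"].append(label)  # purely audible
    if modality == "V":
        subsets["v only"].append(label)  # purely visible

print(subsets)
```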
The benchmark computes a comprehensive set of metrics:
- Top-k metrics: `hit_rate@k`, `f1@k`, `accuracy@k`, `precision@k`, `recall@k`, `jaccard@k` (for k = 1, 3, 5, 10)
- Aggregate metrics: `f1`, `f1_macro`, `accuracy`, `precision`, `recall`, `jaccard`, `hit_rate`
- AUC metrics: `auc_roc`, `auc_pr` (ROC-AUC and Precision-Recall AUC)
- Modality confusion: `mu` (measures when single modalities succeed where multimodal fails)
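The idea behind the modality-confusion metric `mu` can be illustrated with a short sketch: a sample counts as confused for a modality when that unimodal prediction hits at least one ground-truth label while the audio-visual prediction hits none. The snippet below only illustrates this definition and is not the reference implementation in `vggsounder.benchmark`:

```python
# Illustrative sketch of modality confusion; not the reference implementation.

def hits(predicted, ground_truth):
    """True if at least one predicted label matches the ground truth."""
    return len(set(predicted) & set(ground_truth)) > 0

def modality_confusion_rate(samples, modality):
    """Fraction of samples where the given unimodal prediction succeeds but AV fails.

    `samples` is assumed to be a list of dicts with keys
    "ground_truth", "pred_a", "pred_v", and "pred_av".
    """
    confused = sum(
        1
        for s in samples
        if hits(s[f"pred_{modality}"], s["ground_truth"])
        and not hits(s["pred_av"], s["ground_truth"])
    )
    return confused / max(len(samples), 1)

# Toy example based on the confused sample shown later in this README
example = [
    {
        "ground_truth": ["horse neighing", "male speech, man speaking"],
        "pred_a": ["male speech, man speaking"],
        "pred_v": [],
        "pred_av": ["horse clip-clop"],
    }
]
print(modality_confusion_rate(example, "a"))  # 1.0: audio hits, AV misses
```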
Model predictions should be saved as pickle files with the following structure:
```python
{
    "video_id": {
        "predictions": {                      # Optional: discrete predictions
            "a": ["label1", "label2", ...],   # Audio predictions
            "v": ["label1", "label3", ...],   # Visual predictions
            "av": ["label1", "label2", ...],  # Audio-visual predictions
        },
        "logits": {                           # Optional: continuous scores
            "a": [0.1, 0.8, 0.3, ...],        # Audio logits (310 classes)
            "v": [0.2, 0.1, 0.9, ...],        # Visual logits (310 classes)
            "av": [0.4, 0.6, 0.2, ...],       # Audio-visual logits (310 classes)
        },
    },
    # ... more video_ids
}
```

Note: Either `predictions` or `logits` (or both) should be provided. Logits enable more detailed top-k and AUC analysis.
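As a concrete, hypothetical example, the snippet below writes dummy predictions for a single clip in this format. The model name and predicted labels are placeholders; the resulting `.pkl` file goes into the directory passed to the benchmark as `models_path`, with the file name (without `.pkl`) apparently serving as the model name (cf. `model_name` and `display_names` below):

```python
# Hypothetical example of writing a predictions pickle in the expected format.
# The video ID is taken from the dataset; the predicted labels are dummy placeholders.
import pickle

results = {
    "--U7joUcTCo_000000": {
        "predictions": {
            "a": ["male speech, man speaking"],
            "v": ["people crowd"],
            "av": ["male speech, man speaking", "people crowd"],
        },
        # "logits" could additionally map "a"/"v"/"av" to 310-dimensional score lists.
    },
}

# Save as <model-name>.pkl inside the directory passed to benchmark(models_path=...)
with open("my-model.pkl", "wb") as f:
    pickle.dump(results, f)
```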
```python
from vggsounder.benchmark import benchmark
# Define model display names
display_names = {
"cav-mae": "CAV-MAE",
"deepavfusion": "DeepAVFusion",
"equiav": "Equi-AV",
"gemini-1.5-flash": "Gemini 1.5 Flash",
"gemini-1.5-pro": "Gemini 1.5 Pro"
}
# Specify metrics and modalities to evaluate
metrics = [
("accuracy", ["a", "v", "av"]),
("f1", ["a", "v", "av", "a only", "v only"]),
("hit_rate", ["a", "v", "av"]),
("mu", ["a", "v", "av"]) # Modality confusion
]
# Run benchmark
results_table = benchmark(
models_path="path/to/model/pickles",
display_names=display_names,
metrics=metrics
)
print(results_table)For a detailed example of how we generate the tables used in our paper, please see the example notebook.
VGGSounder provides a specialized function for analyzing modality confusion at the sample level, helping you understand why certain samples exhibit confusion between unimodal and multimodal predictions.
```python
from vggsounder.benchmark import analyze_modality_confusion_detailed
from vggsounder import VGGSounder
# Analyze modality confusion for a specific model
confusion_analysis = analyze_modality_confusion_detailed(
models_path="path/to/model/pickles",
model_name="gemini-1.5-flash", # Model name without .pkl extension
vggsounder=VGGSounder(background_music=None, voice_over=None, static_image=None)
)
print(f"Found {len(confusion_analysis)} samples with modality confusion")
# Filter by specific confusion types
audio_confused = confusion_analysis[confusion_analysis['confused_a'] == True]
visual_confused = confusion_analysis[confusion_analysis['confused_v'] == True]
combined_confused = confusion_analysis[confusion_analysis['confused_av'] == True]
print(f"Audio confusion: {len(audio_confused)} samples")
print(f"Visual confusion: {len(visual_confused)} samples")
print(f"Combined confusion: {len(combined_confused)} samples")
# Examine specific confused samples
display_cols = ['id', 'ground_truth', 'pred_a', 'pred_v', 'pred_av', 'confused_a', 'confused_v', 'confused_av']
print("\nFirst 3 audio-confused samples:")
print(audio_confused[display_cols].head(3))
# Example: Find samples that are audio-confused but not visual-confused
audio_only_confused = confusion_analysis[
    (confusion_analysis['confused_a'] == True) &
    (confusion_analysis['confused_v'] == False)
]
print(f"Audio-only confusion: {len(audio_only_confused)} samples")Example Output:
Total samples analyzed: 2625
Audio-confused samples: 2228
Visual-confused samples: 612
Combined-confused samples: 215
First 3 audio-confused samples:
id ground_truth pred_a pred_v pred_av confused_a confused_v confused_av
0 -0jeONf82dE_000021 [horse neighing, male speech, man speaking] [male speech, man speaking] [] [horse clip-clop] True False False
1 -3Kv4fdm7Uk_000030 [plastic bottle crushing, playing flute, playing sitar] [male speech, man speaking, playing flute] [playing steelpan] [playing steelpan] True False False
2 -3RH8_aeZkk_000105 [male speech, man speaking] [male speech, man speaking] [] [] True False False
Example sample details:
id: -0jeONf82dE_000021
ground_truth: ['horse neighing', 'male speech, man speaking']
pred_a: ['male speech, man speaking']
pred_v: []
pred_av: ['horse clip-clop']
confused_a: True
confused_v: False
confused_av: False
```
Output DataFrame Columns:
- `id`: Sample ID
- `ground_truth`: Ground-truth labels
- `pred_av`, `pred_a`, `pred_v`: Predictions for each modality
- `confused_a`: Boolean, audio confusion (audio hits when AV fails)
- `confused_v`: Boolean, visual confusion (visual hits when AV fails)
- `confused_av`: Boolean, combined confusion (both A and V hit when AV fails)
This analysis helps identify patterns in model failures and understand why certain samples cause modality confusion, enabling qualitative analysis of multimodal integration issues.
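One simple follow-up analysis is to see which ground-truth labels dominate the confused samples. The sketch below re-runs the detailed analysis and aggregates the audio-confused rows; it assumes the `ground_truth` column holds a list of labels per sample, as in the example output above:

```python
# Sketch: which ground-truth labels appear most often among audio-confused samples?
from vggsounder import VGGSounder
from vggsounder.benchmark import analyze_modality_confusion_detailed

confusion_analysis = analyze_modality_confusion_detailed(
    models_path="path/to/model/pickles",
    model_name="gemini-1.5-flash",
    vggsounder=VGGSounder(background_music=None, voice_over=None, static_image=None),
)

label_counts = (
    confusion_analysis[confusion_analysis["confused_a"]]["ground_truth"]
    .explode()        # one row per (sample, ground-truth label) pair
    .value_counts()
    .head(10)
)
print("Most frequent ground-truth labels among audio-confused samples:")
print(label_counts)
```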
If you find VGGSounder useful for your research and applications, please consider citing us using this BibTeX:
```bibtex
@inproceedings{zverevwiedemer2025vggsounder,
  author    = {Daniil Zverev and Thaddäus Wiedemer and Ameya Prabhu and Matthias Bethge and Wieland Brendel and A. Sophia Koepke},
  title     = {VGGSounder: Audio-Visual Evaluations for Foundation Models},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year      = {2025}
}
```

The authors would like to thank Felix Förster, Sayak Mallick, and Prasanna Mayilvahananan for their help with data annotation, as well as Thomas Klein and Shyamgopal Karthik for their help in setting up MTurk. They also thank numerous MTurk workers for labelling. This work was in part supported by the BMBF (FKZ: 01IS24060, 01I524085B), the DFG (SFB 1233, TP A1, project number: 276693517), and the Open Philanthropy Foundation funded by the Good Ventures Foundation. The authors thank the IMPRS-IS for supporting TW.
This project is released under the Apache 2.0 license as found in the LICENSE file. Please get in touch with us if you find any potential violations.