- [11.06.2025] 📃 Released the technical report of VGGSounder, with a detailed discussion of how we built the first multimodal benchmark for video tagging with complete per-modality annotations for every class.
VGGSounder is a re-annotated benchmark built on top of the VGGSound dataset, designed to rigorously evaluate audio-visual foundation models and to analyse how they use each modality. VGGSounder introduces:
- 🔍 Per-label modality tags (audible / visible / both) for every class in each sample
- 🎵 Meta labels for background music, voice-over, and static images
- 📊 Multiple labels per sample
The VGGSounder dataset is now available as a Python package! Install it via pip:
```bash
pip install vggsounder
```

Or install from source using uv:

```bash
git clone https://github.com/bizilizi/vggsounder.git
cd vggsounder
uv build
pip install dist/vggsounder-*.whl
```

```python
import vggsounder
# Load the dataset
labels = vggsounder.VGGSounder()
# Access video data by ID
video_data = labels["--U7joUcTCo_000000"]
print(video_data.labels) # List of labels for this video
print(video_data.meta_labels) # Metadata (background_music, static_image, voice_over)
print(video_data.modalities) # Modality for each label (A, V, AV)
# Get dataset statistics
stats = labels.stats()
print(f"Total videos: {stats['total_videos']}")
print(f"Unique labels: {stats['unique_labels']}")
# Search functionality
piano_videos = labels.get_videos_with_labels("playing piano")
voice_over_videos = labels.get_videos_with_meta(voice_over=True)

# Dict-like interface
print(len(labels)) # Number of videos
print("video_id" in labels) # Check if video exists
for video_id in labels: # Iterate over video IDs
    video_data = labels[video_id]
# Get all unique labels
all_labels = labels.get_all_labels()
# Complex queries
static_speech_videos = labels.get_videos_with_meta(
    static_image=True, voice_over=True
)
```

VGGSounder annotations are stored in two CSV files, vggsounder/data/vggsounder.csv and vggsounder/data/vggsounder+background-music.csv. Each row corresponds to a single label for a specific video sample. The dataset supports multi-label, multi-modal classification with additional meta-information for robust evaluation.
- `video_id`: Unique identifier for a 10-second video clip.
- `label`: Human-readable label representing a sound or visual category (e.g. `male singing`, `playing timpani`).
- `modality`: The modality in which the label is perceivable: `A` = audible, `V` = visible, `AV` = both audible and visible.
- `background_music`: `True` if the video contains background music.
- `static_image`: `True` if the video consists of a static image.
- `voice_over`: `True` if the video contains voice-over narration.
| video_id | label | modality | background_music | static_image | voice_over |
|---|---|---|---|---|---|
| ---g-f_I2yQ_000001 | male singing | A | True | False | False |
| ---g-f_I2yQ_000001 | people crowd | AV | True | False | False |
| ---g-f_I2yQ_000001 | playing timpani | A | True | False | False |
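For quick inspection outside the Python package, the raw CSV can also be read directly. The following is a minimal sketch using pandas (not a dependency of the package itself), assuming the repository has been cloned and the code is run from its root:

```python
# A minimal sketch for exploring the raw annotation CSV with pandas.
import pandas as pd

df = pd.read_csv("vggsounder/data/vggsounder.csv")

# Number of labels per modality tag (A, V, AV)
print(df["modality"].value_counts())

# All labels annotated for a single 10-second clip
clip = df[df["video_id"] == "---g-f_I2yQ_000001"]
print(clip[["label", "modality"]])

# Videos that are static images with voice-over narration
# (string comparison keeps this robust to how pandas parses the True/False columns)
static_voice = df[
    (df["static_image"].astype(str) == "True") & (df["voice_over"].astype(str) == "True")
]
print(static_voice["video_id"].nunique(), "videos")
```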
VGGSounder provides a comprehensive benchmarking system to evaluate audio-visual foundation models across multiple modalities and metrics. The benchmark supports both discrete predictions and continuous logits-based evaluation.
Results are reported over the following modality subsets:
- `a`: Audio, i.e. samples with an audio component (A + AV)
- `v`: Visual, i.e. samples with a visual component (V + AV)
- `av`: Audio-visual, i.e. samples with both modalities (AV only)
- `a only`: Audio-only, i.e. pure audio samples (excludes AV samples)
- `v only`: Visual-only, i.e. pure visual samples (excludes AV samples)
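These subsets follow directly from the per-label modality tags. Below is a small sketch of that mapping for a single clip; it assumes `video_data.modalities` is aligned one-to-one with `video_data.labels` (as in the usage example above) and is not necessarily the exact subset construction used inside `vggsounder.benchmark`:

```python
# Sketch: group the labels of one clip into the evaluation subsets described above.
# Assumes `modalities` is aligned one-to-one with `labels`; not the reference implementation.
import vggsounder

labels = vggsounder.VGGSounder()
video = labels["--U7joUcTCo_000000"]

subsets = {"a": [], "v": [], "av": [], "a only": [], "v only": []}
for label, modality in zip(video.labels, video.modalities):
    if modality in ("A", "AV"):
        subsets["a"].append(label)       # audible component
    if modality in ("V", "AV"):
        subsets["v"].append(label)       # visible component
    if modality == "AV":
        subsets["av"].append(label)      # both modalities
    if modality == "A":
        subsets["a only"].append(label)  # purely audible
    if modality == "V":
        subsets["v only"].append(label)  # purely visible

print(subsets)
```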
The benchmark computes a comprehensive set of metrics:
- Top-k metrics: `hit_rate@k`, `f1@k`, `accuracy@k`, `precision@k`, `recall@k`, `jaccard@k` (for k = 1, 3, 5, 10)
- Aggregate metrics: `f1`, `f1_macro`, `accuracy`, `precision`, `recall`, `jaccard`, `hit_rate`
- AUC metrics: `auc_roc`, `auc_pr` (ROC-AUC and Precision-Recall AUC)
- Modality confusion: `mu` (measures when single modalities succeed where multimodal fails)
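The idea behind the modality-confusion metric `mu` can be illustrated with a short sketch: a sample counts as confused for a modality when that unimodal prediction hits at least one ground-truth label while the audio-visual prediction hits none. The snippet below only illustrates this definition and is not the reference implementation in `vggsounder.benchmark`:

```python
# Illustrative sketch of modality confusion; not the reference implementation.

def hits(predicted, ground_truth):
    """True if at least one predicted label matches the ground truth."""
    return len(set(predicted) & set(ground_truth)) > 0

def modality_confusion_rate(samples, modality):
    """Fraction of samples where the given unimodal prediction succeeds but AV fails.

    `samples` is assumed to be a list of dicts with keys
    "ground_truth", "pred_a", "pred_v", and "pred_av".
    """
    confused = sum(
        1
        for s in samples
        if hits(s[f"pred_{modality}"], s["ground_truth"])
        and not hits(s["pred_av"], s["ground_truth"])
    )
    return confused / max(len(samples), 1)

# Toy example based on the confused sample shown later in this README
example = [
    {
        "ground_truth": ["horse neighing", "male speech, man speaking"],
        "pred_a": ["male speech, man speaking"],
        "pred_v": [],
        "pred_av": ["horse clip-clop"],
    }
]
print(modality_confusion_rate(example, "a"))  # 1.0: audio hits, AV misses
```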
Model predictions should be saved as pickle files with the following structure:
```python
{
    "video_id": {
        "predictions": {                      # Optional: discrete predictions
            "a": ["label1", "label2", ...],   # Audio predictions
            "v": ["label1", "label3", ...],   # Visual predictions
            "av": ["label1", "label2", ...],  # Audio-visual predictions
        },
        "logits": {                           # Optional: continuous scores
            "a": [0.1, 0.8, 0.3, ...],        # Audio logits (310 classes)
            "v": [0.2, 0.1, 0.9, ...],        # Visual logits (310 classes)
            "av": [0.4, 0.6, 0.2, ...],       # Audio-visual logits (310 classes)
        },
    },
    # ... more video_ids
}
```

Note: Either `predictions` or `logits` (or both) should be provided. Logits enable more detailed top-k and AUC analysis.
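As a concrete, hypothetical example, the snippet below writes dummy predictions for a single clip in this format. The model name and predicted labels are placeholders; the resulting `.pkl` file goes into the directory passed to the benchmark as `models_path`, with the file name (without `.pkl`) apparently serving as the model name (cf. `model_name` and `display_names` below):

```python
# Hypothetical example of writing a predictions pickle in the expected format.
# The video ID is taken from the dataset; the predicted labels are dummy placeholders.
import pickle

results = {
    "--U7joUcTCo_000000": {
        "predictions": {
            "a": ["male speech, man speaking"],
            "v": ["people crowd"],
            "av": ["male speech, man speaking", "people crowd"],
        },
        # "logits" could additionally map "a"/"v"/"av" to 310-dimensional score lists.
    },
}

# Save as <model-name>.pkl inside the directory passed to benchmark(models_path=...)
with open("my-model.pkl", "wb") as f:
    pickle.dump(results, f)
```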
```python
from vggsounder.benchmark import benchmark
# Define model display names
display_names = {
"cav-mae": "CAV-MAE",
"deepavfusion": "DeepAVFusion",
"equiav": "Equi-AV",
"gemini-1.5-flash": "Gemini 1.5 Flash",
"gemini-1.5-pro": "Gemini 1.5 Pro"
}
# Specify metrics and modalities to evaluate
metrics = [
("accuracy", ["a", "v", "av"]),
("f1", ["a", "v", "av", "a only", "v only"]),
("hit_rate", ["a", "v", "av"]),
("mu", ["a", "v", "av"]) # Modality confusion
]
# Run benchmark
results_table = benchmark(
models_path="path/to/model/pickles",
display_names=display_names,
metrics=metrics
)
print(results_table)For a detailed example of how we generate the tables used in our paper, please see the example notebook.
VGGSounder provides a specialized function for analyzing modality confusion at the sample level, helping you understand why certain samples exhibit confusion between unimodal and multimodal predictions.
```python
from vggsounder.benchmark import analyze_modality_confusion_detailed
from vggsounder import VGGSounder
# Analyze modality confusion for a specific model
confusion_analysis = analyze_modality_confusion_detailed(
models_path="path/to/model/pickles",
model_name="gemini-1.5-flash", # Model name without .pkl extension
vggsounder=VGGSounder(background_music=None, voice_over=None, static_image=None)
)
print(f"Found {len(confusion_analysis)} samples with modality confusion")
# Filter by specific confusion types
audio_confused = confusion_analysis[confusion_analysis['confused_a'] == True]
visual_confused = confusion_analysis[confusion_analysis['confused_v'] == True]
combined_confused = confusion_analysis[confusion_analysis['confused_av'] == True]
print(f"Audio confusion: {len(audio_confused)} samples")
print(f"Visual confusion: {len(visual_confused)} samples")
print(f"Combined confusion: {len(combined_confused)} samples")
# Examine specific confused samples
display_cols = ['id', 'ground_truth', 'pred_a', 'pred_v', 'pred_av', 'confused_a', 'confused_v', 'confused_av']
print("\nFirst 3 audio-confused samples:")
print(audio_confused[display_cols].head(3))
# Example: Find samples that are audio-confused but not visual-confused
audio_only_confused = confusion_analysis[
    (confusion_analysis['confused_a'] == True) &
    (confusion_analysis['confused_v'] == False)
]
print(f"Audio-only confusion: {len(audio_only_confused)} samples")Example Output:
Total samples analyzed: 2625
Audio-confused samples: 2228
Visual-confused samples: 612
Combined-confused samples: 215
First 3 audio-confused samples:
id ground_truth pred_a pred_v pred_av confused_a confused_v confused_av
0 -0jeONf82dE_000021 [horse neighing, male speech, man speaking] [male speech, man speaking] [] [horse clip-clop] True False False
1 -3Kv4fdm7Uk_000030 [plastic bottle crushing, playing flute, playing sitar] [male speech, man speaking, playing flute] [playing steelpan] [playing steelpan] True False False
2 -3RH8_aeZkk_000105 [male speech, man speaking] [male speech, man speaking] [] [] True False False
Example sample details:
id: -0jeONf82dE_000021
ground_truth: ['horse neighing', 'male speech, man speaking']
pred_a: ['male speech, man speaking']
pred_v: []
pred_av: ['horse clip-clop']
confused_a: True
confused_v: False
confused_av: False
```
Output DataFrame Columns:
- `id`: Sample ID
- `ground_truth`: Ground-truth labels
- `pred_av`, `pred_a`, `pred_v`: Predictions for each modality
- `confused_a`: Boolean, audio confusion (audio hits when AV fails)
- `confused_v`: Boolean, visual confusion (visual hits when AV fails)
- `confused_av`: Boolean, combined confusion (both A and V hit when AV fails)
This analysis helps identify patterns in model failures and understand why certain samples cause modality confusion, enabling qualitative analysis of multimodal integration issues.
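One simple follow-up analysis is to see which ground-truth labels dominate the confused samples. The sketch below re-runs the detailed analysis and aggregates the audio-confused rows; it assumes the `ground_truth` column holds a list of labels per sample, as in the example output above:

```python
# Sketch: which ground-truth labels appear most often among audio-confused samples?
from vggsounder import VGGSounder
from vggsounder.benchmark import analyze_modality_confusion_detailed

confusion_analysis = analyze_modality_confusion_detailed(
    models_path="path/to/model/pickles",
    model_name="gemini-1.5-flash",
    vggsounder=VGGSounder(background_music=None, voice_over=None, static_image=None),
)

label_counts = (
    confusion_analysis[confusion_analysis["confused_a"]]["ground_truth"]
    .explode()        # one row per (sample, ground-truth label) pair
    .value_counts()
    .head(10)
)
print("Most frequent ground-truth labels among audio-confused samples:")
print(label_counts)
```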
If you find VGGSounder useful for your research and applications, please consider citing us using this BibTeX:
```bibtex
@inproceedings{zverevwiedemer2025vggsounder,
  author    = {Daniil Zverev and Thaddäus Wiedemer and Ameya Prabhu and Matthias Bethge and Wieland Brendel and A. Sophia Koepke},
  title     = {VGGSounder: Audio-Visual Evaluations for Foundation Models},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year      = {2025}
}
```

The authors would like to thank Felix Förster, Sayak Mallick, and Prasanna Mayilvahananan for their help with data annotation, as well as Thomas Klein and Shyamgopal Karthik for their help in setting up MTurk. They also thank numerous MTurk workers for labelling. This work was in part supported by the BMBF (FKZ: 01IS24060, 01I524085B), the DFG (SFB 1233, TP A1, project number: 276693517), and the Open Philanthropy Foundation funded by the Good Ventures Foundation. The authors thank the IMPRS-IS for supporting TW.
This project is released under the Apache 2.0 license as found in the LICENSE file. Please get in touch with us if you find any potential violations.