Virtual Environment for Rendering of Speech Emissions

VERSE is a project focused on human voice perception. Its main goal is to study the advantages of a microphone array versus binaural audio signals in the context of real embedded devices, including machine learning algorithms (in particular neural networks) for signal processing, with a particular focus on human voices.

VERSE contains a semi-synthetic dataset of voice recordings and real environment characterization measurements. It includes a complete software framework able to generate synthetic audio data from measurements of the real setup, keeping the result acoustically as close as possible to the equivalent "direct recording".

VERSE is released as an open source repository serving both purposes: providing a flexible dataset and providing a tool to customize the dataset based on the specific subject under study.

Although VERSE was originally designed to study the performance of a microphone array placed on a pair of glasses, and to compare the results with binaural hearing, the flexibility of the framework allows the user to generate sequences using any microphone geometry. This extends the usability of the VERSE toolchain to use cases beyond multimedia glasses, such as robotic heads for industrial applications or helmets for safety awareness, allowing the optimization of microphone placement to obtain the best results for the subject under study.

VERSE is based on the abstraction of the main components of an audio scene: voice sound sources from human speakers, one listener (the "head" of the device under test), and the reverberation generated by the environment itself, i.e. the room hosting speakers and listener. Specific to the definition of a scene is the concept of motion: the scene defines how sound sources are placed around the listener and how they move in space.


The Dataset

This repo contains a set of resources and configuration files (.yaml) to generate the dataset offline. Only individual resources are provided, not full audio scenes; these can be rendered as virtual audio scenes using the provided tools. Users can easily add resources (voices, heads, rooms) and define scenes to create new virtual environments and render audio for them.

Dataset configurations are provided as resources too: these are recipes that define the mix of basic resources composing the final output. A few examples are already available in this repository (t.b.d.).

System Configuration

The code and framework in this repo have been developed and tested on Linux, specifically Ubuntu 20.04. There is no "native" support for Windows (t.b.d. Docker image). You will need plenty of disk space, since even the small datasets can easily generate 100 GB due to all the permutations of parameters. The code has been written to leverage multi-processing as much as possible, so a modern CPU with at least 8-16 cores is preferred. The code has also been tested on AMD Threadripper 3960X/128 GB and 3990X/256 GB machines.

Installation

After cloning this repository you need to set up your environment (only once) to be able to use the VERSE toolchain. The instructions on how to create your virtual environment and fetch resources are available on this page: First Setup

Folder Structure

The structure of the repository is the following:

+-- datasets
+-- resources
|   +-- ds_recipes
|   +-- heads
|   +-- paths
|   +-- rooms
|   +-- scenes
|   +-- voices
|
+-- src
|   +-- dataset_render.py
|   +-- scene_render.py
+-- tools
    +-- sound_spatializer

DATASETS is the folder that will contain the final audio rendering.

RESOURCES is the folder where all the main components are placed, each one with a repeated folder structure and YAML descriptors, using the following schema:

+-- resources
    +--[RESOURCE_TYPE]
       +--[RESOURCE_NAME]
          +--info.yaml
          +--fetch_files.sh
          +--info
             +--[FILENAME].yaml
          +--files
             +--[AUDIO_FILE].ext

RESOURCE_TYPE is one of voices, heads, rooms, paths, scenes. Inside each RESOURCE_TYPE folder there is one sub-folder for each data provider. This allows pulling data of the same type from different locations, even outside of the VERSE repository, as long as the following information is provided. Each resource is defined by a "fetch_files.sh" script, which retrieves binary files from external repositories, and a mandatory "info.yaml", which describes the resource itself.

Binary files are placed in the "files" sub-folder, and for each binary file there is a corresponding YAML descriptor in the "info" sub-folder. The purpose of the main info.yaml file is to provide a human-readable description of the content of this specific resource, while the purpose of the individual YAML files placed in the info sub-folder is to provide a human-readable description of each binary file composing the resource.

This folder structure is repeated for each resource type, providing a uniform setup to parse available data either manually by a human or automatically by code.

Audio Rendering

Once the environment is set up and all the resources have been prepared, you can render audio using two scripts: "render_scene" and "render_dataset".

These scripts are the core of the VERSE toolchain: the "render_dataset" script reads a recipe (under "verse/resources/ds_recipes"), prepares a set of scene files and launches "render_scene" multiple times to create the audio files in a specific subfolder of "verse/datasets". You can render multiple datasets; they will have separate folders (as long as there is enough space on the disk).

Besides the rendering scripts, a few other tools are available under "verse/tools/bin":

├── display_path.py
├── display_scene.py
├── display_sofa.py
├── parse_sofa.py
├── play_scene.py
└── sspat (sound spatializer)

These commands are explained in the sections below (Exploring results / Exploring resources).

A quick test

To verify that your setup is working correctly, you can render the "simple_example" dataset. This is just a test recipe that creates a handful of files using the VERSE toolchain.

cd verse/src
./render_dataset.py -i ../resources/ds_recipes/simple_example/info/simple_example.yaml -v

Use the "-v" option to enable verbose output. If your CPU has many cores, you can use the "-c" option to enable more parallel rendering, as shown below.
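
For example, on a many-core machine one might run something like the command below; the numeric argument to "-c" is an assumption here, so check the script's usage output for the real option syntax.

./render_dataset.py -i ../resources/ds_recipes/simple_example/info/simple_example.yaml -v -c 16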

This will create a subfolder with a few files under the "verse/datasets" folder:

verse/datasets
├── readme.txt
└── simple_example
    └── train
        ├── 000000_static_singlevoice_0_0_0
        │   ├── 000000_static_singlevoice_0_0_0.yaml
        │   ├── static_singlevoice.mkv
        │   └── static_singlevoice_mkv.yaml
        ├── 000001_static_singlevoice_0_0_1
        │   ├── 000001_static_singlevoice_0_0_1.yaml
        │   ├── static_singlevoice.mkv
        │   └── static_singlevoice_mkv.yaml
        [ ... ]
        ├── 001300_dynamic_multivoice_2_0_1
        │   ├── 001300_dynamic_multivoice_2_0_1.yaml
        │   ├── dynamic_multivoice.mkv
        │   └── dynamic_multivoice_mkv.yaml
        └── 001301_dynamic_multivoice_0_1_1
            ├── 001301_dynamic_multivoice_0_1_1.yaml
            ├── dynamic_multivoice.mkv
            └── dynamic_multivoice_mkv.yaml

The Matroska (.mkv) file contains the original human voices and the rendered virtual spatial audio. Two .yaml files are available: one describing the .mkv content (track by track) and one describing the audio scene that was used to render the final audio.

All these artifacts can be explored as explained in the sections below.

Exploring results

Each of the artifacts available in a dataset subfolder has a specific purpose and can be explored with dedicated tools. The tools are placed in the "verse/tools/bin" folder of the repository.

Play audio

The full audio scene is contained in a Matroska (.mkv) file. This container format was selected because it is open source, widely adopted, and allows multiple audio tracks (.wav) to be encapsulated in one file. Having one file with both the "raw" human voices and the virtual audio is beneficial for machine learning applications.

Using the "play_scene.py" tool (which leverages ffmpeg/ffplay) you can list or play each of the tracks contained in a file. The Matroska file has metadata identifying each track, which you can list using the "-l" option.

Focusing on the last case of the "simple_example" dataset we have:

[VERSE]/tools/bin/play_scene.py -i ./dynamic_multivoice.mkv -l
0 : 000056_gentlemenpreferblondes.wav
1 : 000027_blackbuccaneer.wav
2 : 000071_gianburrasca.wav
3 : dynamic_multivoice_binaural_000.wav
4 : dynamic_multivoice_array_six_front_001.wav
5 : dynamic_multivoice_array_six_middle_002.wav
6 : dynamic_multivoice_array_six_rear_003.wav

where [VERSE] is the folder location of your VERSE repository. Using the same tool you can play a single track, each track being a stereo pair (a mic tuple). To play the binaural (ear-level) mics simply use:

[VERSE]/tools/bin/play_scene.py -i ./dynamic_multivoice.mkv -t 3

You can also use Audacity to visualize the audio in detail, as explained here: Audacity_HowTo

Display scene

Each virtual audio file is generated from a scene file (.yaml). An audio scene contains all the details about source positioning in space around the listener (the listener is positioned at the origin of the coordinate system). The main purpose of the audio scene syntax is to define each component of the scene: the voices and their positions, the listener head and mic array, and the room reverberation. The scene itself is a resource of the VERSE dataset and can be explored as detailed here: Scene_HowTo

As a quick reference you can use the "display_scene" tool to visualize source positioning and/or movement in space. Focusing on the same example, we can show the scene with the command:

[VERSE]/tools/bin/display_scene.py -i ./001301_dynamic_multivoice_0_1_1.yaml

which will generate a plot similar to the one in the figure. From the image we can see that this scene is composed of three human voices. One is static, placed at 1 metre distance in front of the listener (blue, it does not move during the audio rendering). One is dynamic and goes from the right side (green dot) to the left side (red dot) while moving on a semi-circular path behind the listener. The third source is also dynamic, but moves from right to left on a linear path in front of the listener (green path).


More datasets

Besides the simple example, two more datasets are already defined in this repository:

  • unimore_tiny: a set of ~1000 audio scenes with static and dynamic movements of human voices (up to three voices) in a single room. The final rendering size is about ~54 GB.
  • unimore_small: a set of ~12000 audio scenes, similar to the unimore_tiny set but with more voices. More disk space is required.

To render the datasets simply use the same syntax as before:

cd verse/src
./render_dataset.py -i ../resources/ds_recipes/unimore_tiny/info/unimore_tiny_recipe.yaml

or

./render_dataset.py -i ../resources/ds_recipes/unimore_small/info/unimore_small_recipe.yaml

Resource definition

For "resources" we refer to the core components of an audio scene: human voices, listener head (and receivers location), room (for reverberation) and motion_path to define source motion during the audio rendering.

Each resource has a specific binary format depending on its purpose. Resources can also be retrieved from different (external) datasets to expand the possibilities of VERSE. For this reason each resource has an abstraction layer which leverages YAML syntax to define the content of a resource folder.

Starting from [VERSE]/resources we see a folder for each type: voices, heads, paths, rooms, scenes.

Inside each resource type folder there is a list of different subsets of that specific resource, each subset being a specific "selection" made by the user or by someone on behalf of the user. For example, selecting "voices" we have two subsets:

├── librivox_tiny
└── unimore

The first subset is related to LibriVox, providing a small selection of human voices of different genders/languages. The second subset is an (even smaller) selection of audio files for testing purposes.

No matter what the source of a resource subset is, there will always be the same folder structure under [RESOURCE_TYPE]/[RESOURCE_NAME], as shown below:

├── fetch_files.sh
├── files
├── info
└── info.yaml

The top-level info.yaml file is a generic descriptor for the resource, listing its type, ownership, copyright and, notably, the amount of disk space the resource will occupy once all its files are present. For example, in the case of librivox_tiny we have:

# VERSE resource info                  
syntax:
  name: resource_info
  version:
    major: 0
    minor: 1
    revision: 0

title: librivox_tiny
type: dataset
content: audio
description: a curated small selection of audio recordings for human voice (single person)

size_bytes: 1.3G

source: https://librivox.org/
source_original:

fetch_script: fetch_files.sh

copyright: public_domain
license: https://en.wikipedia.org/wiki/Public_domain
details: https://wiki.librivox.org/index.php?title=Copyright_and_Public_Domain

The most important folders are "files" and "info".

The first will contain all the "raw" data and is normally populated by the "fetch_files.sh" script present in the same resource folder. The user should place in fetch_files.sh all the instructions and access codes needed to pull external resources, such as .wav audio files, that are normally not stored inside a GitHub repository. A minimal sketch is shown below.
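
A minimal, hypothetical sketch of such a script; the URLs and filenames are placeholders, and the scripts already shipped with the repository are the reference for the real download logic.

#!/usr/bin/env bash
# Hypothetical fetch_files.sh sketch: download raw audio into the "files"
# sub-folder next to this script. URLs and filenames are placeholders.
set -euo pipefail

cd "$(dirname "$0")"
mkdir -p files

# replace with the real source URLs for your resource
wget -c -O files/000001_example_voice.wav "https://example.org/recordings/example_voice.wav"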

For each one of the raw resource files there will be a corresponding "info" file, again leveraging YAML syntax, to describe that file.

NOTE: to distinguish the content of each YAML file, the first part is always a "syntax" field, exposing the structure and syntax of the rest of the file. The "syntax/name" differs depending on the resource type: for example, "resource_info" identifies a generic resource descriptor file, "voice_file" indicates a specific voice file descriptor, "audio_rendering_scene" indicates a scene file, etc. An illustrative sketch follows.
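
As an illustration only, a per-file descriptor in the "info" sub-folder could look roughly like the sketch below. The "syntax" block mirrors the one shown above; every field after it is hypothetical, and the authoritative field list is given in the dedicated howto.

# VERSE voice file info (illustrative sketch)
syntax:
  name: voice_file
  version:
    major: 0
    minor: 1
    revision: 0

# the fields below are hypothetical placeholders, not the documented syntax
filename: 000056_gentlemenpreferblondes.wav
content: audio
description: single-speaker human voice recording (details are placeholders)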

The syntax for each resource info file is defined and detailed in a dedicated howto.

Adding new resources to the VERSE repository requires creating a subfolder (subset) inside the appropriate location and writing a correct info file following the documented syntax. This allows the "render_dataset" and "render_scene" scripts to automatically retrieve the sources for the final audio rendering. A sketch of the required layout is shown below.
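
For example, scaffolding a new (hypothetical) voice subset called "my_voices" means reproducing the folder layout described above; the name is a placeholder, and the info.yaml plus per-file descriptors still have to be written by hand following the documented syntax:

cd [VERSE]/resources/voices
mkdir -p my_voices/files my_voices/info
touch my_voices/info.yaml          # generic resource descriptor (syntax/name: resource_info)
touch my_voices/fetch_files.sh     # script that populates my_voices/files
chmod +x my_voices/fetch_files.sh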

Exploring resources

To simplify development and verification of data, it is useful to graphically inspect resources as defined by their descriptors. This allows the user to make sure a resource is fully compliant with the VERSE requirements.

As mentioned before, "[VERSE]/tools/bin" provides a set of scripts to inspect a specific resource type. Each tool is documented on a separate page:

  • display_scene: to graphically show an audio scene structure, see display_scene
  • display_path: to graphically show a motion path, see display_path
  • display_sofa: to graphically show SOFA file (Spatially Oriented Format for Acoustics), see display_sofa
  • parse_sofa: to inspect the metadata of a SOFA file, see parse_sofa (a usage sketch follows this list)
  • play_scene: to playback the audio stream generated by VERSE, see play_scene
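
As a quick illustration of the SOFA tools, the invocation below assumes the same "-i" input convention used by the other scripts in this README; the flag and the file path are assumptions, so refer to the parse_sofa page for the actual options.

[VERSE]/tools/bin/parse_sofa.py -i path/to/some_head.sofa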

Dataset definition

The rendering of a dataset is done once (offline) and is based on a "recipe". Dataset recipes (ds_recipes) are themselves a resource. The user can define different recipes to mix and match scenes, voices and listeners to create their own specific dataset.

The definition of ds_recipes is specified in detail here: dataset_syntax

The ds_recipe is a powerful tool: it is the aggregator of all the components forming a synthetic audio scene, with the capability of mixing and matching individual resources of different types to generate the final dataset collection of (virtual) audio recordings. A purely illustrative skeleton is sketched below.
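
Purely as an illustration of the idea, a recipe skeleton might look like the sketch below. Every field after the "syntax" block is hypothetical; the authoritative recipe format is documented in dataset_syntax, and the shipped recipes (e.g. simple_example.yaml) are the real reference.

# hypothetical ds_recipe skeleton (illustrative only, field names are not the documented syntax)
syntax:
  name: dataset_recipe
  version:
    major: 0
    minor: 1
    revision: 0

title: my_dataset
# hypothetical lists of resources to mix and match into rendered scenes
scenes:
  - resources/scenes/[RESOURCE_NAME]
voices:
  - resources/voices/librivox_tiny
heads:
  - resources/heads/[RESOURCE_NAME]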
