92 commits
4d287d5
yodas2 config is added
ssh-meister May 5, 2025
dc61432
ListToEntries is added
ssh-meister May 5, 2025
6921035
ListToEntries is added
ssh-meister May 5, 2025
ea5f874
ListRepoFiles and SnapshotDownload are added
ssh-meister May 5, 2025
3fc4895
ListRepoFiles and SnapshotDownload docs are added
ssh-meister May 5, 2025
8fe7d92
Merge branch 'yodas2_pr' of https://github.com/NVIDIA/NeMo-speech-dat…
ssh-meister May 5, 2025
0c093d6
removed build
ssh-meister May 5, 2025
29daaa3
init updated
ssh-meister May 5, 2025
4eb64ef
ListYodas2Data
ssh-meister May 5, 2025
8938345
ListYodas2Data upd
ssh-meister May 5, 2025
59a2d3f
LambdaExpression
ssh-meister May 5, 2025
c632344
DownloadYodas2Data
ssh-meister May 5, 2025
cba6b06
init for yodas2 processors
ssh-meister May 5, 2025
613a486
Fixed docs
ssh-meister May 5, 2025
8220deb
Fixed docs
ssh-meister May 5, 2025
9e71140
ExtractTar
ssh-meister May 5, 2025
813f670
CreateInitialManifestYodas2
ssh-meister May 5, 2025
8ea0094
Audio conversion processors moved to convert_audio.py
ssh-meister May 5, 2025
24f8cc0
RemoveFiles
ssh-meister May 5, 2025
e73bd83
ASR inference refactoring
ssh-meister May 5, 2025
5a293c6
FasterWhisperInference
ssh-meister May 5, 2025
5ab8afc
Doc fix
ssh-meister May 5, 2025
312ec51
Fix typo
ssh-meister May 6, 2025
c725a7c
DropSpecifiedFields
ssh-meister May 6, 2025
2327a18
WhisperHallucinationFeatures
ssh-meister May 6, 2025
29cf370
vLLMInference
ssh-meister May 6, 2025
71ebb33
CleanQwenGeneration
ssh-meister May 6, 2025
1525c0c
Updated SubRegex
ssh-meister May 6, 2025
994a5f2
CountNumWords updated
ssh-meister May 6, 2025
3dbedba
FilterWithCharacterHistograms
ssh-meister May 6, 2025
f2f0822
FilterWithCharacterHistograms upd
ssh-meister May 6, 2025
b9f28da
Moved nemo PCInference to inference processors
ssh-meister May 6, 2025
d30886c
FastTextLangIdClassifier
ssh-meister May 6, 2025
e25aeb7
FastTextLangIdClassifier
ssh-meister May 6, 2025
6a2365d
CometoidWMTQualityEstimation
ssh-meister May 6, 2025
701226a
WhisperHallucinationFeatures renamed to DetectWhisperHallucinationFea…
ssh-meister May 6, 2025
8305f90
FasterWhisperInference docs updated
ssh-meister May 6, 2025
2538d3f
Requirements updated
ssh-meister May 6, 2025
4d51b97
readme moved
ssh-meister May 6, 2025
cc910d9
Added prompts and subregex params
ssh-meister May 6, 2025
cea4d62
Common phrases are added
ssh-meister May 6, 2025
6ed79e9
Skipping files from partial dir
ssh-meister May 6, 2025
b997023
DropHighLowDuration processors added to config
ssh-meister May 6, 2025
da56782
DetectWhisperHallucinationFeatures updated
ssh-meister May 6, 2025
a10d8de
DetectWhisperHallucinationFeatures updated
ssh-meister May 6, 2025
24d3ab5
Separate subregex params are added
ssh-meister May 6, 2025
26673bd
Added use_dask: False and use_regex: common to config
ssh-meister May 6, 2025
ec666cc
Fixed typo
ssh-meister May 6, 2025
c48d1ac
yodas2.yaml updated
ssh-meister May 6, 2025
682f0f0
ConvertToTarredAudioDataset
ssh-meister May 7, 2025
81730e8
Added lazy imports
ssh-meister May 7, 2025
7d599f2
Fixed docs
ssh-meister May 7, 2025
795a841
Added partials dir skipping for gen_docs.py
ssh-meister May 7, 2025
7ce03f0
Added missing termplotlib to requirements
ssh-meister May 7, 2025
c33517b
removed termplotlib
ssh-meister May 7, 2025
32cd875
Added ConvertToTarredAudioDataset to yodas2.yaml
ssh-meister May 7, 2025
6efe46a
Removed data specific tests from common tests
ssh-meister May 7, 2025
dbc6f54
FasterWhisperInference fix
ssh-meister May 7, 2025
64105f9
ASRTarredDatasetBuilder fix
ssh-meister May 7, 2025
224d453
Test Dockerfile
ssh-meister May 7, 2025
799753b
Test Dockerfile
ssh-meister May 7, 2025
36bec4f
Test Dockerfile
ssh-meister May 7, 2025
01b80a8
Test Dockerfile
ssh-meister May 7, 2025
669d22b
Removed extra line from Dockerfile
ssh-meister May 7, 2025
ec8a0d4
Added Dockerfile and workflow for granary
ssh-meister May 7, 2025
ab20a8b
test_e2e_datasets.yml to workflow
ssh-meister May 7, 2025
5ae3072
Prevented requirements collection in subfolders
ssh-meister May 7, 2025
dc0391e
Prevent no space left during building
ssh-meister May 7, 2025
8b80117
Lightweight granary requirements
ssh-meister May 7, 2025
f3210fb
Prevent no space left during building
ssh-meister May 7, 2025
f12f81f
E2E tests check
ssh-meister May 11, 2025
2c41c13
Branch update
ssh-meister May 11, 2025
c9e8d43
requirements modification
ssh-meister May 11, 2025
8f6e1cf
Fix device setup for FasterWhisperInference
ssh-meister May 12, 2025
45fdc8a
added prepare_yodas2_data.py
ssh-meister May 13, 2025
214a1be
added text preprocessing in FastTextLangIdClassifier
ssh-meister May 13, 2025
98ab125
prepare_yodas2_data.py updated
ssh-meister May 13, 2025
6153185
added wget to tests/prepare_test_data/prepare_yodas2_data.py
ssh-meister May 13, 2025
223bd5d
Removed already imported module
ssh-meister May 19, 2025
54a0af5
Added missing param in FasterWhisperInference
ssh-meister May 19, 2025
c9e72fd
Added HfHubDownloadYodas2Data, HfHubDownload, GetGranarysYodas2
ssh-meister May 19, 2025
7a65dbe
Simplified prepare_yodas2_data.py
ssh-meister May 19, 2025
33ef48a
New structure of files
ssh-meister May 20, 2025
71d84cb
Fix typo in ListYodas2Data
ssh-meister May 20, 2025
0bda056
Added JoinManifests
ssh-meister May 20, 2025
674f1fd
mkdir in SnapshotDownload process
ssh-meister May 20, 2025
3e12532
Yodas from Granary
ssh-meister May 20, 2025
b7ee0f9
Granary cfg moved
ssh-meister May 20, 2025
7421bb0
license
ssh-meister May 20, 2025
ef4515b
Removed extra line
ssh-meister May 20, 2025
3b8aff7
Added constant fields
ssh-meister May 21, 2025
3dc4971
Update README.md
ssh-meister May 21, 2025
4 changes: 3 additions & 1 deletion .github/workflows/docker_pull.yml
@@ -21,7 +21,9 @@ jobs:

       - name: Build Docker image
         run: |
-          docker build -t sdp-test-image:${{ github.sha }} -f docker/Dockerfile .
+          docker build -t sdp-test-image:${{ github.sha }} \
+            -f docker/Dockerfile \
+            --build-arg SOURCE=./ .
 
       - name: Run test tests
         run: |
43 changes: 43 additions & 0 deletions .github/workflows/test_e2e_datasets.yml
@@ -0,0 +1,43 @@
name: E2E Dataset Pipelines Docker Build and Test

on:
  pull_request:
    branches: [ "main" ]

  # Allows you to run this workflow manually from the Actions tab
  workflow_dispatch:

permissions:
  contents: read

jobs:
  Granary:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3
      - name: Set up Python 3.10
        uses: actions/setup-python@v3
        with:
          python-version: "3.10"
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip && \
          find requirements/ -maxdepth 1 -name "*.txt" -exec pip install -r {} \; && \
          pip install -r requirements/datasets/granary.txt && \
          python -m pip cache purge

      - name: Run Yodas2 E2E test
        # in the future this might fail if some runtime tests require nemo
        # in that case this test will need to be changed
        run: |
          python -m pytest tests/test_utils.py -v

      - name: Get test results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: test-results
          path: |
            pytest.xml
            coverage.xml
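For local debugging, the Granary E2E job above can be approximated outside of CI. The requirements layout and the pytest target are taken directly from this workflow; the dedicated virtual environment and the `python3.10` interpreter name are assumptions of this sketch, not part of the PR.

```bash
# Sketch: approximate the Granary E2E job locally (assumes python3.10 is on PATH)
python3.10 -m venv .venv-sdp && source .venv-sdp/bin/activate

# Install the base requirements plus the Granary dataset extras, as the workflow does
python -m pip install --upgrade pip
find requirements/ -maxdepth 1 -name "*.txt" -exec pip install -r {} \;
pip install -r requirements/datasets/granary.txt

# Run the same test module the workflow invokes
python -m pytest tests/test_utils.py -v
```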
1 change: 1 addition & 0 deletions .gitignore
@@ -2,6 +2,7 @@
test_data
workdir
lightning_logs
build

# unit test / coverage reports
.hypothesis
6 changes: 4 additions & 2 deletions dataset_configs/arabic/masc/config_filter_noisy_train.yaml
@@ -282,10 +282,12 @@ processors:
     output_manifest_file: ${manifest_dir}/manifest21.json
 
   # 22 keeping only low WER and CER samples
-  - _target_: sdp.processors.ApplyInnerJoin
+  - _target_: sdp.processors.JoinManifests
     left_manifest_file: ${manifest_dir}/manifest21.json
     right_manifest_file: ${manifest_dir}/manifest19.json
-    column_id: audio_filepath
+    merge_params:
+      'on': audio_filepath
+      how: inner
     output_manifest_file: ${manifest_dir}/manifest22.json
 
   # 23 changing paths to relative
5 changes: 0 additions & 5 deletions dataset_configs/granary/readme.md

This file was deleted.

18 changes: 18 additions & 0 deletions dataset_configs/multilingual/granary/README.md
@@ -0,0 +1,18 @@
# README

This folder is designated for Granary speech data processing; configuration files will be added soon. It is associated with a forthcoming paper, which will detail the work done within this project.

Note: This folder is a work in progress.

# Granary

## Yodas2

### Convert to tarred audio dataset
Suggested values for parameters like `num_shards` and `buckets_num` depend on the selected `source_lang` and whether `en_translation` is enabled. These values are provided below to help efficiently prepare a ready-to-train tarred audio dataset.

| `source_lang` | `bg` | `bg` | `cs` | `cs` | `da` | `da` | `de` | `de` | `el` | `el` | `en` | `es` | `es` | `et` | `et` | `fi` | `fi` | `fr` | `fr` | `hr` | `hr` | `hu` | `hu` | `it` | `it` | `lt` | `lt` | `lv` | `lv` | `nl` | `nl` | `pl` | `pl` | `pt` | `pt` | `ro` | `ro` | `ru` | `ru` | `sk` | `sk` | `sv` | `sv` | `uk` | `uk` |
|------------------|:-----:|:----:|:-----:|:----:|:-----:|:----:|:-----:|:----:|:-----:|:----:|:-----:|:-----:|:----:|:-----:|:----:|:-----:|:----:|:-----:|:----:|:-----:|:----:|:-----:|:----:|:-----:|:----:|:-----:|:----:|:-----:|:----:|:-----:|:----:|:-----:|:----:|:-----:|:----:|:-----:|:----:|:-----:|:----:|:-----:|:----:|:-----:|:----:|:-----:|:----:|
| `en_translation` | `False` | `True` | `False` | `True` | `False` | `True` | `False` | `True` | `False` | `True` | `False` | `False` | `True` | `False` | `True` | `False` | `True` | `False` | `True` | `False` | `True` | `False` | `True` | `False` | `True` | `False` | `True` | `False` | `True` | `False` | `True` | `False` | `True` | `False` | `True` | `False` | `True` | `False` | `True` | `False` | `True` | `False` | `True` | `False` | `True` |
| `num_shards` | 16 | 16 | 32 | 32 | 16 | 16 | 4096 | 1024 | 16 | 16 | 8192 | 8192 | 1024 | 16 | 16 | 64 | 32 | 4096 | 1024 | 16 | 16 | 64 | 32 | 1024 | 1024 | 16 | 16 | 16 | 16 | 1024 | 512 | 256 | 256 | 4096 | 4096 | 16 | 16 | 8192 | 1024 | 16 | 16 | 64 | 32 | 128 | 128 |
| `buckets_num` | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
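As an illustration, the table values can be passed when launching the pipeline described in `yodas2.yaml`. The config keys below (`workspace_dir`, `params.source_lang`, `params.en_translation`, `params.convert_to_audio_tarred_dataset.num_shards`, `params.convert_to_audio_tarred_dataset.buckets_num`) come from that config; expressing them as dotted command-line overrides assumes the Hydra-style override syntax of SDP's `main.py`, and the workspace path is only a placeholder.

```bash
# Sketch (assumes Hydra-style overrides are accepted by main.py):
# German with English translations, using the suggested num_shards=1024 and buckets_num=1.
python main.py \
  --config-path=dataset_configs/multilingual/granary/ \
  --config-name=yodas2.yaml \
  workspace_dir=/data/granary_yodas2 \
  params.source_lang=de \
  params.en_translation=True \
  params.convert_to_audio_tarred_dataset.num_shards=1024 \
  params.convert_to_audio_tarred_dataset.buckets_num=1
```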
236 changes: 236 additions & 0 deletions dataset_configs/multilingual/granary/yodas2.yaml
@@ -0,0 +1,236 @@
documentation: |
  YODAS2 Data Processing Pipeline
  ===============================

  This pipeline processes the YODAS2 subset of the Granary dataset, which consists of long-form multilingual YouTube speech data
  with timestamps and transcriptions. It downloads, normalizes, filters, converts, and packages the data into a tarred audio dataset
  format suitable for training speech models with NeMo.

  Overview
  --------
  The pipeline automatically downloads all required metadata and audio data, applies text normalization and filtering, converts audio
  to the standard format, and generates a final Granary dataset in *tarred audio format*.

  The pipeline is designed to process data for a specific source language, which must be one of the following supported languages:

  ::

    "bg", "cs", "da", "de", "el", "en", "es", "et", "fi", "fr",
    "hr", "hu", "it", "lt", "lv", "nl", "pl", "pt", "ro", "ru", "sk", "sv", "uk"

  Configuration Parameters
  ------------------------
  - **use_dask**: Whether to use Dask for multiprocessing (recommended: False)
  - **params**:
    - **source_lang**: Language code (e.g., "bg" for Bulgarian)
    - **use_regex**: Regex rules to apply. Usually set to the same value as `source_lang`, or to `"common"` to apply a universal normalization pattern across 25 languages.
    - **en_translation**: Whether to include English translations (default: True)
    - **convert_to_audio_tarred_dataset**:
      - **should_run**: Whether to convert to tarred audio format
      - **num_shards**: Number of tar shards
      - **buckets_num**: Number of buckets for duration grouping
      - **min_audio_duration** / **max_audio_duration**: Duration filters (in seconds)
    - **save_disk_space**: Whether to delete intermediate files (default: True)
    - **use_snapshot_download**: Whether to use snapshot_download from Hugging Face (default: False)

  Output
  ------
  After execution, the pipeline produces:

  - A **tarred audio dataset**, which contains:
    - Converted audio files in 16 kHz mono WAV format
    - Corresponding manifests with cleaned and normalized transcripts
    - Optionally, English translations

  Running the Pipeline
  --------------------
  Run this command to launch the pipeline:

  .. code-block:: bash

    python main.py \
      --config-path=dataset_configs/multilingual/granary/ \
      --config-name=yodas2.yaml

  References
  ----------
  - YODAS2 on Hugging Face: https://huggingface.co/datasets/espnet/yodas2

  Summary
  -------
  This pipeline prepares filtered, normalized, and language-specific audio data in a format ready for training NeMo-compatible ASR models.

use_dask: False # Whether to use Dask for multiprocessing. False = use built-in processing (recommended).

params:
  source_lang: ?? # Set the language to process (e.g., "bg")
  use_regex: ${.source_lang} # Regex config for text normalization. Usually same as the language, or "common" to apply a universal regex for 25 languages.
  en_translation: True # If True, also download English translations (if available).
  convert_to_audio_tarred_dataset:
    should_run: True # If True, the final tarred dataset will be created.
    num_shards: ?? # Number of tar files to split the dataset into.
    buckets_num: ?? # Number of duration buckets (used for balancing durations across shards).
    min_audio_duration: 0.1 # Exclude files shorter than 0.1 seconds.
    max_audio_duration: 40.0 # Exclude files longer than 40 seconds.
  save_disk_space: True # If True, intermediate audio files will be deleted.
  use_snapshot_download: False # If True, use snapshot_download instead of Hugging Face Hub APIs.

processors_to_run: "all" # Run all processors in sequence.

workspace_dir: ?? # Required: output directory to save all intermediate and final files.
sdp_dir: ./NeMo-speech-data-processor # Path to the local clone of the SDP repo.

processors:
  # 0. Get base manifest (JSONL with audio references and text)
  - _target_: sdp.processors.GetGranarysYodas2
    output_manifest_file: ${workspace_dir}/${params.source_lang}/manifest_00.json
    lang: ${params.source_lang}
    translation: ${params.en_translation}

  # 1. Apply regex-based substitutions to normalize text
  - _target_: sdp.processors.SubRegex
    text_key: text
    regex_params_yaml: ${sdp_dir}/dataset_configs/multilingual/yodas2/partials/subregex_params/${params.use_regex}.yaml
    output_manifest_file: ${workspace_dir}/${params.source_lang}/manifest_01.json

  # 2. Drop empty or whitespace-only entries
  - _target_: sdp.processors.DropIfRegexMatch
    output_manifest_file: ${workspace_dir}/${params.source_lang}/manifest_02.json
    text_key: text
    regex_patterns:
      - "^\\s*$"

  # 3. Expand metadata (adds lang_subset, shard_id, etc.)
  - _target_: sdp.processors.ListYodas2Data
    output_manifest_file: ${workspace_dir}/${params.source_lang}/manifest_03.json
    use_metadata: True

  # 4. Remove unused fields
  - _target_: sdp.processors.DropSpecifiedFields
    output_manifest_file: ${workspace_dir}/${params.source_lang}/manifest_04.json
    fields_to_drop:
      - duration_key
      - text_key

  # 5. Add `source_lang` field based on lang_subset prefix
  - _target_: sdp.processors.LambdaExpression
    output_manifest_file: ${workspace_dir}/${params.source_lang}/manifest_05.json
    new_field: source_lang
    expression: entry.lang_subset[:2]

  # 6. Keep only entries where source_lang matches config
  - _target_: sdp.processors.PreserveByValue
    output_manifest_file: ${workspace_dir}/${params.source_lang}/manifest_06.json
    input_value_key: source_lang
    target_value: ${params.source_lang}

  # 7. Download tarballs with snapshot_download (if enabled)
  - _target_: sdp.processors.SnapshotDownloadYodas2Data
    should_run: ${params.use_snapshot_download}
    output_manifest_file: ${workspace_dir}/${params.source_lang}/manifest_07.json
    local_dir: ${workspace_dir}/${params.source_lang}/
    max_workers: 8

  # 8. Download tarballs via HF Hub API (default path)
  - _target_: sdp.processors.HfHubDownloadYodas2Data
    input_manifest_file: ${workspace_dir}/${params.source_lang}/manifest_06.json
    should_run: ${not:${params.use_snapshot_download}}
    filename_field: audio_key
    output_filepath_field: local_audio
    output_manifest_file: ${workspace_dir}/${params.source_lang}/manifest_07.json
    hf_hub_download_args:
      local_dir: ${workspace_dir}/${params.source_lang}/
    max_workers: 8

  # 9. Extract .tar files into audio WAVs
  - _target_: sdp.processors.ExtractTar
    output_manifest_file: ${workspace_dir}/${params.source_lang}/manifest_08.json
    field_to_tar_filepath: 'local_audio'
    extraction_dir: ${workspace_dir}/${params.source_lang}
    remove_source_tar: ${params.save_disk_space}
    filepath_prefix_field: 'lang_subset'
    output_filepath_field: 'extracted_audios'
    get_extracted_filepaths: True

  # 10. Flatten lists of extracted audio paths
  - _target_: sdp.processors.ListToEntries
    output_manifest_file: ${workspace_dir}/${params.source_lang}/manifest_09.json
    field_with_list: 'extracted_audios'
    output_field: 'source_audio_filepath'

  # 11. Add yodas_id (unique audio ID) from filename
  - _target_: sdp.processors.LambdaExpression
    output_manifest_file: ${workspace_dir}/${params.source_lang}/manifest_10.json
    new_field: 'yodas_id'
    expression: "entry.source_audio_filepath[-15:-4]"

  # 12. Define the final audio output path for conversion
  - _target_: sdp.processors.LambdaExpression
    output_manifest_file: ${workspace_dir}/${params.source_lang}/manifest_11.json
    new_field: 'audio_filepath'
    expression: "'${workspace_dir}/${params.source_lang}/converted/' + entry.lang_subset + '/' + entry.shard_id + '/' + entry.yodas_id"

  # 13. Convert audio to 16kHz mono WAV
  - _target_: sdp.processors.FfmpegConvert
    output_manifest_file: ${workspace_dir}/${params.source_lang}/manifest_12.json
    input_file_key: 'source_audio_filepath'
    output_file_key: 'audio_filepath'
    id_key: 'audio_filepath'
    converted_audio_dir: '/'
    target_samplerate: 16000
    target_nchannels: 1

  # 14. Optionally remove the original raw audio files
  - _target_: sdp.processors.RemoveFiles
    filepath_field: 'source_audio_filepath'
    should_run: ${params.save_disk_space}

  # 15. Keep only fields needed for final merge
  - _target_: sdp.processors.KeepOnlySpecifiedFields
    input_manifest_file: ${workspace_dir}/${params.source_lang}/manifest_12.json
    output_manifest_file: ${workspace_dir}/${params.source_lang}/manifest_13.json
    fields_to_keep:
      - yodas_id
      - audio_filepath

  # 16. Merge audio paths with filtered text
  - _target_: sdp.processors.JoinManifests
    left_manifest_file: ${workspace_dir}/${params.source_lang}/manifest_02.json
    right_manifest_file: ${workspace_dir}/${params.source_lang}/manifest_13.json
    merge_params:
      'on': yodas_id
      how: inner
      copy: False
    output_manifest_file: ${workspace_dir}/${params.source_lang}/manifest_14.json

  # 17. Add fields required for Canary model
  - _target_: sdp.processors.AddConstantFields
    output_manifest_file: ${workspace_dir}/${params.source_lang}/manifest_15.json
    fields:
      decodercontext: ""
      "emotion": "<|emo:undefined|>"
      "pnc": "pnc"
      "itn": "itn"
      "timestamp": "notimestamp"
      "diarize": "nodiarize"

  # 18. Create the final tarred audio dataset
  - _target_: sdp.processors.ConvertToTarredAudioDataset
    should_run: ${params.convert_to_audio_tarred_dataset.should_run}
    output_manifest_file: ${workspace_dir}/${params.source_lang}/manifest_16.json
    min_duration: ${params.convert_to_audio_tarred_dataset.min_audio_duration}
    max_duration: ${params.convert_to_audio_tarred_dataset.max_audio_duration}
    target_dir: ${workspace_dir}/${params.source_lang}/tarred_dataset
    num_shards: ${params.convert_to_audio_tarred_dataset.num_shards}
    buckets_num: ${params.convert_to_audio_tarred_dataset.buckets_num}
    workers: -1
    shuffle: True
    shuffle_seed: 1
    sort_in_shards: True
    slice_with_offset: True

  # 19. Optionally delete final converted audio files
  - _target_: sdp.processors.RemoveFiles
    input_manifest_file: ${workspace_dir}/${params.source_lang}/manifest_17.json
    filepath_field: 'audio_filepath'
    should_run: ${params.save_disk_space}