Commit 1ac730d

Granary large-scale speech processing pipeline (#155)
1 parent d7d4927 commit 1ac730d


86 files changed (+5510, −77 lines)

.github/workflows/tests.yml

Lines changed: 8 additions & 2 deletions
```diff
@@ -74,8 +74,8 @@ jobs:
         pip install Cython wheel # need to pre-install to avoid error in nemo installation
         pip install nemo-toolkit[asr,nlp]==2.2.1
         pip install nemo_text_processing
-        pip install pymarian
         pip install -r requirements/huggingface.txt
+        pip install pymarian
         pip install certifi # this is needed to avoid problems with certificates [CORAL]
         export SSL_CERT_FILE=$(python -m certifi)
         python -m pip cache purge
@@ -93,7 +93,13 @@ jobs:
         sudo cp incommon-rsa-ca2.pem /usr/local/share/ca-certificates/incommon-rsa-server-ca-2.crt # [cert for CORAL]
         sudo update-ca-certificates # [cert for CORAL]
         set -o pipefail # this will make sure next line returns non-0 exit code if tests fail
-        python -m pytest tests/ --junitxml=pytest.xml --ignore=tests/test_tts_sdp_end_to_end.py --cov-report=term-missing:skip-covered --cov=sdp --durations=30 -rs | tee pytest-coverage.txt
+        python -m pytest tests/ \
+          --junitxml=pytest.xml \
+          --ignore=tests/test_tts_sdp_end_to_end.py \
+          --cov-report=term-missing:skip-covered \
+          --cov=sdp \
+          --durations=30 \
+          -rs | tee pytest-coverage.txt

         # TODO: add some way to see if e2e tests were skipped
```
Lines changed: 166 additions & 0 deletions
@@ -0,0 +1,166 @@
## Granary Dataset Creation Pipeline

### Overview

This configuration drives the **Granary pseudo-labelling pipeline**, an open-source workflow that transforms large, noisy speech corpora into high-quality Automatic Speech Recognition (ASR) and Automatic Speech Translation (AST) training data for **25 European languages**.

The first public release of **Granary** (≈ 643k hours of ASR / ≈ 351k hours of AST data) was built from three openly available corpora:

- [espnet/yodas2](https://huggingface.co/datasets/espnet/yodas2)
- [FBK-MT/mosel](https://huggingface.co/datasets/FBK-MT/mosel)
- [PleIAs/YouTube-Commons](https://huggingface.co/datasets/PleIAs/YouTube-Commons)

and is published as [nvidia/Granary](https://huggingface.co/datasets/nvidia/Granary).
> **Note: per-language runs**
>
> The pipeline is executed once per language pair. Set:
>
> - `source_lang` / `source_lang_full` – audio & transcript language
> - `translation.target_lang` / `target_lang_full` – translation language
>
> For example, to obtain English audio with Italian translations, choose `source_lang: en` and `translation.target_lang: it`. A separate run is required for each additional language combination.
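As a concrete sketch, an English→Italian run could be launched with Hydra-style overrides on top of the Quick-start command below; the `params.*` key paths are an assumption based on the "Tunable parameters" section and should be verified against the shipped `config.yaml`:

```bash
# Sketch: English audio with Italian translations (one run per language pair).
# The params.* key paths are assumed from the "Tunable parameters" section.
python ${SDP_DIR}/main.py \
    --config-path ${SDP_DIR}/dataset_configs/multilingual/granary/ \
    --config-name config.yaml \
    params.source_lang=en \
    params.source_lang_full=English \
    params.translation.target_lang=it \
    params.translation.target_lang_full=Italian \
    input_manifest_file=/path/to/input_manifest.json \
    output_dir=/path/to/output/dir \
    sdp_dir=${SDP_DIR}
```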
> **Note: GPU required**
>
> All Whisper, vLLM and Comet-QE stages expect at least one CUDA-capable GPU. Multi-GPU nodes are auto-detected when `num_devices: -1` (the default) is used.
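If only a subset of a node's GPUs should be used, the standard CUDA masking variable can be set before any of the commands below; this only restricts device visibility and does not change the pipeline itself:

```bash
# Expose only GPUs 0 and 1 to the run; with num_devices: -1 (default),
# the pipeline auto-detects the two visible devices
export CUDA_VISIBLE_DEVICES=0,1
```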
### Software prerequisites

Install NeMo-speech-data-processor plus the extra wheels required by specific processors:

- `FasterWhisperInference`

```bash
pip install pytorch-lightning \
    "nvidia-cublas-cu12" \
    "nvidia-cudnn-cu12==9.*" \
    faster_whisper

export LD_LIBRARY_PATH=$(python - <<'PY'
import os, nvidia.cublas.lib, nvidia.cudnn.lib
print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))
PY
)
```

- `vLLMInference`

```bash
pip install "optree>=0.13.0" vllm
```

- `CometoidWMTQualityEstimation`

```bash
pip install pymarian
```

- `FastTextLangIdClassifier`

```bash
pip install fasttext
```

- `ConvertToTarredAudioDataset` (optional, only needed if tar-sharding is enabled)

```bash
pip install lhotse "nemo-toolkit[common]==2.2.1"
```
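For a machine that will run every stage, the per-processor extras above can be installed in one shot; this is simply the union of the commands listed, with the same version pins:

```bash
# Union of the per-processor extras listed above
pip install pytorch-lightning "nvidia-cublas-cu12" "nvidia-cudnn-cu12==9.*" faster_whisper \
    "optree>=0.13.0" vllm pymarian fasttext \
    lhotse "nemo-toolkit[common]==2.2.1"
```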
### Quick start

1. **Hardware** – a Linux box with NVIDIA GPU(s) and ≥ 16 GB VRAM (reference runs used A100-80GB; smaller cards work with reduced batch sizes).
2. **Install** NeMo-speech-data-processor and the extras listed above.
3. **Prepare** the input manifest and set three mandatory YAML keys:
   - `input_manifest_file` – manifest with raw audio paths
   - `output_dir` – working/output directory
   - `sdp_dir` – root of the SDP tree (for prompt/regex assets)
4. **Run the pipeline**:

```bash
# Path to your local clone of NeMo-speech-data-processor
SDP_DIR=/path/to/NeMo-speech-data-processor

python ${SDP_DIR}/main.py \
    --config-path ${SDP_DIR}/dataset_configs/multilingual/granary/ \
    --config-name config.yaml \
    input_manifest_file=/path/to/input_manifest.json \
    output_dir=/path/to/output/dir \
    sdp_dir=${SDP_DIR}
```
### Input and output formats

#### Input manifest

Each line is a JSON object with the source-audio path:

```json
{"source_audio_filepath": "/path/to/file.flac"}
```
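One hedged way to generate such a manifest from a directory of FLAC files (a sketch; it assumes bash with `globstar` enabled and paths free of double quotes and backslashes):

```bash
# Emit one JSON object per audio file; adjust the directory and extension to your data
shopt -s globstar
for f in /data/audio/**/*.flac; do
    printf '{"source_audio_filepath": "%s"}\n' "$f"
done > /path/to/input_manifest.json
```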
#### Key outputs

- `${output_dir}/${source_lang}/manifest_46.json` – final bilingual manifest containing `audio_filepath`, `offset`, `duration`, `text` (source) and `answer` (translation), plus constant decoder flags.
- `${output_dir}/${source_lang}/tarred_dataset/` – optional tarred-audio shards and `shard_manifest.json`, produced when `convert_to_audio_tarred_dataset.should_run: True`.
- All intermediate `manifest_XX.json` files are kept for audit/debug purposes.
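A quick spot-check of the final manifest, pretty-printing the first record (field names as listed above):

```bash
# Inspect the first record of the final bilingual manifest
head -n 1 "${output_dir}/${source_lang}/manifest_46.json" | python -m json.tool
```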
### Pipeline stages

The processors executed, with indices matching the config (an audit sketch follows the list):

- **FfmpegConvert** (0) – re-encode audio to 16 kHz mono FLAC.
- **GetAudioDuration** (1) – compute clip length.
- **RemoveFiles** (2) – optionally delete originals (`params.save_disk_space`).
- **FasterWhisperInference** (3) – first-pass language detection.
- **LambdaExpression** (4) – probability-based LID filtering.
- **DropSpecifiedFields** (5) – remove temporary fields.
- **FasterWhisperInference** (6, 14) – two-pass transcription (the second run can slice by offset).
- **Segmentation & grooming** (7–13) – split Whisper segments into atomic utterances.
- **Hallucination detection** (18–20) – drop repeated n-grams, garbage tokens and common filler phrases.
- **PnC restoration** (21–23) – `Qwen-2.5-7B` restores punctuation & capitalisation; optional regex clean-up.
- **Length & charset filtering** (27–36) – word-ratio, character-histogram and FastText checks.
- **Quality estimation** (41–43) – keep pairs with Comet-QE score ≥ `min_qe_score`.
- **Constant flags** (44) – add decoder directives (`<|emo:undefined|>`, `itn`, `pnc`, etc.).
- **Tarred dataset** (46) – shard audio into `num_shards` tar files (optional).
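Since each stage writes its own `manifest_XX.json` (numbered by the indices above) and all intermediates are kept, attrition across the pipeline can be audited with a simple line count per manifest (a sketch; it assumes one JSON object per line, as in the examples above):

```bash
# Rows surviving after each stage: one line per record in every intermediate manifest
wc -l ${output_dir}/${source_lang}/manifest_*.json
```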
### Tunable parameters

All knobs live under the `params` block; a hedged override example follows the list.

- **Language**
  - `source_lang` / `source_lang_full`
  - `translation.target_lang` / `target_lang_full`
- **Audio duration**
  - `min_audio_duration` – drop very short clips (seconds)
  - `max_audio_duration` – drop very long clips (seconds)
- **Language-ID & text filtering**
  - `min_audio_lid_probability` – Whisper LID threshold
  - `translation.min_hist_token_ratio` – charset-purity ratio
  - `translation.min_text_lid_probability` – FastText LID threshold
- **Length & quality**
  - `translation.max_len_diff_ratio` – maximum source-to-target word-count ratio
  - `translation.min_qe_score` – Comet-QE acceptance score
- **Tarred dataset**
  - `convert_to_audio_tarred_dataset.should_run` (bool)
  - `num_shards` and `buckets_num` – shard layout
- **Misc.**
  - `use_regex` – regex preset for text normalisation
  - `save_disk_space` – delete originals after conversion
  - `use_dask` – enable distributed execution (not recommended)
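As an illustration, the duration and quality filters can be tightened from the command line; the values below are placeholders, not tuned recommendations:

```bash
# Illustrative filter overrides; values are examples only
python ${SDP_DIR}/main.py \
    --config-path ${SDP_DIR}/dataset_configs/multilingual/granary/ \
    --config-name config.yaml \
    params.min_audio_duration=1.0 \
    params.max_audio_duration=40.0 \
    params.translation.min_qe_score=0.75 \
    input_manifest_file=/path/to/input_manifest.json \
    output_dir=/path/to/output/dir \
    sdp_dir=${SDP_DIR}
```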
### Advanced usage

- **Selective execution** – override `processors_to_run` with a range of indices, e.g. `"0:25"` (combined sketch below).
- **Model swapping** – every inference processor exposes either `model_size_or_path` (Whisper) or an embedded `model:` block (vLLM).
- **Resource tuning** – `num_devices: -1` uses all visible GPUs; set an integer to pin workers per stage.
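Putting these together, a sketch of a partial run pinned to two devices per stage; whether `num_devices` sits under `params` (and whether the `"0:25"` range end is inclusive) should be checked against the shipped config, and per-processor model swaps (e.g. Whisper's `model_size_or_path`) would be set analogously in the YAML:

```bash
# Sketch: run stages 0-25 only, with two GPUs/workers per stage
# (params.num_devices key path is an assumption about the config layout)
python ${SDP_DIR}/main.py \
    --config-path ${SDP_DIR}/dataset_configs/multilingual/granary/ \
    --config-name config.yaml \
    processors_to_run="0:25" \
    params.num_devices=2 \
    input_manifest_file=/path/to/input_manifest.json \
    output_dir=/path/to/output/dir \
    sdp_dir=${SDP_DIR}
```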
### References

- Koluguri et al. (2025). *Granary: Speech Recognition and Translation Dataset in 25 European Languages* (preprint). arXiv: [2505.13404](https://arxiv.org/abs/2505.13404).
- [nvidia/Granary](https://huggingface.co/datasets/nvidia/Granary) dataset on Hugging Face.
- NeMo-SDP source [code](https://github.com/NVIDIA/NeMo-speech-data-processor/blob/main/dataset_configs/multilingual/granary/).
