## Granary Dataset Creation Pipeline

### Overview

This configuration drives the **Granary pseudo-labelling pipeline** – an open-source workflow that transforms large, noisy speech corpora into high-quality Automatic Speech Recognition (ASR) and Automatic Speech Translation (AST) training data for **25 European languages**.

The first public release of **Granary** (≈ 643 k h ASR / ≈ 351 k h AST) was built from three openly available corpora:

- [espnet/yodas2](https://huggingface.co/datasets/espnet/yodas2)
- [FBK-MT/mosel](https://huggingface.co/datasets/FBK-MT/mosel)
- [PleIAs/YouTube-Commons](https://huggingface.co/datasets/PleIAs/YouTube-Commons)

and is published as [nvidia/Granary](https://huggingface.co/datasets/nvidia/Granary).

> Note — Per-language runs
>
> The pipeline is executed once per language pair: set
> - `source_lang` / `source_lang_full` – audio & transcript language
> - `translation.target_lang` / `target_lang_full` – translation language
>
> For example, to obtain English audio with Italian translations, choose `source_lang: en` and `translation.target_lang: it`. Separate runs are required for each additional language combination.
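
A sketch of such a run for the English→Italian case, using Hydra-style command-line overrides on top of the base command shown in Quick start below (whether these keys need the `params.` prefix depends on the nesting in `config.yaml`, so treat the override paths as an assumption to verify):

```bash
# Sketch: per-language overrides (override paths are an assumption).
python ${SDP_DIR}/main.py \
    --config-path ${SDP_DIR}/dataset_configs/multilingual/granary/ \
    --config-name config.yaml \
    params.source_lang=en \
    params.source_lang_full=English \
    params.translation.target_lang=it \
    params.translation.target_lang_full=Italian \
    input_manifest_file=/path/to/input_manifest.json \
    output_dir=/path/to/output/dir \
    sdp_dir=${SDP_DIR}
```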

> Note — GPU required
>
> All Whisper, vLLM and Comet-QE stages expect at least one CUDA-capable GPU. Multi-GPU nodes are auto-detected when `num_devices: -1` (default) is used.

### Software prerequisites

Install NeMo-speech-data-processor plus the extra wheels required by specific processors:

- `FasterWhisperInference`

```bash
pip install pytorch-lightning \
    "nvidia-cublas-cu12" \
    "nvidia-cudnn-cu12==9.*" \
    faster_whisper

# Make the pip-installed cuBLAS/cuDNN libraries visible at runtime.
export LD_LIBRARY_PATH=$(python - <<'PY'
import os, nvidia.cublas.lib, nvidia.cudnn.lib
print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))
PY
)
```

- `vLLMInference`

```bash
pip install "optree>=0.13.0" vllm
```

- `CometoidWMTQualityEstimation`

```bash
pip install pymarian
```

- `FastTextLangIdClassifier`

```bash
pip install fasttext
```

- `ConvertToTarredAudioDataset` (optional, only if tar-sharding is enabled)

```bash
pip install lhotse "nemo-toolkit[common]==2.2.1"
```

### Quick start

1. **Hardware** – Linux box with NVIDIA GPU(s) and ≥ 16 GB VRAM (reference runs used A100-80 GB; smaller cards work with reduced batch sizes).
2. **Install** NeMo-speech-data-processor and the extras listed above.
3. **Prepare** the input manifest and set three mandatory YAML keys:
   - `input_manifest_file` – manifest with raw audio paths
   - `output_dir` – working/output directory
   - `sdp_dir` – root of the SDP tree (for prompt/regex assets)
4. **Run the pipeline**:

```bash
# Path to your local clone of NeMo-speech-data-processor
SDP_DIR=/path/to/NeMo-speech-data-processor

python ${SDP_DIR}/main.py \
    --config-path ${SDP_DIR}/dataset_configs/multilingual/granary/ \
    --config-name config.yaml \
    input_manifest_file=/path/to/input_manifest.json \
    output_dir=/path/to/output/dir \
    sdp_dir=${SDP_DIR}
```

### Input and output formats

#### Input manifest

Each line is a JSON object with the source-audio path:

```json
{"source_audio_filepath": "/path/to/file.flac"}
```
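
One way to generate such a manifest from a directory tree of FLAC files (a sketch; `AUDIO_DIR` and the output path are placeholders, and file paths are assumed to contain no double quotes or newlines):

```bash
# Sketch: emit one JSON object per audio file found under AUDIO_DIR.
AUDIO_DIR=/path/to/raw_audio
find "${AUDIO_DIR}" -name '*.flac' | while read -r f; do
    printf '{"source_audio_filepath": "%s"}\n' "$f"
done > /path/to/input_manifest.json
```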

#### Key outputs

- `${output_dir}/${source_lang}/manifest_46.json` – final bilingual manifest containing `audio_filepath`, `offset`, `duration`, `text` (source) and `answer` (translation), plus constant decoder flags.
- `${output_dir}/${source_lang}/tarred_dataset/` – optional tarred-audio shards and `shard_manifest.json` when `convert_to_audio_tarred_dataset.should_run: True`.
- All intermediate `manifest_XX.json` files are kept for audit/debug.
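
For orientation, an illustrative line of the final manifest for an en→it run (values are invented, and the constant decoder-flag fields added at stage 44 are omitted):

```json
{"audio_filepath": "/path/to/file.flac", "offset": 12.4, "duration": 5.8, "text": "Good morning, everyone.", "answer": "Buongiorno a tutti."}
```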

### Pipeline stages

The processors executed (indices match the config):

- **FfmpegConvert** (0) – re-encode audio to 16 kHz/mono FLAC.
- **GetAudioDuration** (1) – compute clip length.
- **RemoveFiles** (2) – optionally delete originals (`params.save_disk_space`).
- **FasterWhisperInference** (3) – pass-1 language detection.
- **LambdaExpression** (4) – probability-based LID filtering.
- **DropSpecifiedFields** (5) – remove temporary fields.
- **FasterWhisperInference** (6, 14) – two-pass transcription (the second run can slice by offset).
- **Segmentation & grooming** (7–13) – split Whisper segments into atomic utterances.
- **Hallucination detection** (18–20) – drop repeated n-grams, garbage tokens and common filler phrases.
- **PnC restoration** (21–23) – `Qwen-2.5-7B` restores punctuation & capitalisation; optional regex clean-up.
- **Length & charset filtering** (27–36) – word-ratio, character-histogram and FastText checks.
- **Quality estimation** (41–43) – keep pairs with `Comet-QE score ≥ min_qe_score`.
- **Constant flags** (44) – add decoder directives (`<|emo:undefined|>`, `itn`, `pnc`, etc.).
- **Tarred dataset** (46) – shard audio into `num_shards` tar files (optional).

### Tunable parameters

All knobs live under the `params` block; an illustrative excerpt follows the list.

- **Language**
  - `source_lang` / `source_lang_full`
  - `translation.target_lang` / `target_lang_full`

- **Audio duration**
  - `min_audio_duration` – drop very short clips (seconds)
  - `max_audio_duration` – drop very long clips (seconds)

- **Language-ID & text filtering**
  - `min_audio_lid_probability` – Whisper LID threshold
  - `translation.min_hist_token_ratio` – charset-purity ratio
  - `translation.min_text_lid_probability` – FastText LID threshold

- **Length & quality**
  - `translation.max_len_diff_ratio` – max(src/tgt) word ratio
  - `translation.min_qe_score` – Comet-QE acceptance score

- **Tarred dataset**
  - `convert_to_audio_tarred_dataset.should_run` (bool)
  - `num_shards` and `buckets_num` – shard layout

- **Misc.**
  - `use_regex` – regex preset for text normalisation
  - `save_disk_space` – delete originals after conversion
  - `use_dask` – enable distributed execution (not recommended)
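
An illustrative excerpt (values are invented, not the shipped defaults, and the nesting of `num_shards`/`buckets_num` under the tarred-dataset block is an assumption):

```yaml
params:
  source_lang: en
  source_lang_full: English
  min_audio_duration: 1.0        # seconds
  max_audio_duration: 40.0       # seconds
  min_audio_lid_probability: 0.7
  translation:
    target_lang: it
    target_lang_full: Italian
    min_hist_token_ratio: 0.99
    min_text_lid_probability: 0.7
    max_len_diff_ratio: 4.0
    min_qe_score: 0.75
  convert_to_audio_tarred_dataset:
    should_run: True
    num_shards: 1024
    buckets_num: 1
```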

### Advanced usage

- **Selective execution** – override `processors_to_run` with a range of indices, e.g. `"0:25"` (see the sketch after this list).
- **Model swapping** – every inference processor exposes either `model_size_or_path` (Whisper) or an embedded `model:` block (vLLM).
- **Resource tuning** – `num_devices: -1` uses all visible GPUs; set an integer to pin workers per stage.
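
A sketch of a partial re-run pinned to two GPUs (whether `num_devices` must be prefixed with `params.` on the command line is an assumption to verify against `config.yaml`):

```bash
# Sketch: run only processors 0-25 on two GPUs.
python ${SDP_DIR}/main.py \
    --config-path ${SDP_DIR}/dataset_configs/multilingual/granary/ \
    --config-name config.yaml \
    processors_to_run="0:25" \
    params.num_devices=2 \
    input_manifest_file=/path/to/input_manifest.json \
    output_dir=/path/to/output/dir \
    sdp_dir=${SDP_DIR}
```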

### References

- Koluguri et al. (2025). Granary: Speech Recognition and Translation Dataset in 25 European Languages (preprint). arXiv: [2505.13404](https://arxiv.org/abs/2505.13404).
- [nvidia/Granary](https://huggingface.co/datasets/nvidia/Granary) dataset on Hugging Face.
- NeMo-SDP source [code](https://github.com/NVIDIA/NeMo-speech-data-processor/blob/main/dataset_configs/multilingual/granary/).