ssh-meister (Collaborator) commented on Jul 2, 2025

This pull request introduces a set of modular components for processing the Granary dataset.

🔧 General-Purpose Processors:

These processors are not specific to any single dataset and can be reused across different data pipelines:

  1. LambdaExpression processor (#136)
  2. SubRegex processor: adds support for loading a list of regex substitution rules from an external YAML file (#137)
  3. ExtractTar and RemoveFiles processors, plus a reorganization of the audio converters (#139)
  4. FasterWhisperInference, DetectWhisperHallucinationFeatures, vLLMInference, and CleanQwenGeneration processors: refactors inference processes and adds new engines, FasterWhisper and vLLM (#141)
  5. ListToEntries processor (#140)
  6. DropSpecifiedFields processor (#144)
  7. CharacterHistogramLangValidator processor (#154)
  8. FastTextLangIdClassifier processor (#149)
  9. CometoidWMTQualityEstimation processor (#151)
  10. ConvertToTarredAudioDataset processor (#145)
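
For reviewers unfamiliar with item 2, here is a minimal sketch of what applying a list of regex substitution rules to a text field could look like. It is not the code from #137: the function name `apply_regex_rules` and the rule keys `pattern`, `repl`, and `count` are assumptions made for illustration, and the rules are written inline as the list of dicts that a YAML loader would typically return.

```python
import re
from typing import Dict, List


def apply_regex_rules(text: str, rules: List[Dict]) -> str:
    """Apply each substitution rule to the text, in order.

    Hypothetical rule schema (an assumption, not taken from the PR):
      pattern: regular expression to search for
      repl:    replacement string
      count:   optional maximum number of replacements (0 means all)
    """
    for rule in rules:
        text = re.sub(rule["pattern"], rule["repl"], text, count=rule.get("count", 0))
    return text


# In a real pipeline this list would be read from an external YAML file;
# it is inlined here so the sketch has no third-party dependencies.
rules = [
    {"pattern": r"\s+", "repl": " "},  # collapse runs of whitespace
    {"pattern": r"!+", "repl": "!"},   # collapse repeated exclamation marks
]

cleaned = apply_regex_rules("hello   world!!!", rules)
print(cleaned)
```

Keeping the rules in an external file means the same substitution list can be shared across pipeline configs without duplicating it in each one.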

⛓️ Pipelines

  1. Unified pipeline and README with instructions and documentation (Granary large-scale speech processing pipeline, #155)

ssh-meister self-assigned this on Jul 23, 2025