A modular, extensible Python toolkit for analyzing the quality of text and audio datasets for low-resource languages. LangQuality helps researchers and developers ensure high-quality datasets for training NLP models (ASR, machine translation, language models) across diverse languages.
- Language-agnostic architecture: Works with any language through configurable Language Packs
- Pre-built packs: Fongbe, French, English, and more
- Easy customization: Create your own Language Pack in minutes
- Community-driven: Share and discover Language Packs from the community
- Structural Analysis: Sentence length distribution, outlier detection, statistical metrics
- Linguistic Analysis: Readability scores, lexical complexity, morphological features
- Diversity Analysis: Vocabulary richness (TTR), n-gram distributions, duplicate detection
- Domain Analysis: Thematic balance, under/over-represented categories
- Gender Bias Detection: Gender representation, stereotype detection, balance metrics
- Custom analyzers: Add your own analysis modules without modifying core code
- Automatic discovery: Drop plugins into a directory and they're automatically loaded
- Language-specific analyzers: Create analyzers tailored to specific languages
- Interactive Dashboard: Beautiful HTML visualizations with Plotly
- Actionable Recommendations: Prioritized suggestions based on best practices
- Multiple Exports: JSON, CSV, PDF reports, execution logs
- Per-sentence annotations: Quality scores and flags for each sentence
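As a rough illustration of the structural and diversity metrics listed above, here is a minimal standalone sketch. It is not LangQuality's implementation: the function names, the whitespace tokenization, and the z-score threshold are all assumptions made for the example.

```python
from collections import Counter
from statistics import mean, stdev


def type_token_ratio(sentences):
    """Unique tokens divided by total tokens (whitespace split, lowercased)."""
    tokens = [tok.lower() for s in sentences for tok in s.split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def find_duplicates(sentences):
    """Sentences that occur more than once after case/whitespace normalization."""
    counts = Counter(" ".join(s.lower().split()) for s in sentences)
    return sorted(s for s, n in counts.items() if n > 1)


def length_outliers(sentences, z_threshold=2.0):
    """Sentences whose word count is more than z_threshold std devs from the mean."""
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return []
    mu, sigma = mean(lengths), stdev(lengths)
    if sigma == 0:
        return []
    return [s for s, n in zip(sentences, lengths) if abs(n - mu) / sigma > z_threshold]


sample = [
    "the cat sat on the mat",
    "The cat sat on the mat",
    "a dog barked",
]
print(f"TTR: {type_token_ratio(sample):.2f}")
print(f"Duplicates: {find_duplicates(sample)}")
```

A real Language Pack would likely swap the whitespace split for language-specific tokenization, which matters a great deal for TTR in morphologically rich languages.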
# Install from PyPI
pip install langquality
# Install with all optional dependencies
pip install "langquality[all]"
# Download language models (if using spaCy-based packs)
python -m spacy download fr_core_news_md # For French
python -m spacy download en_core_web_md # For English

Analyze a dataset with a specific language:
# Analyze Fongbe data
langquality analyze --input data/fongbe_sentences --output results --language fon
# Analyze French data
langquality analyze --input data/french_sentences --output results --language fra
# Analyze English data
langquality analyze --input data/english_sentences --output results --language eng
# Open the interactive dashboard
open results/dashboard.html

from langquality.pipeline import PipelineController
from langquality.language_packs import LanguagePackManager
from langquality.data import GenericDataLoader
# Load a language pack
pack_manager = LanguagePackManager()
language_pack = pack_manager.load_language_pack("fon")
# Load your data
loader = GenericDataLoader(language_pack)
sentences = loader.load_from_csv("data/sentences.csv")
# Run analysis
controller = PipelineController(language_pack)
results = controller.run(sentences)
# Access results
print(f"Total sentences: {results.structural.total_sentences}")
print(f"Average readability: {results.linguistic.avg_readability_score}")

Language Packs are self-contained configurations that adapt LangQuality to specific languages. Each pack includes:
- Language-specific configuration (tokenization, thresholds, etc.)
- Linguistic resources (lexicons, stopwords, gender terms, etc.)
- Optional custom analyzers
| Language | Code | Status | Resources |
|---|---|---|---|
| Fongbe | fon | ✅ Stable | Full (lexicon, gender terms, ASR vocabulary) |
| French | fra | ✅ Stable | Full (lexicon, stopwords, gender terms, professions) |
| English | eng | ✅ Stable | Full (lexicon, stopwords, gender terms, professions) |
| Minangkabau | min | 🚧 Minimal | Basic configuration only |
| Your Language | xxx | 💡 Create one! | See Language Pack Guide |
# List installed packs
langquality pack list
# Show pack details
langquality pack info fon
# Create a new pack template
langquality pack create <language_code>
# Validate a pack
langquality pack validate path/to/pack

Creating a Language Pack for your language is straightforward:
- Generate a template:
  langquality pack create <your_language_code>
- Configure the pack: Edit config.yaml with language-specific settings
- Add resources (optional): Add lexicons, stopwords, or other linguistic resources
- Test it:
  langquality pack validate path/to/your_pack
  langquality analyze --input test_data --output results --language <your_language_code>
See the Language Pack Guide for detailed instructions.
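Before running `langquality pack validate`, a quick local sanity check of the pack directory can catch obvious omissions. The sketch below is purely illustrative: `check_pack_layout` is a hypothetical helper, and the layout it assumes (a `config.yaml` at the pack root plus an optional `resources/` directory) is an assumption rather than the documented pack format.

```python
from pathlib import Path


def check_pack_layout(pack_dir):
    """Return a list of problems found in a pack directory (empty means it looks OK).

    Assumed (hypothetical) layout: <pack_dir>/config.yaml, plus an optional
    <pack_dir>/resources/ directory for lexicons, stopwords, and similar files.
    """
    root = Path(pack_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    if not (root / "config.yaml").is_file():
        problems.append("missing config.yaml")
    resources = root / "resources"
    if resources.exists() and not resources.is_dir():
        problems.append("resources exists but is not a directory")
    return problems
```

This does not replace the built-in validator, which is the authoritative check for a pack's contents.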
- Quickstart Guide: Get up and running in 5 minutes
- User Guide: Comprehensive usage documentation
- Language Pack Guide: Create and customize Language Packs
- Developer Guide: Extend LangQuality
- API Reference: Complete API documentation
- FAQ: Frequently asked questions
- Migration Guide: Migrating from fongbe-data-quality
LangQuality is designed for researchers and developers working with low-resource languages:
- ASR Dataset Preparation: Ensure text quality before audio recording
- Machine Translation: Validate parallel corpora quality
- Language Model Training: Assess dataset diversity and balance
- Corpus Linguistics: Analyze linguistic properties of text collections
- Data Curation: Filter and improve existing datasets
Override default thresholds and settings:
langquality analyze --input data --output results --language fon --config my_config.yaml

Example configuration:

thresholds:
  structural:
    min_words: 5
    max_words: 15
  diversity:
    target_ttr: 0.65
  gender:
    target_ratio: [0.45, 0.55]

Create custom analyzers for specialized analysis:
from langquality.analyzers import Analyzer


class ToneAnalyzer(Analyzer):
    """Analyze tone and sentiment of sentences."""

    def analyze(self, sentences):
        metrics = {}  # Populate with the results of your analysis logic
        return metrics

    def get_requirements(self):
        return ["tone_lexicon"]  # Required resources

Place your analyzer in the plugins directory and it will be automatically discovered.
See Creating Analyzers for details.
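To make the analyzer pattern concrete, here is a fuller sketch that runs without LangQuality installed. The `Analyzer` base class below is a stand-in that only mirrors the two methods shown above. The real base class may require more. The lexicon format (a word-to-label mapping passed to the constructor) is also an assumption for illustration.

```python
from abc import ABC, abstractmethod


class Analyzer(ABC):
    """Stand-in for langquality.analyzers.Analyzer (assumed interface)."""

    @abstractmethod
    def analyze(self, sentences):
        ...

    def get_requirements(self):
        return []


class ToneAnalyzer(Analyzer):
    """Count sentences by tone using a small word-level lexicon."""

    def __init__(self, tone_lexicon):
        # Hypothetical resource format: {"word": "positive" | "negative"}
        self.tone_lexicon = tone_lexicon

    def analyze(self, sentences):
        counts = {"positive": 0, "negative": 0, "neutral": 0}
        for sentence in sentences:
            labels = {
                self.tone_lexicon[word]
                for word in sentence.lower().split()
                if word in self.tone_lexicon
            }
            if labels == {"positive"}:
                counts["positive"] += 1
            elif labels == {"negative"}:
                counts["negative"] += 1
            else:
                # No lexicon hits, or mixed signals, are treated as neutral.
                counts["neutral"] += 1
        return counts

    def get_requirements(self):
        return ["tone_lexicon"]


analyzer = ToneAnalyzer({"good": "positive", "bad": "negative"})
print(analyzer.analyze(["a good day", "a bad day", "a day", "good and bad"]))
```

Returning a plain dictionary keeps the example simple. How the actual pipeline expects metrics to be structured is covered in the analyzer documentation referenced above.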
We welcome contributions from the community, whether you are:
- Creating a Language Pack for your language
- Adding new analyzers or features
- Improving documentation
- Reporting bugs or issues
- Suggesting enhancements
Please see our Contributing Guide for:
- Code of Conduct
- Development setup
- Contribution workflow
- Coding standards
- Testing requirements
- Fork the repository
- Create a feature branch: git checkout -b feature/amazing-feature
- Make your changes and add tests
- Ensure tests pass: pytest
- Commit: git commit -m 'Add amazing feature'
- Push: git push origin feature/amazing-feature
- Open a Pull Request
Join our community to get help, share ideas, and collaborate:
- GitHub Discussions: Ask questions, share ideas, showcase your Language Packs
- Issue Tracker: Report bugs, request features
- Documentation: Comprehensive guides and API reference
- Contributing Guide: Learn how to contribute
- Code of Conduct: Our community standards
- Questions: Use GitHub Discussions Q&A
- Bug Reports: Open an issue
- Feature Requests: Open an issue
- Language Pack Submissions: Use our Language Pack template
LangQuality is actively maintained and under continuous development. See our CHANGELOG for recent updates and our Roadmap for planned features.
Current version: 1.0.0 (Stable)
LangQuality is released under the MIT License. You are free to use, modify, and distribute this software for any purpose, including commercial applications.
LangQuality evolved from the Fongbe Data Quality Pipeline, originally developed to support dataset creation for Fongbe, a low-resource language from Benin. We're grateful to:
- The linguistic community working on African language preservation and NLP development
- Contributors who have created Language Packs and shared their expertise
- The open-source NLP community for tools and libraries that make this work possible
If you use LangQuality in your research, please cite:
@software{langquality_toolkit,
  title={LangQuality: Language Quality Toolkit for Low-Resource Languages},
  author={LangQuality Community},
  year={2024},
  url={https://github.com/langquality/langquality},
  version={1.0.0}
}

- Common Voice: Crowdsourced voice dataset
- FLORES: Multilingual translation benchmark
- Masakhane: African NLP community
Made with ❤️ for low-resource language communities worldwide