
laleye/langquality


LangQuality - Language Quality Toolkit for Low-Resource Languages

License: MIT | Python 3.8+

A modular, extensible Python toolkit for analyzing the quality of text and audio datasets for low-resource languages. LangQuality helps researchers and developers ensure high-quality datasets for training NLP models (ASR, machine translation, language models) across diverse languages.

✨ Key Features

🌍 Multi-Language Support via Language Packs

  • Language-agnostic architecture: Works with any language through configurable Language Packs
  • Pre-built packs: Fongbe, French, English, and more
  • Easy customization: Create your own Language Pack in minutes
  • Community-driven: Share and discover Language Packs from the community

🔍 Comprehensive Quality Analysis

  • Structural Analysis: Sentence length distribution, outlier detection, statistical metrics
  • Linguistic Analysis: Readability scores, lexical complexity, morphological features
  • Diversity Analysis: Vocabulary richness (TTR), n-gram distributions, duplicate detection
  • Domain Analysis: Thematic balance, under/over-represented categories
  • Gender Bias Detection: Gender representation, stereotype detection, balance metrics
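
As a rough illustration of the diversity metrics listed above, the sketch below computes a type-token ratio (TTR), bigram counts, and normalized duplicates in plain Python. It is not LangQuality's implementation: the function names are hypothetical, and lowercased whitespace tokenization is an assumption that a real Language Pack would likely replace with its own tokenizer.

```python
from collections import Counter


def type_token_ratio(sentences):
    """Unique tokens divided by total tokens (lowercased, whitespace-split)."""
    tokens = [tok for s in sentences for tok in s.lower().split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def bigram_counts(sentences):
    """Count adjacent token pairs inside each sentence."""
    counts = Counter()
    for s in sentences:
        toks = s.lower().split()
        counts.update(zip(toks, toks[1:]))
    return counts


def find_duplicates(sentences):
    """Sentences that occur more than once after case/whitespace normalization."""
    normalized = Counter(" ".join(s.lower().split()) for s in sentences)
    return sorted(s for s, n in normalized.items() if n > 1)


sample = ["the cat sat", "the dog sat", "The  cat sat"]
ttr = type_token_ratio(sample)        # 4 unique tokens out of 9
dups = find_duplicates(sample)        # the first and third sentence normalize to the same text
top_bigram = bigram_counts(sample).most_common(1)[0]
```

A low TTR or a long duplicate list is the kind of signal the diversity analysis is meant to surface before a dataset is used for training.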

🔌 Extensible Plugin System

  • Custom analyzers: Add your own analysis modules without modifying core code
  • Automatic discovery: Drop plugins into a directory and they're automatically loaded
  • Language-specific analyzers: Create analyzers tailored to specific languages

📊 Rich Output Formats

  • Interactive Dashboard: Beautiful HTML visualizations with Plotly
  • Actionable Recommendations: Prioritized suggestions based on best practices
  • Multiple Exports: JSON, CSV, PDF reports, execution logs
  • Per-sentence annotations: Quality scores and flags for each sentence
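
To make per-sentence annotations and the JSON/CSV exports more concrete, here is a minimal sketch using only the standard library. The record fields (`sentence`, `word_count`, `flags`) and the helper names are hypothetical; the files LangQuality actually writes may be structured differently.

```python
import csv
import io
import json

# Hypothetical annotation records: one dict per sentence.
annotations = [
    {"sentence": "Short one.", "word_count": 2, "flags": ["too_short"]},
    {"sentence": "This sentence has a comfortable number of words.",
     "word_count": 8, "flags": []},
]


def to_json(records):
    """Serialize records as JSON, keeping non-ASCII characters readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)


def to_csv(records):
    """Serialize records as CSV; the list of flags is joined with ';'."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["sentence", "word_count", "flags"])
    writer.writeheader()
    for record in records:
        writer.writerow({**record, "flags": ";".join(record["flags"])})
    return buf.getvalue()


json_text = to_json(annotations)
csv_text = to_csv(annotations)
```

Passing `ensure_ascii=False` matters for languages such as Fongbe, whose orthography uses characters outside ASCII.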

🚀 Quick Start

Installation

# Install from PyPI
pip install langquality

# Install with all optional dependencies
pip install "langquality[all]"  # quotes keep shells such as zsh from expanding the brackets

# Download language models (if using spaCy-based packs)
python -m spacy download fr_core_news_md  # For French
python -m spacy download en_core_web_md   # For English

Basic Usage

Analyze a dataset with a specific language:

# Analyze Fongbe data
langquality analyze --input data/fongbe_sentences --output results --language fon

# Analyze French data
langquality analyze --input data/french_sentences --output results --language fra

# Analyze English data
langquality analyze --input data/english_sentences --output results --language eng

View Results

# Open the interactive dashboard (macOS; on Linux use xdg-open instead)
open results/dashboard.html

Python API

from langquality.pipeline import PipelineController
from langquality.language_packs import LanguagePackManager
from langquality.data import GenericDataLoader

# Load a language pack
pack_manager = LanguagePackManager()
language_pack = pack_manager.load_language_pack("fon")

# Load your data
loader = GenericDataLoader(language_pack)
sentences = loader.load_from_csv("data/sentences.csv")

# Run analysis
controller = PipelineController(language_pack)
results = controller.run(sentences)

# Access results
print(f"Total sentences: {results.structural.total_sentences}")
print(f"Average readability: {results.linguistic.avg_readability_score}")

📦 Language Packs

Language Packs are self-contained configurations that adapt LangQuality to specific languages. Each pack includes:

  • Language-specific configuration (tokenization, thresholds, etc.)
  • Linguistic resources (lexicons, stopwords, gender terms, etc.)
  • Optional custom analyzers

Available Language Packs

Language | Code | Status | Resources
Fongbe | fon | ✅ Stable | Full (lexicon, gender terms, ASR vocabulary)
French | fra | ✅ Stable | Full (lexicon, stopwords, gender terms, professions)
English | eng | ✅ Stable | Full (lexicon, stopwords, gender terms, professions)
Minangkabau | min | 🚧 Minimal | Basic configuration only
Your Language | xxx | 💡 Create one! | See Language Pack Guide

Managing Language Packs

# List installed packs
langquality pack list

# Show pack details
langquality pack info fon

# Create a new pack template
langquality pack create <language_code>

# Validate a pack
langquality pack validate path/to/pack

Creating Your Own Language Pack

Creating a Language Pack for your language is straightforward:

  1. Generate a template:

    langquality pack create <your_language_code>

  2. Configure the pack: Edit config.yaml with language-specific settings

  3. Add resources (optional): Add lexicons, stopwords, or other linguistic resources

  4. Test it:

    langquality pack validate path/to/your_pack
    langquality analyze --input test_data --output results --language <your_language_code>

See the Language Pack Guide for detailed instructions.

📖 Documentation

🎯 Use Cases

LangQuality is designed for researchers and developers working with low-resource languages:

  • ASR Dataset Preparation: Ensure text quality before audio recording
  • Machine Translation: Validate parallel corpora quality
  • Language Model Training: Assess dataset diversity and balance
  • Corpus Linguistics: Analyze linguistic properties of text collections
  • Data Curation: Filter and improve existing datasets

🔧 Advanced Features

Custom Configuration

Override default thresholds and settings:

langquality analyze --input data --output results --language fon --config my_config.yaml

Example configuration:

thresholds:
  structural:
    min_words: 5
    max_words: 15
  diversity:
    target_ttr: 0.65
  gender:
    target_ratio: [0.45, 0.55]
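
To show how thresholds like these might be applied, here is a small Python sketch that mirrors the example configuration as a dictionary. The helper names are hypothetical, and the reading of `target_ratio` as the acceptable share of feminine mentions among all gendered mentions is an assumption, not documented LangQuality behavior.

```python
# Mirrors the example configuration above as a plain dict.
thresholds = {
    "structural": {"min_words": 5, "max_words": 15},
    "diversity": {"target_ttr": 0.65},
    "gender": {"target_ratio": [0.45, 0.55]},
}


def flag_length(sentence, structural):
    """Classify a sentence by whitespace word count against min/max thresholds."""
    n_words = len(sentence.split())
    if n_words < structural["min_words"]:
        return "too_short"
    if n_words > structural["max_words"]:
        return "too_long"
    return "ok"


def gender_ratio_in_range(feminine, masculine, target_ratio):
    """Assumed meaning: feminine share of gendered mentions lies within [low, high]."""
    total = feminine + masculine
    if total == 0:
        return True  # no gendered mentions, so nothing to balance
    low, high = target_ratio
    return low <= feminine / total <= high


short_flag = flag_length("one two three", thresholds["structural"])
balanced = gender_ratio_in_range(48, 52, thresholds["gender"]["target_ratio"])
```

With the values above, a three-word sentence falls below `min_words`, and 48 feminine against 52 masculine mentions falls inside the assumed target band.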

Custom Analyzers

Create custom analyzers for specialized analysis:

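from langquality.analyzers import Analyzer

class ToneAnalyzer(Analyzer):
    """Analyze tone and sentiment of sentences."""

    def analyze(self, sentences):
        # Replace this placeholder with your analysis logic.
        # See Creating Analyzers for the return type the pipeline expects.
        metrics = {"sentence_count": len(sentences)}
        return metrics

    def get_requirements(self):
        return ["tone_lexicon"]  # Required resources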

Place your analyzer in the plugins directory and it will be automatically discovered.
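
For readers curious how directory-based discovery can work in general, the following is a minimal sketch built on `importlib`. It does not describe LangQuality's actual loader: the stand-in `Analyzer` class, the `discover_analyzers` function, and the trick of injecting the base class into each plugin module are all assumptions made to keep the example self-contained.

```python
import importlib.util
import inspect
import tempfile
from pathlib import Path


class Analyzer:
    """Stand-in base class for this sketch (not langquality.analyzers.Analyzer)."""

    def analyze(self, sentences):
        raise NotImplementedError


def discover_analyzers(plugin_dir, base_class):
    """Import every .py file in plugin_dir and collect subclasses of base_class."""
    found = {}
    for path in sorted(Path(plugin_dir).glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        # Sketch-only shortcut: expose the base class to the plugin.
        # A real plugin would import it from the toolkit instead.
        module.Analyzer = base_class
        spec.loader.exec_module(module)
        for name, obj in inspect.getmembers(module, inspect.isclass):
            if issubclass(obj, base_class) and obj is not base_class:
                found[name] = obj
    return found


PLUGIN_SOURCE = '''
class WordCountAnalyzer(Analyzer):
    def analyze(self, sentences):
        return {"total_words": sum(len(s.split()) for s in sentences)}
'''

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "word_count.py").write_text(PLUGIN_SOURCE)
    analyzers = discover_analyzers(tmp, Analyzer)
    result = analyzers["WordCountAnalyzer"]().analyze(["a b c", "d e"])
```

Filtering on `issubclass` means helper classes and unrelated imports in a plugin file are ignored, so only genuine analyzers are registered.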

See Creating Analyzers for details.

🤝 Contributing

We welcome contributions from the community, such as:

  • 🌍 Creating a Language Pack for your language
  • 🔧 Adding new analyzers or features
  • 📝 Improving documentation
  • 🐛 Reporting bugs or issues
  • 💡 Suggesting enhancements

Please see our Contributing Guide for:

  • Code of Conduct
  • Development setup
  • Contribution workflow
  • Coding standards
  • Testing requirements

Quick Contribution Steps

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Make your changes and add tests
  4. Ensure tests pass: pytest
  5. Commit: git commit -m 'Add amazing feature'
  6. Push: git push origin feature/amazing-feature
  7. Open a Pull Request

👥 Community

Join our community to get help, share ideas, and collaborate:

Support Channels

📊 Project Status

LangQuality is actively maintained and under continuous development. See our CHANGELOG for recent updates and our Roadmap for planned features.

Current version: 1.0.0 (Stable)

📜 License

LangQuality is released under the MIT License. You are free to use, modify, and distribute this software for any purpose, including commercial applications.

🙏 Acknowledgments

LangQuality evolved from the Fongbe Data Quality Pipeline, originally developed to support dataset creation for Fongbe, a low-resource language from Benin. We're grateful to:

  • The linguistic community working on African language preservation and NLP development
  • Contributors who have created Language Packs and shared their expertise
  • The open-source NLP community for tools and libraries that make this work possible

📚 Citation

If you use LangQuality in your research, please cite:

@software{langquality_toolkit,
  title={LangQuality: Language Quality Toolkit for Low-Resource Languages},
  author={LangQuality Community},
  year={2024},
  url={https://github.com/langquality/langquality},
  version={1.0.0}
}

🔗 Related Projects


Made with ❤️ for low-resource language communities worldwide

