🧱 MASON

Model Adaptation & Synthetic Optimization eNgine

MASON is an end-to-end toolkit designed to automate the complex workflow of creating domain-specific small language models. It takes you from your raw documents (PDFs, DOCX, etc.) to a fully trained, evaluated, and deployment-ready model.

The core of MASON is Synthetic Optimization: it uses strong "judge" LLMs (like GPT-4) to generate a high-quality synthetic instruction dataset from your own knowledge base. It then uses this data to adapt an open-source base model (like Llama 3 or Phi-3) using Parameter-Efficient Fine-Tuning (PEFT).

⚠️ MASON is under construction

This project is still in a very early stage and not ready for production. Most of the features explained in this README are still under design. The API can and will change dramatically before we reach version 1.0.

The goal: Stop building complex RAG pipelines and start training small, expert models that already know your domain.

🌟 Key Features

  • Synthetic Data Generation: Automatically generate high-quality instruction-following datasets from your local documents (.pdf, .docx, .pptx, .md, .html, .xml, etc.).
  • Flexible Model Support: Adapt any model that the Hugging Face transformers library supports.
  • Optional Human-in-the-Loop (HITL): Use the built-in Text-based User Interface (TUI) to seed, vet, or hand-correct the synthetically generated data for full quality control.
  • Dual Configuration: Configure your pipelines programmatically with our Python API or declaratively with simple YAML files.
  • Built-in Evaluation: Automatically evaluate your adapted model using LLM-as-judge (with custom rubrics), standard metrics (ROUGE, BLEU), or your own custom evaluation logic.
  • Hardware Agnostic (with a boost): Runs on any hardware that transformers supports (CPU, Apple Silicon MPS). For high-performance NVIDIA training, MASON has optional unsloth support built-in.
  • Built-in Experiment Tracking: A simple, self-contained database tracks all your runs, metrics, and configurations, complete with simple reports for easy comparison.
  • Deployment-Ready: Export your final trained models to common formats like GGUF or MLX to be served by any inference engine (like vLLM).

⚙️ How it Works: The MASON Workflow

MASON automates a 5-step workflow, giving you control at every stage.

  1. Ingest Knowledge: Point MASON at a directory of your raw, unstructured documents. It will parse and index this knowledge base, preparing it for data generation.

  2. Generate Synthetic Dataset: MASON uses a powerful "judge" LLM (via API) to read your knowledge base and synthetically generate, answer, and pre-evaluate thousands of instruction/question-answer pairs, building a balanced dataset.

  3. Curate Dataset (Optional): Launch the built-in TUI to become the human-in-the-loop. You can stay fully hands-off, or you can provide an initial "seed" of instructions or manually vet and edit the data generated by the LLM.

  4. Define & Train Pipeline: Define your entire pipeline in a single YAML file or with the Python API. Specify the base model, the PEFT method (e.g., LoRA; a minimal configuration sketch follows this list), the training hyperparameters, and the evaluation rubrics. Then simply run mason run.

  5. Evaluate, Iterate & Export: MASON runs the full pipeline: data generation, training, and evaluation. It saves the results, metrics, and model artifacts to its built-in experiment tracker. Review your results, tweak your config, and, when you're satisfied, export the final model for production.
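
For context on Step 4, here is a minimal sketch of what a LoRA setup looks like using the Hugging Face peft library directly. The base model id, target modules, and dropout value are illustrative assumptions; MASON's own training code may wire this up differently.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the base model to adapt (illustrative model id)
base = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# LoRA hyperparameters mirroring the quickstart YAML below (lora_r: 16, lora_alpha: 32)
config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt (illustrative)
    task_type="CAUSAL_LM",
)

# Wrap the base model; only the small adapter weights remain trainable
model = get_peft_model(base, config)
model.print_trainable_parameters()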

📦 Installation

Get started with MASON using pip:

pip install mason-ai

Hardware Requirements

MASON is built on Hugging Face transformers and is designed to be hardware-flexible.

  • Standard (CPU / Apple Silicon): The default pip install works perfectly for running MASON on a CPU or an Apple Silicon Mac (using MPS).

  • High-Performance (NVIDIA): For massively faster training on NVIDIA GPUs, MASON supports unsloth. To enable it, install MASON with the unsloth extra:

    pip install "mason-ai[unsloth]"
    
    # You will also need the NVIDIA Container Toolkit or relevant CUDA libraries
    # See: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
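
After installing the CUDA-enabled extras, a quick sanity check that PyTorch can actually see your GPU (and that unsloth acceleration has something to run on):

import torch

# Confirm that the CUDA runtime and a compatible NVIDIA GPU are visible to PyTorch
print(torch.cuda.is_available())      # True on a working NVIDIA setup
print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA A100-SXM4-40GB"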

🚀 Quickstart: A Complete Workflow in YAML

You can define and run an entire workflow with a single YAML file.

1. Create your pipeline.yaml:

# pipeline.yaml
# A complete workflow to train a custom-domain model.

# 1. The base model to adapt
base_model: "meta-llama/Meta-Llama-3-8B-Instruct"

# 2. Your local knowledge base to learn from
knowledge_source:
  type: local
  path: ./my_company_docs/
  glob: "**/*.pdf"

# 3. Rules for generating the synthetic dataset
synthetic_data:
  judge_llm: "openai/gpt-4-turbo"  # The LLM to generate data
  task_type: "question_answering"
  num_instructions: 250

# 4. (Optional) Enable the TUI for human-in-the-loop curation
curation:
  enabled: true

# 5. Training configuration (PEFT/LoRA)
training:
  type: peft
  method: lora
  lora_r: 16
  lora_alpha: 32
  # Enable unsloth for NVIDIA speedup if installed
  use_unsloth: true

# 6. Evaluation metrics to run after training
evaluation:
  metrics: [rougeL, bleu]
  llm_judge:
    judge_llm: "openai/gpt-4o"
    rubrics: [faithfulness, conciseness, domain_accuracy]

# 7. Final model export for deployment
export:
  format: gguf
  path: ./exports/my_expert_model.gguf

2. Run the pipeline:

mason run pipeline.yaml

MASON will now execute the entire workflow, and you can monitor its progress.
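
The same workflow can also be driven programmatically. Since the Python Task API is still under design, the sketch below is purely hypothetical: mason.Pipeline, from_yaml(), and run() are placeholder names meant to convey the intended shape, not the final interface.

# Hypothetical sketch only: the real Python Task API is still under design,
# so mason.Pipeline, from_yaml(), and run() are illustrative placeholders.
import mason

# Load the same declarative config the CLI would consume...
pipeline = mason.Pipeline.from_yaml("pipeline.yaml")

# ...and execute the full workflow: ingest, generate, (curate,) train, evaluate, export.
result = pipeline.run()

print(result.metrics)      # hypothetical: final training/evaluation metrics
print(result.export_path)  # hypothetical: path to the exported model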

A Note on the MASON CLI Philosophy

The MASON CLI is a thin wrapper around our powerful Python Task API. You will not find complex CLI commands with dozens of options and flags.

This is a deliberate design choice to ensure consistent and reproducible workflows. All of your "work"—the model configuration, data generation rules, and training parameters—lives durably in your pipeline.yaml (or Python code), not in your shell history.

  • Commands with Side Effects: Any command that changes something (generates data, runs training, etc.) is invoked via mason run. You can run the entire file or target a specific task:

    # Run the entire pipeline
    mason run pipeline.yaml
    
    # Run *only* the 'training' task from the pipeline
    mason run pipeline.yaml training
  • Read-Only Commands: Other "fire and forget" CLI commands exist for read-only operations, such as fetching statistics or generating reports from the experiment database.

Component Deep-Dive

📈 Integrated Experiment & Metrics Tracking

MASON provides a fully integrated solution for experiment tracking, powered by a custom TrainerCallback and our BeaverDB database.

When you run a training pipeline, MASON automatically injects a callback into the Hugging Face Trainer. This callback streams all training and evaluation metrics (like train/loss, eval/loss, epoch, etc.) directly from the trl trainer into your BeaverDB instance—potentially the very same database you used for data ingestion.

This approach means:

  • No External Tools: You don't need to set up or integrate external services like MLflow or Weights & Biases.
  • Unified Data: All your source data, synthetic datasets, configurations, and training metrics can live together in one place.
  • Live Metrics: You can monitor your training runs in real-time by querying the database.
  • Simple Reporting: MASON's CLI can then query this database to generate reports, compare runs, and track performance history.
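
To make the mechanism concrete, here is a rough sketch of a metrics-streaming callback built on the Hugging Face Trainer callback API. The store object stands in for a BeaverDB handle, and its insert() method is a hypothetical placeholder; MASON's actual callback and database interface may differ.

from transformers import TrainerCallback

class MetricsStreamingCallback(TrainerCallback):
    """Streams every logged metrics dict into an experiment store."""

    def __init__(self, store):
        # `store` stands in for a BeaverDB handle; insert() is a hypothetical method.
        self.store = store

    def on_log(self, args, state, control, logs=None, **kwargs):
        # `logs` carries entries such as {"loss": 1.23, "epoch": 0.5} or {"eval_loss": ...}
        if logs:
            self.store.insert({"step": state.global_step, **logs})

MASON registers its own version of such a callback automatically when it builds the trl trainer, so metrics flow into the database without any extra code on your side.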

🔩 Deployment Philosophy: Bring Your Own Inference

MASON is a training and adaptation toolkit, not an inference server. This is a deliberate design choice to keep the tool lightweight and focused.

Once you are happy with a trained model, you export it from MASON to a deployment-ready format (like GGUF, MLX, or just the LoRA adapter weights). You can then serve this model using any professional inference engine, such as vLLM, TGI, or llama.cpp.
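
For example, once the GGUF file from the quickstart has been exported, one quick way to smoke-test it locally is with the llama-cpp-python bindings (the prompt, context size, and path are illustrative):

from llama_cpp import Llama

# Load the exported GGUF model produced by the quickstart pipeline
llm = Llama(model_path="./exports/my_expert_model.gguf", n_ctx=4096)

# Ask a domain question the adapted model should now be able to answer on its own
output = llm("What is our standard onboarding procedure?", max_tokens=128)
print(output["choices"][0]["text"])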

🤝 Contributing

We welcome contributions! Please see our CONTRIBUTING.md file for details on how to set up your environment and submit a pull request.

📄 License

This project is licensed under the MIT license.
