outerbounds/sft-project

End-to-end finetuning

This repository shows how to do end-to-end fine-tuning on Outerbounds, from data integration, to model training, to serving the fine-tuned model as an OpenAI-compatible API. All of this happens inside your cloud account, driven by Outerbounds.

Repository structure

```
flows/
├── prepare_data/          # HuggingFace dataset ingestion
└── sft/                   # Supervised fine-tuning workflows

metaflow_extensions/       # Custom visualization plugins
util/                      # Shared ML utilities
data/                      # Dataset registration and local cache
models/                    # Model registration and local artifacts
```

Pipeline: prepare_data -> sft -> vLLM deployment

Assets lifecycle

Assets are registered explicitly so that the resulting data streams form the traceable lineage needed to understand how an ML system functions and why it produces the results it does. There are three types of assets in the Outerbounds system: code, data, and models. This repository is a code asset; a dataset or model asset may come from HuggingFace or point to an S3 key.

Assets are registered in Python code. In general, you have complete flexibility over when to register assets:

  • a user or CI launches a manual FlowSpec run,
  • a user or CI manually triggers a run,
  • a user or CI deploys a FlowSpec,
  • the system triggers a run on event publication,
  • the system runs a FlowSpec on a schedule,
  • or even a user testing arbitrarily during development.

In this repository, assets are registered in CI. If you have not read about Metaflow projects before, now is a good time to pause, and return here after spending 5 minutes reading this documentation page. Branches are isolated namespaces within Metaflow projects; they can be used to test experimental workflows side-by-side with production ones, or to deploy multiple variants in the prod namespace. A key motivator for the "code as an asset" view of Outerbounds is that it unifies Metaflow's concept of branching with GitHub branches. The CI pattern in .github/workflows/deploy.yaml shows how to use this to

  • automate testing with end-to-end variants of the ML system,
  • connect dev, staging, and prod CI/CD environments to the corresponding ML system contexts,
  • automate system redeployment using familiar GitHub Actions,
  • and track code changes together with changes to models and datasets.
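As a rough illustration of the pattern, a workflow along these lines can deploy the flow whenever a branch is pushed. This is a hypothetical sketch, not the repository's actual deploy.yaml: the job layout, secret names, and branch mapping are illustrative assumptions.

```yaml
# Hypothetical sketch; see .github/workflows/deploy.yaml for the real pattern.
name: deploy
on:
  push:
    branches: [main, staging]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      # The Git branch determines the Metaflow project branch,
      # keeping GitHub branches and Metaflow namespaces in sync.
      - name: Deploy flow for this branch
        run: |
          python flows/sft/flow.py --environment=fast-bakery \
            argo-workflows create
        env:
          METAFLOW_PROFILE: ${{ secrets.METAFLOW_PROFILE }}
```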

Dataset registration

```python
# flows/prepare_data/flow.py

class PrepareData(ProjectFlow):
    ...
        ...
        self.prj.asset.register_data_asset(
            ds_name, kind="json", blobs=self.files
        )
        ...
    ...
```
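The `blobs` argument is a list of file references. As a rough local illustration of how such a list might be produced, the sketch below splits records into JSONL shards and collects their paths. The helper name is invented for illustration; the actual flow writes shards to S3 and registers those references.

```python
import json
from pathlib import Path

def write_jsonl_shards(records, out_dir, shard_size=1000):
    """Split records into JSONL shard files and return their paths.

    Hypothetical helper for illustration; the real flow stores
    shards in S3 and passes the resulting references as `blobs`.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(0, len(records), shard_size):
        path = out / f"shard-{i // shard_size:05d}.jsonl"
        with path.open("w") as f:
            for rec in records[i : i + shard_size]:
                f.write(json.dumps(rec) + "\n")
        paths.append(str(path))
    return paths
```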

Model registration

```python
# flows/sft/flow.py
    ...
        ...
        self.prj.asset.register_model_asset(
            self.sft_model_ref['model_uuid'],
            kind="model",
            blobs=[self.sft_model_ref['url']],
            description=f"SFT model checkpoint for task {current.task_id}"
        )
        ...
    ...
```

FlowSpecs

First, cache data from Hugging Face datasets in your S3 bucket:

```
python flows/prepare_data/flow.py --environment=fast-bakery run --with kubernetes
```


Example: Data asset registered by PrepareData as shown in the Outerbounds UI

The next flow fine-tunes a model on the prepared dataset, and in turn registers the fine-tuned model as an asset:

```
python flows/sft/flow.py --environment=fast-bakery run --with kubernetes
```
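Supervised fine-tuning typically trains on prompt/response pairs rendered into a chat format. A minimal sketch of that preprocessing step is below; the field names and structure are illustrative assumptions, not the repo's actual code, and real pipelines usually apply the target model's own chat template via its tokenizer.

```python
def to_chat_example(prompt, response, system=None):
    """Render one prompt/response pair as a chat-style training example.

    Illustrative only; SFT frameworks commonly accept this
    {"messages": [...]} shape as a training record.
    """
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    messages.append({"role": "assistant", "content": response})
    return {"messages": messages}
```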


Example: Model asset registered by SFT as shown in the Outerbounds UI

outerbounds apps

Finally, deploy an inference server using the custom-trained model:

```
outerbounds app deploy --config-file apps/vllm/config.yaml
```
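Because the deployed app speaks the OpenAI-compatible chat completions protocol, any OpenAI client can call it. The sketch below builds such a request without sending it; the base URL and model name are placeholders, not values from this repository.

```python
import json
from urllib.request import Request

# Placeholder endpoint; use your deployed app's URL.
BASE_URL = "https://your-app.example.outerbounds.dev/v1"

def chat_completion_request(model, prompt):
    """Build an OpenAI-compatible /chat/completions request (not sent here)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = chat_completion_request("my-sft-model", "Hello!")
```

In practice you would point the official `openai` client at the same base URL instead of constructing requests by hand.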
