outerbounds/sft-project

End-to-end finetuning

This repository shows how to do end-to-end fine-tuning on Outerbounds, from data integration, to model training, to serving the fine-tuned model as an OpenAI-compatible API. All of this happens inside your cloud account, driven by Outerbounds.

Repository structure

```
flows/
├── prepare_data/          # HuggingFace dataset ingestion
└── sft/                   # Supervised fine-tuning workflows

metaflow_extensions/       # Custom visualization plugins
util/                      # Shared ML utilities
data/                      # Dataset registration and local cache
models/                    # Model registration and local artifacts
```

Pipeline: prepare_data -> sft -> vLLM deployment

Assets lifecycle

Assets are registered explicitly so that the resulting data streams form the traceable lineage needed to understand how an ML system functions and why it produces the results it does. There are three types of assets in the Outerbounds system: code, data, and models. This repository is a code asset; a dataset or model asset may come from HuggingFace or point to an S3 key.

Assets are registered in Python code. In general, you have complete flexibility over when to register assets:

  • a user or CI launches a manual FlowSpec run,
  • a user or CI manually triggers a run,
  • a user or CI deploys a FlowSpec,
  • the system triggers a run on event publication,
  • the system runs a FlowSpec on a schedule,
  • or even a user testing arbitrarily during development.

In this repository, assets are registered in CI. If you have not read about Metaflow projects before, now is a good time to pause, and return here after spending 5 minutes reading this documentation page. Branches are isolated namespaces within Metaflow projects; they can be used to test experimental workflows side-by-side with production ones, or to deploy multiple variants in the prod namespace. A key motivator for the "code as an asset" view of Outerbounds is that it unifies Metaflow's concept of branching with GitHub branches. The CI pattern in .github/workflows/deploy.yaml shows how to use this to

  • automate testing with end-to-end variants of the ML system,
  • connect dev, staging, and prod CI/CD environments to the corresponding ML system contexts,
  • automate system redeployment using familiar GitHub Actions,
  • and track code changes together with changes to models and datasets.
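As a rough illustration of the pattern, a workflow along these lines can deploy the flow whenever a branch is pushed. This is a hypothetical sketch, not the repository's actual deploy.yaml: the job layout, secret names, and branch mapping are illustrative assumptions.

```yaml
# Hypothetical sketch; see .github/workflows/deploy.yaml for the real pattern.
name: deploy
on:
  push:
    branches: [main, staging]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      # The Git branch determines the Metaflow project branch,
      # keeping GitHub branches and Metaflow namespaces in sync.
      - name: Deploy flow for this branch
        run: |
          python flows/sft/flow.py --environment=fast-bakery \
            argo-workflows create
        env:
          METAFLOW_PROFILE: ${{ secrets.METAFLOW_PROFILE }}
```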

Dataset registration

```python
# flows/prepare_data/flow.py

class PrepareData(ProjectFlow):
    ...
        ...
        self.prj.asset.register_data_asset(
            ds_name, kind="json", blobs=self.files
        )
        ...
    ...
```
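The `blobs` argument is a list of file references. As a rough local illustration of how such a list might be produced, the sketch below splits records into JSONL shards and collects their paths. The helper name is invented for illustration; the actual flow writes shards to S3 and registers those references.

```python
import json
from pathlib import Path

def write_jsonl_shards(records, out_dir, shard_size=1000):
    """Split records into JSONL shard files and return their paths.

    Hypothetical helper for illustration; the real flow stores
    shards in S3 and passes the resulting references as `blobs`.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(0, len(records), shard_size):
        path = out / f"shard-{i // shard_size:05d}.jsonl"
        with path.open("w") as f:
            for rec in records[i : i + shard_size]:
                f.write(json.dumps(rec) + "\n")
        paths.append(str(path))
    return paths
```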

Model registration

```python
# flows/sft/flow.py
    ...
        ...
        self.prj.asset.register_model_asset(
            self.sft_model_ref['model_uuid'],
            kind="model",
            blobs=[self.sft_model_ref['url']],
            description=f"SFT model checkpoint for task {current.task_id}"
        )
        ...
    ...
```

FlowSpecs

First, cache data from Hugging Face datasets in your S3 bucket:

```
python flows/prepare_data/flow.py --environment=fast-bakery run --with kubernetes
```


Example: Data asset registered by PrepareData as shown in the Outerbounds UI

The next flow fine-tunes a model on the prepared dataset, and in turn registers the fine-tuned model as an asset:

```
python flows/sft/flow.py --environment=fast-bakery run --with kubernetes
```
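Supervised fine-tuning typically trains on prompt/response pairs rendered into a chat format. A minimal sketch of that preprocessing step is below; the field names and structure are illustrative assumptions, not the repo's actual code, and real pipelines usually apply the target model's own chat template via its tokenizer.

```python
def to_chat_example(prompt, response, system=None):
    """Render one prompt/response pair as a chat-style training example.

    Illustrative only; SFT frameworks commonly accept this
    {"messages": [...]} shape as a training record.
    """
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    messages.append({"role": "assistant", "content": response})
    return {"messages": messages}
```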


Example: Model asset registered by SFT as shown in the Outerbounds UI

outerbounds apps

Finally, deploy an inference server using the custom-trained model:

```
outerbounds app deploy --config-file apps/vllm/config.yaml
```
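Because the deployed app speaks the OpenAI-compatible chat completions protocol, any OpenAI client can call it. The sketch below builds such a request without sending it; the base URL and model name are placeholders, not values from this repository.

```python
import json
from urllib.request import Request

# Placeholder endpoint; use your deployed app's URL.
BASE_URL = "https://your-app.example.outerbounds.dev/v1"

def chat_completion_request(model, prompt):
    """Build an OpenAI-compatible /chat/completions request (not sent here)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = chat_completion_request("my-sft-model", "Hello!")
```

In practice you would point the official `openai` client at the same base URL instead of constructing requests by hand.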
