
Add Benchmarking and Fine-Tuning Support for ZenFlow #982

Open
Antlera wants to merge 5 commits into master from zenflow_z1_2_example

Conversation

@Antlera Antlera commented Jul 3, 2025

Description:

This PR introduces scripts for benchmarking and fine-tuning with ZenFlow:

  • zf_benchmark.py: Benchmark script for evaluating offloading performance (adapted from offload_states.py by @tohtana).
  • output_table.py: Parses and summarizes benchmark logs.
  • run_benchmark.sh: Automates benchmark runs with configurable parameters.
  • finetune_llama.py: Fine-tuning script for Llama-2 with DeepSpeed + ZenFlow.
  • finetune_llama.sh: Launch script for fine-tuning with environment setup.
  • zf_config.json: Example DeepSpeed config with ZenFlow optimizations (a config sketch is shown after this description).

Note: This PR is complementary to PR #7391 on the main repo, and should be merged together with (or after) PR #7391.
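
For orientation, here is a minimal sketch of how a launch script could generate a ZenFlow-style config and hand it to the fine-tuning script. The ZenFlow field names (topk_ratio, update_interval, overlap_step) and the finetune_llama.py argument are illustrative assumptions, not the exact schema of this PR; zf_config.json here and DeepSpeed PR #7391 are the authoritative references.

```bash
#!/bin/bash
# Sketch only: the "zenflow" option names and the finetune_llama.py argument
# below are assumptions for illustration; see zf_config.json in this PR and
# DeepSpeed PR #7391 for the real schema. The ZeRO / offload keys are
# standard DeepSpeed config.
cat > /tmp/zf_config_sketch.json <<'EOF'
{
  "train_batch_size": 8,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "zenflow": {
      "topk_ratio": 0.05,
      "update_interval": 4,
      "overlap_step": true
    }
  }
}
EOF

# Launch fine-tuning with the generated config (argument name assumed).
deepspeed finetune_llama.py --deepspeed_config /tmp/zf_config_sketch.json
```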

@Antlera Antlera requested a review from tjruwase as a code owner July 3, 2025 01:23
Antlera and others added 2 commits July 2, 2025 21:27
- Introduced `zf_benchmark.py` for model offloading benchmarking with DeepSpeed.
- Added `output_table.py` to parse and display benchmark results in a tabular format.
- Created `run_benchmark.sh` to automate benchmark runs with various configurations.

Signed-off-by: Tingfeng Lan <[email protected]>
- Introduced `finetune_llama.py` for fine-tuning the Llama-2 model using DeepSpeed and ZenFlow.
- Added `finetune_llama.sh` for automated training setup with environment variables and DeepSpeed command.
- Added `zf_config.json` example for DeepSpeed configuration with ZenFlow optimizations.

Signed-off-by: Tingfeng Lan <[email protected]>
Co-authored-by: Yusen Wu <[email protected]>
@Antlera Antlera force-pushed the zenflow_z1_2_example branch from ca441f5 to 0528aed on July 3, 2025 01:27
@Antlera Antlera (Author) commented Jul 3, 2025

Hi, @tohtana @tjruwase. Could you help review this PR when you have time? This is the example PR for PR #7391. Thanks!

@Antlera Antlera (Author) commented Aug 4, 2025

@sfc-gh-truwase Thanks for the great suggestions — I’ve applied them all!

@delock delock (Contributor) commented Aug 6, 2025

Hi @Antlera, I have a question. I saw that ZenFlow runs the parameter update on the CPU. Does the DeepSpeed argument --bind_cores_to_rank help with CPU optimizer efficiency? If so, maybe this switch could be added to the launch script.

Here is a link to this switch. It is intended for the CPU backend, but it should help with CPU offload as well.
https://github.com/deepspeedai/DeepSpeed/blob/master/docs/_tutorials/accelerator-setup-guide.md#how-to-launch-deepspeed-on-intel-architecture-cpu
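
If it does help, the launch line in finetune_llama.sh could look something like the sketch below; the arguments after finetune_llama.py are placeholders, only the launcher flag is the point.

```bash
# Sketch: --bind_cores_to_rank asks the DeepSpeed launcher to pin each rank
# to its own set of physical cores (via numactl), which may speed up the
# CPU-side optimizer step. Script arguments are placeholders.
deepspeed --bind_cores_to_rank finetune_llama.py --deepspeed_config zf_config.json
```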
