[trainer] feat: vlm support for sft engine #3729
Conversation
Code Review
This pull request introduces support for Vision-Language Models (VLM) in the SFT trainer. The changes are comprehensive, affecting dataset creation, data processing, and the training engine to handle multi-modal inputs. While the overall approach is sound, I've identified a few critical issues related to data processing and model output handling that could lead to runtime errors or incorrect behavior. Please address these points to ensure the stability and correctness of the new VLM functionality.
```python
for conv in messages:
    for content in conv["content"]:
        for k, v in content.items():
            if v is None:
                content.pop(k)
                break
```
This loop for removing `None` values from the `content` dictionary is incorrect. The `break` statement causes the inner loop to exit after removing only the first key with a `None` value. If a `content` dictionary contains multiple `None` values, the subsequent ones will not be removed, which could lead to unexpected behavior or errors downstream. Additionally, modifying a dictionary while iterating over its items is unsafe and can lead to unpredictable results.

A more robust and Pythonic way to achieve this is to use a list comprehension to rebuild the list of content dictionaries, filtering out any keys with `None` values.
```python
for conv in messages:
    conv["content"] = [
        {k: v for k, v in content.items() if v is not None} for content in conv["content"]
    ]
```
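The difference between the two loops can be demonstrated with a small, hypothetical `messages` value (the keys below are illustrative, not taken from the PR):

```python
import copy

# Hypothetical messages value with multiple None keys in one content dict
messages = [
    {"role": "user", "content": [
        {"type": "image", "text": None, "image": "img.png", "video": None},
        {"type": "text", "text": "Describe the image.", "image": None},
    ]},
]

def strip_none_buggy(msgs):
    """Original loop: `break` exits after the first None key is removed."""
    msgs = copy.deepcopy(msgs)
    for conv in msgs:
        for content in conv["content"]:
            for k, v in content.items():
                if v is None:
                    content.pop(k)
                    break  # later None keys in this dict survive
    return msgs

def strip_none_fixed(msgs):
    """Suggested fix: rebuild each content dict, dropping every None value."""
    msgs = copy.deepcopy(msgs)
    for conv in msgs:
        conv["content"] = [
            {k: v for k, v in content.items() if v is not None}
            for content in conv["content"]
        ]
    return msgs

buggy = strip_none_buggy(messages)
fixed = strip_none_fixed(messages)
assert "video" in buggy[0]["content"][0]      # second None key was missed
assert "video" not in fixed[0]["content"][0]  # all None keys removed
```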
```python
if hasattr(output, "last_hidden_state"):
    logits = output.last_hidden_state
else:
    logits = output.logits
```
There is a potential type and shape mismatch here. `output.last_hidden_state` typically contains embeddings from the transformer body, with a shape of `(batch_size, seq_len, hidden_size)`. However, the variable is named `logits`, and it is later used in `logprobs_from_logits`, which expects model logits of shape `(batch_size, seq_len, vocab_size)`. Assigning hidden states to `logits` will likely cause a dimension mismatch error and incorrect calculations downstream. If the intention is to handle models that don't have a language model head applied, the hidden states should be passed through an LM head before being treated as logits.
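To illustrate the shape issue, here is a minimal sketch using NumPy (all names and sizes are hypothetical; in the real model the projection would be the model's own LM-head module):

```python
import numpy as np

# Hypothetical dimensions for illustration only
batch_size, seq_len, hidden_size, vocab_size = 2, 8, 16, 32

rng = np.random.default_rng(0)
hidden_states = rng.standard_normal((batch_size, seq_len, hidden_size))
W_lm_head = rng.standard_normal((hidden_size, vocab_size))  # stand-in LM-head weight

# Treating hidden states as logits leaves the wrong last dimension:
# hidden_size, not vocab_size
assert hidden_states.shape[-1] == hidden_size

# Projecting through the LM head yields vocab-sized logits, which is
# what a logprobs_from_logits-style function expects
logits = hidden_states @ W_lm_head
assert logits.shape == (batch_size, seq_len, vocab_size)
```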
`verl/workers/engine/utils.py` (outdated)
```python
if "multi_modal_inputs" in micro_batches[0]:
    multi_modal_data = micro_batches[0]["multi_modal_inputs"]
    for batch, indexes in zip(micro_batches, batch_idx_list, strict=False):
        batch["multi_modal_inputs"] = [multi_modal_data[i] for i in indexes]
```
This block of code for distributing `multi_modal_inputs` will cause a `TypeError` when `use_dynamic_bsz` is `False`. In that case, `batch_idx_list` is `None`, and passing `None` to `zip()` is not allowed. This logic is only applicable when dynamic batch sizing is enabled, as `batch_idx_list` is only populated in that scenario. To fix this, the block should be moved inside the `if use_dynamic_bsz:` block, right after `rearrange_micro_batches` is called.
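A minimal sketch of the proposed structure (function and parameter names here are assumed for illustration; the real code lives in `verl/workers/engine/utils.py`):

```python
def split_micro_batches(batch, use_dynamic_bsz, rearrange_micro_batches, plain_split):
    """Sketch: only redistribute multi_modal_inputs when batch_idx_list exists."""
    if use_dynamic_bsz:
        micro_batches, batch_idx_list = rearrange_micro_batches(batch)
        # Safe here: batch_idx_list is a real list of per-micro-batch indexes
        if "multi_modal_inputs" in micro_batches[0]:
            multi_modal_data = micro_batches[0]["multi_modal_inputs"]
            for mb, indexes in zip(micro_batches, batch_idx_list):
                mb["multi_modal_inputs"] = [multi_modal_data[i] for i in indexes]
    else:
        micro_batches = plain_split(batch)
        batch_idx_list = None  # zip(micro_batches, None) here would raise TypeError
    return micro_batches, batch_idx_list
```

With this guard, the `False` branch never touches `batch_idx_list`, so the `TypeError` cannot occur.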
Before we merge into `main`, can we also add some correctness validation to the PR description, such as training curves in wandb or similar logs? cc: @vermouth1992
@ccclyu Thanks for your advice. I updated the training curves.
@vermouth1992 The FSDP part is finished. The losses are all close, but the grad norms are slightly different. Do you think that's OK?
The grad norms cannot match exactly, but that does not seem to affect the losses, and it's really difficult to find the root cause.
What does this PR do?
This PR introduces support for VLM in SFT training.
It builds upon the work in PR #3590 and PR #3589, incorporating their contributions while fixing existing bugs. Due to the significant number of new features added to the main branch since the original PRs were opened, rebasing them became impractical. This new PR serves as a consolidated and up-to-date implementation.
The code is tested through
test_sft_engine_vllm_all.sh
and the result is as follow. (megatron is not tested by now because it's loss is too high.)