
Conversation

techkang (Collaborator) commented Oct 11, 2025

What does this PR do?

This PR introduces support for VLM in SFT training.

It builds upon the work in PR #3590 and PR #3589, incorporating their contributions while fixing existing bugs. Due to the significant number of new features added to the main branch since the original PRs were opened, rebasing them became impractical. This new PR serves as a consolidated and up-to-date implementation.

The code is tested through test_sft_engine_vllm_all.sh and the results are as follows. (Megatron is not tested for now because its loss is too high.)

+ python3 tests/special_e2e/sft/compare_sft_engine_results.py --sub_dir verl_vlm_sft_test --loss_only
compare results mnist-fsdp-fsdp2-sp4-fsdp4--use_remove_padding-True--Dynamic-bsz-True.jsonl
compare results mnist-fsdp-fsdp-sp1-fsdp-1--use_remove_padding-True--Dynamic-bsz-True.jsonl
compare results mnist-fsdp-fsdp2-sp1-fsdp1--use_remove_padding-True--Dynamic-bsz-True.jsonl
compare results mnist-fsdp-fsdp2-sp2-fsdp-1--use_remove_padding-True--Dynamic-bsz-True.jsonl
compare results mnist-fsdp-fsdp2-sp1-fsdp-1--use_remove_padding-True--Dynamic-bsz-True.jsonl
compare results mnist-fsdp-fsdp-sp1-fsdp-1--use_remove_padding-False--Dynamic-bsz-True.jsonl
compare results mnist-fsdp-fsdp-sp1-fsdp2--use_remove_padding-True--Dynamic-bsz-True.jsonl
compare results mnist-fsdp-fsdp-sp2-fsdp-1--use_remove_padding-True--Dynamic-bsz-True.jsonl
compare results mnist-fsdp-fsdp2-sp1-fsdp2--use_remove_padding-True--Dynamic-bsz-True.jsonl
compare results mnist-fsdp-fsdp-sp4-fsdp4--use_remove_padding-True--Dynamic-bsz-True.jsonl
All results are close to golden results

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces support for Vision-Language Models (VLM) in the SFT trainer. The changes are comprehensive, affecting dataset creation, data processing, and the training engine to handle multi-modal inputs. While the overall approach is sound, I've identified a few critical issues related to data processing and model output handling that could lead to runtime errors or incorrect behavior. Please address these points to ensure the stability and correctness of the new VLM functionality.

Comment on lines +277 to +278
```python
for conv in messages:
    for content in conv["content"]:
        for k, v in content.items():
            if v is None:
                content.pop(k)
                break
```

critical

This loop for removing None values from the content dictionary is incorrect. The break statement will cause the inner loop to exit after removing only the first key with a None value. If a content dictionary contains multiple None values, the subsequent ones will not be removed, which could lead to unexpected behavior or errors downstream. Additionally, modifying a dictionary while iterating over its items is unsafe and can lead to unpredictable results.

A more robust and Pythonic way to achieve this is to use a list comprehension to rebuild the list of content dictionaries, filtering out any keys with None values.

```python
for conv in messages:
    conv["content"] = [
        {k: v for k, v in content.items() if v is not None} for content in conv["content"]
    ]
```
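
To make the failure concrete, here is a small demonstration with a hypothetical content dictionary (the data is illustrative, not from the PR):

```python
# Hypothetical content dict with two None-valued keys.
content = {"type": "image", "text": None, "image": None}

for k, v in content.items():
    if v is None:
        content.pop(k)
        break  # without this break, popping mid-iteration raises RuntimeError in CPython

print(content)  # {'type': 'image', 'image': None} -- the second None key survives
```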

Comment on lines 946 to 898
```python
if hasattr(output, "last_hidden_state"):
    logits = output.last_hidden_state
else:
    logits = output.logits
```

critical

There is a potential type and shape mismatch here. output.last_hidden_state typically contains embeddings from the transformer body, with a shape of (batch_size, seq_len, hidden_size). However, the variable is named logits, and it's later used in logprobs_from_logits, which expects model logits of shape (batch_size, seq_len, vocab_size). Assigning hidden states to logits will likely cause a dimension mismatch error and incorrect calculations downstream. If the intention is to handle models that don't have a language model head applied, the hidden states should be passed through an LM head before being treated as logits.
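
As a minimal sketch (not the PR's actual code), assuming the unwrapped model exposes its language model head as `lm_head` (the attribute name is an assumption and varies across architectures):

```python
# Sketch only: project hidden states through the LM head before treating them as logits.
# `model.lm_head` is an assumed attribute name, not confirmed by this PR.
if hasattr(output, "last_hidden_state"):
    hidden_states = output.last_hidden_state  # (batch_size, seq_len, hidden_size)
    logits = model.lm_head(hidden_states)     # (batch_size, seq_len, vocab_size)
else:
    logits = output.logits
```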

Comment on lines 56 to 60
if "multi_modal_inputs" in micro_batches[0]:
multi_modal_data = micro_batches[0]["multi_modal_inputs"]
for batch, indexes in zip(micro_batches, batch_idx_list, strict=False):
batch["multi_modal_inputs"] = [multi_modal_data[i] for i in indexes]

critical

This block of code for distributing multi_modal_inputs will cause a TypeError when use_dynamic_bsz is False. In that case, batch_idx_list is None, and passing None to zip() is not allowed. This logic is only applicable when dynamic batch sizing is enabled, as batch_idx_list is only populated in that scenario. To fix this, this block should be moved inside the if use_dynamic_bsz: block, right after rearrange_micro_batches is called.
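
A sketch of the suggested placement; the call signature of `rearrange_micro_batches` and the fixed-size split in the `else` branch are assumed for illustration:

```python
if use_dynamic_bsz:
    micro_batches, batch_idx_list = rearrange_micro_batches(batch, max_token_len=max_token_len)
    # Safe here: batch_idx_list is only populated on this path.
    if "multi_modal_inputs" in micro_batches[0]:
        multi_modal_data = micro_batches[0]["multi_modal_inputs"]
        for micro_batch, indexes in zip(micro_batches, batch_idx_list, strict=False):
            micro_batch["multi_modal_inputs"] = [multi_modal_data[i] for i in indexes]
else:
    batch_idx_list = None  # zip(micro_batches, None) would raise TypeError, hence the move
    micro_batches = batch.split(micro_batch_size)  # assumed fixed-size split, as in the existing code
```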

ccclyu (Collaborator) left a comment

Before we merge into main, can we also have some correctness validation in the PR description, such as training curves in wandb or similar logs? cc: @vermouth1992

techkang (Collaborator, Author) commented:

@ccclyu Thanks for the advice. I've updated the training curves.

techkang requested a review from ISEEKYAN as a code owner on October 14, 2025, 14:07
techkang (Collaborator, Author) commented:

@vermouth1992 The FSDP part is finished. The losses are all close, but the grad norms are slightly different. Do you think that's OK?

techkang changed the title from "[trainer] feat: vlm support" to "[trainer] feat: vlm support for sft engine" on Oct 15, 2025
techkang (Collaborator, Author) commented:

The grad norms cannot strictly match, but this does not seem to affect the losses, and it is difficult to find the root cause.

File                           Loss            Grad Norm      
============================================================
golden.jsonl                   0.497927        16.012671      
mnist-fsdp-fsdp2-sp4-fsdp4--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.500616        24.772402      
mnist-fsdp-fsdp-sp1-fsdp-1--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.502082        16.083515      
mnist-fsdp-fsdp2-sp1-fsdp1--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.498409        16.051727      
mnist-fsdp-fsdp2-sp2-fsdp-1--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.501057        16.041042      
mnist-fsdp-fsdp2-sp1-fsdp-1--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.502616        16.145720      
mnist-fsdp-fsdp-sp1-fsdp-1--use_remove_padding-False--Dynamic-bsz-True.jsonl 0.502082        16.095524      
mnist-fsdp-fsdp-sp1-fsdp2--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.502082        16.083515      
mnist-fsdp-fsdp-sp2-fsdp-1--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.500927        34.486794      
mnist-fsdp-fsdp2-sp1-fsdp2--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.502616        16.145720      
mnist-fsdp-fsdp-sp4-fsdp4--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.500999        33.218281      
golden.jsonl                   0.259755        10.312643      
mnist-fsdp-fsdp2-sp4-fsdp4--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.265249        12.020154      
mnist-fsdp-fsdp-sp1-fsdp-1--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.263239        12.139799      
mnist-fsdp-fsdp2-sp1-fsdp1--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.261769        10.271201      
mnist-fsdp-fsdp2-sp2-fsdp-1--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.264032        14.794047      
mnist-fsdp-fsdp2-sp1-fsdp-1--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.265064        10.942727      
mnist-fsdp-fsdp-sp1-fsdp-1--use_remove_padding-False--Dynamic-bsz-True.jsonl 0.261775        12.690385      
mnist-fsdp-fsdp-sp1-fsdp2--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.263239        12.139799      
mnist-fsdp-fsdp-sp2-fsdp-1--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.265909        11.989518      
mnist-fsdp-fsdp2-sp1-fsdp2--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.265064        10.942728      
mnist-fsdp-fsdp-sp4-fsdp4--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.266730        10.292510      
golden.jsonl                   0.137759        21.477085      
mnist-fsdp-fsdp2-sp4-fsdp4--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.138107        17.431028      
mnist-fsdp-fsdp-sp1-fsdp-1--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.136704        16.284973      
mnist-fsdp-fsdp2-sp1-fsdp1--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.134795        15.737302      
mnist-fsdp-fsdp2-sp2-fsdp-1--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.141937        15.894790      
mnist-fsdp-fsdp2-sp1-fsdp-1--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.137041        25.760611      
mnist-fsdp-fsdp-sp1-fsdp-1--use_remove_padding-False--Dynamic-bsz-True.jsonl 0.136577        15.796288      
mnist-fsdp-fsdp-sp1-fsdp2--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.136704        16.284975      
mnist-fsdp-fsdp-sp2-fsdp-1--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.134639        16.124317      
mnist-fsdp-fsdp2-sp1-fsdp2--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.137041        25.760609      
mnist-fsdp-fsdp-sp4-fsdp4--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.135513        23.860678      
golden.jsonl                   0.084934        6.851436       
mnist-fsdp-fsdp2-sp4-fsdp4--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.082373        9.818422       
mnist-fsdp-fsdp-sp1-fsdp-1--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.081413        10.701051      
mnist-fsdp-fsdp2-sp1-fsdp1--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.081043        5.753160       
mnist-fsdp-fsdp2-sp2-fsdp-1--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.083465        7.913788       
mnist-fsdp-fsdp2-sp1-fsdp-1--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.086534        8.471581       
mnist-fsdp-fsdp-sp1-fsdp-1--use_remove_padding-False--Dynamic-bsz-True.jsonl 0.079530        6.781807       
mnist-fsdp-fsdp-sp1-fsdp2--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.081427        10.528720      
mnist-fsdp-fsdp-sp2-fsdp-1--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.080989        5.551077       
mnist-fsdp-fsdp2-sp1-fsdp2--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.086534        8.471581       
mnist-fsdp-fsdp-sp4-fsdp4--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.082342        5.646751       
golden.jsonl                   0.054665        4.677221       
mnist-fsdp-fsdp2-sp4-fsdp4--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.055944        4.042932       
mnist-fsdp-fsdp-sp1-fsdp-1--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.055999        3.962130       
mnist-fsdp-fsdp2-sp1-fsdp1--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.052857        4.031875       
mnist-fsdp-fsdp2-sp2-fsdp-1--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.055214        3.828687       
mnist-fsdp-fsdp2-sp1-fsdp-1--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.057422        3.949810       
mnist-fsdp-fsdp-sp1-fsdp-1--use_remove_padding-False--Dynamic-bsz-True.jsonl 0.051577        4.000427       
mnist-fsdp-fsdp-sp1-fsdp2--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.055772        7.157363       
mnist-fsdp-fsdp-sp2-fsdp-1--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.050208        4.739748       
mnist-fsdp-fsdp2-sp1-fsdp2--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.057419        3.958420       
mnist-fsdp-fsdp-sp4-fsdp4--use_remove_padding-True--Dynamic-bsz-True.jsonl 0.053807        5.048426 
