Add GLM4.1V model (Draft) #19331
Conversation
Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 …
Hello @zRzRzRzRzRzRzR, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello team, gemini-code-assist here with a summary of this pull request. This PR introduces support for the GLM-4.1V multimodal model. The primary goal is to integrate this new model into vLLM, enabling inference with both image and video inputs. This required adapting the model architecture, multimodal processing logic, and specific components like positional embeddings from the upstream transformers library implementation. Key changes include adding the model definition, implementing its vision tower and multimodal processing, adding a dedicated video loader with frame-sampling capabilities, and updating core vLLM components like rotary embeddings and chat utilities to handle the model's specific requirements and token formats.
Highlights
- New Model Support: Adds the `Glm4vForConditionalGeneration` model to support the GLM-4.1V multimodal model.
- Multimodal Processing: Implements the GLM-4.1V-specific vision transformer, patch merger, and multimodal processing logic to handle image and video inputs.
- Video Loading: Introduces a dedicated video loader (`Glm4vVideoLoader`) for GLM-4.1V that handles frame sampling and metadata extraction.
- Positional Encoding: Adds GLM-4.1V-specific logic to the rotary embedding layer to correctly calculate positions for multimodal inputs.
- Chat Utilities: Updates chat utility functions to recognize and handle GLM-4.1V's unique image and video placeholder tokens.
- Examples and Registry: Adds an example for offline inference with GLM-4.1V and registers the new model within vLLM.
Changelog
- examples/offline_inference/vision_language.py
  - Added `run_glm4_1v` function to demonstrate offline inference for GLM-4.1V, including setting up engine arguments and prompt formatting for image and video.
- tests/models/registry.py
  - Added `Glm4vForConditionalGeneration` to the `check_available_online` registry, mapping it to the Hugging Face model name 'THUDM/GLM-4.1V-9B'.
- vllm/entrypoints/chat_utils.py
  - Added handling for the `glm4v` model type in `_placeholder_str` to use the correct image and video placeholder tokens (`<|begin_of_image|><|image|><|end_of_image|>` and `<|begin_of_video|><|video|><|end_of_video|>`); see the sketch after this changelog.
- vllm/model_executor/layers/rotary_embedding.py
  - Imported `itertools`, `List`, `Tuple`.
  - Added the `_glm4v_get_input_positions_tensor` class method to calculate rotary positional embeddings specifically for GLM-4.1V, considering image and video grid structures.
  - Integrated the call to `_glm4v_get_input_positions_tensor` within `get_input_positions_tensor` for `glm4v` model types.
- vllm/model_executor/models/glm4_1v.py
  - Added a new file implementing the `Glm4vForConditionalGeneration` model.
  - Defined input types for image and video pixel values and embeddings.
  - Implemented GLM-4.1V-specific vision encoder components (`Glm4vVisionMLP`, `Glm4vVisionAttention`, `Glm4vVisionPatchEmbed`, `Glm4vPatchMerger`, `Glm4vVisionEmbeddings`, `Glm4vVisionRotaryEmbedding`, `Glm4vVisionTransformer`).
  - Implemented multimodal processing logic (`Glm4vProcessingInfo`, `Glm4vDummyInputsBuilder`, `Glm4vMultiModalProcessor`), including image resizing, patch merging, and prompt updates.
  - Integrated the vision tower and language model backbone.
  - Added methods for parsing multimodal inputs, processing them into embeddings, and merging them with text embeddings.
  - Implemented weight loading logic specific to GLM-4.1V.
  - Registered the model and its processor with the vLLM multimodal registry.
- vllm/model_executor/models/registry.py
  - Added `Glm4vForConditionalGeneration` to the `VLLM_REGISTERED_MODELS` dictionary, linking it to the new `glm4_1v` module.
- vllm/multimodal/video.py
  - Imported `Dict` and `Tuple`.
  - Added `Glm4vVideoLoader`, registered under the 'glm4v' backend, using OpenCV to load video frames and metadata with support for uniform frame sampling.
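A minimal sketch of the placeholder behavior described in the `chat_utils.py` entry above (the function name and signature here are simplified illustrations, not the actual `_placeholder_str` code; the token strings are the ones listed in the changelog):

```python
def glm4v_placeholder(modality: str) -> str:
    """Return the GLM-4.1V prompt placeholder for one multimodal item."""
    if modality == "image":
        return "<|begin_of_image|><|image|><|end_of_image|>"
    if modality == "video":
        return "<|begin_of_video|><|video|><|end_of_video|>"
    raise ValueError(f"Unsupported modality: {modality}")
```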
Code Review
This pull request introduces support for the GLM-4.1V model, which is a valuable addition. The implementation involves adding a new model file, updating the rotary embedding logic for multimodal inputs, and integrating the model into the examples and registry. Overall, the structure of the changes looks good, following the pattern for adding new multimodal models. However, I've identified a few areas, particularly in the multimodal position embedding and video handling, that require attention to ensure correctness and consistency.
Summary of Findings
- Incorrect handling of video position IDs and grid dimensions: The logic for calculating 3D position IDs for video tokens in `_glm4v_get_input_positions_tensor` appears to incorrectly use image grid dimensions for the video height and width, and the temporal counter logic seems flawed for videos with multiple tokens.
- Inconsistent video metadata: The `_parse_and_validate_video_input` function hardcodes the FPS and video backend in the metadata, which is inconsistent with the `Glm4vVideoLoader` that samples these values from the video file.
- Type mismatch in the vision embeddings forward pass: The `Glm4vVisionTransformer.forward` method passes batch sequence lengths (`seqlens`) to `Glm4vVisionEmbeddings.forward`, which expects sequence lengths per image item (`lengths`).
- Potential issue in v0 multimodal embedding merging: The v0 compatibility method `get_input_embeddings_v0` merges image and video embeddings sequentially, which might be incorrect for interleaved multimodal inputs compared to the simultaneous merging approach in the v1 method.
- Strict assertion in rotary embedding: An assertion in `_glm4v_get_input_positions_tensor` requires text segments to have a length greater than 0, which might be too strict.
- Hardcoded FPS in example script: The example script hardcodes FPS in the `mm_processor_kwargs`, which might conflict with the model's actual video processing logic.
- Misleading `limit_mm_per_prompt` in example script: The example script sets an image-specific limit for a function designed to handle both image and video.
Merge Readiness
This pull request introduces support for a new model and includes significant changes to the multimodal infrastructure. While the overall structure is good, there are several high-severity issues related to the correctness of multimodal position encoding and video metadata handling that need to be addressed before merging. I am unable to approve this pull request, and others should review and approve this code before merging.
    video_frame_num,
    image_grid_thw[mm_data_idx][1],
    image_grid_thw[mm_data_idx][2],
)
In the video processing block, the temporal dimension `t` is set to `video_frame_num` (line 1219), and the height `h` and width `w` are taken from `image_grid_thw` (lines 1220-1221). This seems incorrect for video: the height and width should likely come from `video_grid_thw` instead of `image_grid_thw`.
t, h, w = (
video_frame_num,
video_grid_thw[mm_data_idx][1],
video_grid_thw[mm_data_idx][2],
)
torch.stack([t_index, h_index, w_index]) + st_idx)

mm_data_idx += 1
video_frame_num += 1
The `video_frame_num` counter is incremented by 1 after processing a video block. If a single video item consists of multiple video tokens (e.g. representing different frames or chunks), this logic might be incorrect. The temporal dimension `t` for a video item should likely be derived from the video's properties (such as its number of frames or temporal patches) rather than from a simple counter that increments per video block.
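If that reading is correct, one possible direction looks like the sketch below (an assumption about the intended fix, not code from the PR; the grid values are illustrative):

```python
import torch

# Assumes each row of video_grid_thw is laid out as (t, h, w), mirroring image_grid_thw.
video_grid_thw = torch.tensor([[8, 24, 24]])  # one video: 8 temporal x 24 x 24 patches
mm_data_idx = 0

# Take the temporal extent from the grid itself instead of a per-block counter.
t, h, w = (
    video_grid_thw[mm_data_idx][0],
    video_grid_thw[mm_data_idx][1],
    video_grid_thw[mm_data_idx][2],
)
```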
mrope_position_delta = (llm_positions.max() + 1 -
                        len(input_tokens)).item()
The calculation of `mrope_position_delta` as `(llm_positions.max() + 1 - len(input_tokens)).item()` seems specific to how GLM4V handles positions. Could you confirm that this delta calculation is correct and aligns with the model's expected positional encoding behavior, especially considering the 3D position IDs?
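For context, a small illustration of what that expression computes (an assumption about the intent behind the delta, not taken from the PR; the position values are made up):

```python
import torch

# llm_positions holds 3 rows (t, h, w) of positions for the N prompt tokens.
llm_positions = torch.tensor([
    [0, 1, 2, 2, 2, 3],
    [0, 1, 2, 2, 2, 3],
    [0, 1, 2, 3, 4, 5],
])
num_prompt_tokens = llm_positions.shape[1]

# Decode-time tokens get position (token_index + delta), so generation continues
# from max(position) + 1 rather than from the raw token index.
mrope_position_delta = (llm_positions.max() + 1 - num_prompt_tokens).item()
print(mrope_position_delta)  # 0 here; negative when visual tokens share positions
```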
x = self.embeddings(x, seqlens, grid_thw, image_type_ids[:, 0],
                    image_type_ids[:, 1])
In `Glm4vVisionTransformer.forward`, `self.embeddings` is called with `seqlens` as the second argument (`lengths`). However, `Glm4vVisionEmbeddings.adapt_position_encoding` expects `lengths` to be a list or tensor of sequence lengths for each image (`[batch_size]`), whereas `seqlens` is a list of sequence lengths for the batch of tokens used in attention, derived from `cu_seqlens`. This looks like a type mismatch and a potential correctness issue. The `lengths` argument for position encoding should likely be derived from `grid_thw` (which has shape `(num_images, 3)`).
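A minimal sketch of the direction this comment suggests (illustrative values; whether any patch-merge divisor applies on top of this is an assumption left out here):

```python
import torch

# grid_thw has shape (num_images, 3) with rows (t, h, w); values are illustrative.
grid_thw = torch.tensor([
    [1, 24, 24],
    [1, 16, 32],
])

# One length per image/video item, as the position-encoding adaptation expects,
# instead of the attention-window seqlens derived from cu_seqlens.
lengths = grid_thw.prod(dim=-1)  # tensor([576, 512])
```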
fps=2.0,
duration=num_frames / 2.0,
total_num_frames=num_frames,
video_backend="pyav",
))
In `_parse_and_validate_video_input`, the `video_metadata` is constructed by hardcoding `fps=2.0` and `video_backend="pyav"`. However, the `Glm4vVideoLoader` samples the actual FPS and duration from the video file. This hardcoding is inconsistent with the loader and might lead to incorrect metadata being passed when the actual video properties differ.
Do we really need to pass `VideoMetadata` to `Glm4vVideoPixelInputs`? It doesn't seem to be used in the model's forward pass.
"max_pixels": 1280 * 28 * 28, | ||
"fps": 1, | ||
}, | ||
limit_mm_per_prompt={"image": 1}, |
The `limit_mm_per_prompt` is set to `{"image": 1}` here. However, the `run_glm4_1v` function is designed to handle both `image` and `video` modalities, so if video input is used with this function, this limit might be incorrect or misleading. Should this limit be configurable based on the modality used, or should it allow both image and video if the model supports both?
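One possible shape for this, as a sketch (the helper name and the assumption that the example receives a `modality` argument are illustrative; the model name comes from the registry entry in this PR):

```python
from vllm import EngineArgs

def build_glm4_1v_engine_args(modality: str) -> EngineArgs:
    # Limit the prompt to one item of whichever modality is being exercised,
    # instead of hardcoding {"image": 1} for both paths.
    assert modality in ("image", "video")
    return EngineArgs(
        model="THUDM/GLM-4.1V-9B",
        limit_mm_per_prompt={modality: 1},
    )
```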
Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
I noticed the part where the video is read in. The existing video reading uses OpenCV but does not return the frame rate. Our model definitely needs FPS as a sampling parameter, so I implemented `Glm4vVideoLoader` separately. I don't know whether this is a good approach, or whether you will modify `OpenCVVideoBackend` in future updates to return the other information transformers needs; that information is covered by `from transformers.video_utils import VideoMetadata`, which is implemented in the latest transformers source code.
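For reference, a minimal standalone sketch of reading FPS and duration with OpenCV alongside uniformly sampled frames (an illustration of the approach, not the actual `Glm4vVideoLoader` code; the function name is made up):

```python
import cv2
import numpy as np

def load_video_with_metadata(path: str, num_frames: int):
    """Uniformly sample frames and return them with basic metadata."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    duration = total / fps if fps else 0.0

    # Pick evenly spaced frame indices across the whole clip.
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()

    metadata = {"fps": fps, "duration": duration, "total_num_frames": total}
    return np.stack(frames), metadata
```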
Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
Added some initial comments, PTAL :)
Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
vllm/multimodal/parse.py (Outdated)
@@ -427,6 +427,9 @@ def _parse_video_data(
    if self._is_embeddings(data):
        return VideoEmbeddingItems(data)

    if isinstance(data, tuple) and len(data) == 2:
        frames, metadata = data
        data = frames
The metadata should be passed to `VideoProcessorItems` as well so it can be used by the HF processor (preferably via `get_processor_data`).
Update like this?
Yes, but make sure the metadata is passed to the HF processor correctly as well.
For now it is working; I checked that the HF processor receives it correctly. There is one issue, though: this feature is only supported in the current transformers source code, which means it will only ship in the next transformers release. I contacted their team, and they said the feature is still being perfected. I also know that this class may not be supported by some older models.
GLM-4.1V will also be supported in the next transformers release, and the transformers team is expected to land the processor improvements this week.
I see. Perhaps we should add a transformers version check to `get_processor_data` to maintain back-compatibility then.
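A sketch of such a guard (the version threshold and helper name are placeholders, since the release that ships the new interface was not yet known at the time of this discussion):

```python
from packaging import version
import transformers

# Placeholder threshold (an assumption): the first transformers release that accepts
# video metadata in the processor inputs. Replace once the real release is published.
_MIN_TRANSFORMERS_WITH_VIDEO_METADATA = "4.53.0"

def supports_video_metadata() -> bool:
    """Return True when the installed transformers can accept video metadata."""
    return version.parse(transformers.__version__) >= version.parse(
        _MIN_TRANSFORMERS_WITH_VIDEO_METADATA)
```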
Yes, but I happened to hit their update mid-implementation. The GLM-4.1V adaptation targets the new transformers interface directly, so I am not sure what impact this improvement has on video-understanding models under earlier transformers versions such as 4.52 and before.
Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
I suggest merging this PR after the transformers team has confirmed their merge (expected this week), to ensure that the main branch of transformers is stable.
Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
In the latest version of transformers, images and videos are processed separately, with dedicated image processors and video processors. Is there any related implementation in vLLM for reference? In the current implementation, I've noticed that the video processor is never called. The GLM-4.1V model processes videos by converting them into multiple image frames after sampling, but I've found that only one frame is passed to vLLM each time, while transformers works normally. Do you know which files contain the video processing logic?
We use the parent processor that is constructed from …
I observed that the HF video processor was called, but in the return from the HF processor (`return BatchFeature(data=data, tensor_type=return_tensors)`) both …

But in vLLM, the output is printed as …
Are the inputs to …
Same, I printed …
The return result is the same as HF's.
Can you try to disable the multimodal preprocessing cache and see if the results are still different? Perhaps HF does some padding based on the other inputs in the same batch.
How do I do that? Also, the batch size is currently set to 1. GLM's video grid is different from Qwen's: it is a list containing multiple image grids.
You can set …
Still diff.
This PR aims to add support for the GLM-4.1V model.
Due to the upgraded implementation in the transformers library, some interfaces have changed.
This model needs to sample the FPS and video duration from the input video.
The related transformers PR is here.