
Add GLM4.1V model (Draft) #19331


Open
wants to merge 15 commits into base: main

Conversation

zRzRzRzRzRzRzR
Contributor

@zRzRzRzRzRzRzR zRzRzRzRzRzRzR commented Jun 8, 2025

This PR adds support for the GLM-4.1V model.
Because the implementation in the transformers library has been upgraded, some interfaces have changed.
This model requires sampling the FPS and video duration.
The related transformers PR is here.


github-actions bot commented Jun 8, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Hello @zRzRzRzRzRzRzR, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello team, gemini-code-assist here with a summary of this pull request. This PR introduces support for the GLM-4.1V multimodal model. The primary goal is to integrate this new model into vLLM, enabling inference with both image and video inputs. This required adapting the model architecture, multimodal processing logic, and specific components like positional embeddings from the upstream transformers library implementation. Key changes include adding the model definition, implementing its vision tower and multimodal processing, adding a dedicated video loader with frame sampling capabilities, and updating core vLLM components like rotary embeddings and chat utilities to handle the model's specific requirements and token formats.

Highlights

  • New Model Support: Adds the Glm4vForConditionalGeneration model to support the GLM-4.1V multimodal model.
  • Multimodal Processing: Implements the GLM-4.1V specific vision transformer, patch merger, and multimodal processing logic to handle image and video inputs.
  • Video Loading: Introduces a dedicated video loader (Glm4vVideoLoader) for GLM-4.1V that handles frame sampling and metadata extraction.
  • Positional Encoding: Adds GLM-4.1V specific logic to the rotary embedding layer to correctly calculate positions for multimodal inputs.
  • Chat Utilities: Updates chat utility functions to recognize and handle GLM-4.1V's unique image and video placeholder tokens.
  • Examples and Registry: Adds an example for offline inference with GLM-4.1V and registers the new model within vLLM.

Changelog

Click here to see the changelog
  • examples/offline_inference/vision_language.py
    • Added run_glm4_1v function to demonstrate offline inference for GLM-4.1V, including setting up engine arguments and prompt formatting for image and video.
  • tests/models/registry.py
    • Added Glm4vForConditionalGeneration to the check_available_online registry, mapping it to the Hugging Face model name 'THUDM/GLM-4.1V-9B'.
  • vllm/entrypoints/chat_utils.py
    • Added handling for glm4v model type in _placeholder_str to use the correct image and video placeholder tokens (<|begin_of_image|><|image|><|end_of_image|> and <|begin_of_video|><|video|><|end_of_video|>); see the sketch after this changelog.
  • vllm/model_executor/layers/rotary_embedding.py
    • Imported itertools, List, Tuple.
    • Added _glm4v_get_input_positions_tensor class method to calculate rotary positional embeddings specifically for GLM-4.1V, considering image and video grid structures.
    • Integrated the call to _glm4v_get_input_positions_tensor within get_input_positions_tensor for glm4v model types.
  • vllm/model_executor/models/glm4_1v.py
    • Added new file implementing the Glm4vForConditionalGeneration model.
    • Defined input types for image and video pixel values and embeddings.
    • Implemented GLM-4.1V specific vision encoder components (Glm4vVisionMLP, Glm4vVisionAttention, Glm4vVisionPatchEmbed, Glm4vPatchMerger, Glm4vVisionEmbeddings, Glm4vVisionRotaryEmbedding, Glm4vVisionTransformer).
    • Implemented multimodal processing logic (Glm4vProcessingInfo, Glm4vDummyInputsBuilder, Glm4vMultiModalProcessor) including handling image resizing, patch merging, and prompt updates.
    • Integrated the vision tower and language model backbone.
    • Added methods for parsing multimodal inputs, processing them into embeddings, and merging them with text embeddings.
    • Implemented weight loading logic specific to GLM-4.1V.
    • Registered the model and its processor with the vLLM multimodal registry.
  • vllm/model_executor/models/registry.py
    • Added Glm4vForConditionalGeneration to the VLLM_REGISTERED_MODELS dictionary, linking it to the new glm4_1v module.
  • vllm/multimodal/video.py
    • Imported Dict and Tuple.
    • Added Glm4vVideoLoader registered under the 'glm4v' backend, using OpenCV to load video frames and metadata with support for uniform frame sampling.
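
As a rough illustration of the chat_utils change listed above, here is a hedged sketch; only the placeholder token strings come from this PR, and the function name and structure are simplified for illustration.

    # Hedged sketch of the glm4v branch added to _placeholder_str in
    # vllm/entrypoints/chat_utils.py; only the token strings are from this PR.
    def _glm4v_placeholder_str(modality: str) -> str:
        if modality == "image":
            return "<|begin_of_image|><|image|><|end_of_image|>"
        if modality == "video":
            return "<|begin_of_video|><|video|><|end_of_video|>"
        raise ValueError(f"Unsupported modality for glm4v: {modality}")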

@mergify mergify bot added the documentation (Improvements or additions to documentation), frontend, and multi-modality (Related to multi-modality (#4194)) labels on Jun 8, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for the GLM-4.1V model, which is a valuable addition. The implementation involves adding a new model file, updating the rotary embedding logic for multimodal inputs, and integrating the model into the examples and registry. Overall, the structure of the changes looks good, following the pattern for adding new multimodal models. However, I've identified a few areas, particularly in the multimodal position embedding and video handling, that require attention to ensure correctness and consistency.

Summary of Findings

  • Incorrect handling of video position IDs and grid dimensions: The logic for calculating 3D position IDs for video tokens in _glm4v_get_input_positions_tensor appears to incorrectly use image grid dimensions for video height and width, and the temporal counter logic seems flawed for videos with multiple tokens.
  • Inconsistent video metadata: The _parse_and_validate_video_input function hardcodes FPS and video backend in the metadata, which is inconsistent with the Glm4vVideoLoader that samples these values from the video file.
  • Type mismatch in vision embeddings forward pass: The Glm4vVisionTransformer.forward method passes batch sequence lengths (seqlens) to Glm4vVisionEmbeddings.forward, which expects sequence lengths per image item (lengths).
  • Potential issue in v0 multimodal embedding merging: The v0 compatibility method get_input_embeddings_v0 merges image and video embeddings sequentially, which might be incorrect for interleaved multimodal inputs compared to the simultaneous merging approach in the v1 method.
  • Strict assertion in rotary embedding: An assertion in _glm4v_get_input_positions_tensor requires text segments to have a length greater than 0, which might be too strict.
  • Hardcoded FPS in example script: The example script hardcodes FPS in the mm_processor_kwargs, which might conflict with the model's actual video processing logic.
  • Misleading limit_mm_per_prompt in example script: The example script sets an image-specific limit for a function designed to handle both image and video.

Merge Readiness

This pull request introduces support for a new model and includes significant changes to the multimodal infrastructure. While the overall structure is good, there are several high-severity issues related to the correctness of multimodal position encoding and video metadata handling that need to be addressed before merging. I am unable to approve this pull request, and others should review and approve this code before merging.

Comment on lines +1219 to +1222
video_frame_num,
image_grid_thw[mm_data_idx][1],
image_grid_thw[mm_data_idx][2],
)
Contributor


high

In the video processing block, the temporal dimension t is set to video_frame_num (line 1219), and the height h and width w are taken from image_grid_thw (lines 1220-1221). This seems incorrect for video. The height and width should likely come from video_grid_thw instead of image_grid_thw.

                    t, h, w = (
                        video_frame_num,
                        video_grid_thw[mm_data_idx][1],
                        video_grid_thw[mm_data_idx][2],
                    )

torch.stack([t_index, h_index, w_index]) + st_idx)

mm_data_idx += 1
video_frame_num += 1
Contributor


high

The video_frame_num is incremented by 1 after processing a video block. If a single video item consists of multiple 'video' tokens (e.g., representing different frames or chunks), this logic might be incorrect. The temporal dimension t for a video item should likely be derived from the video's properties (like number of frames or temporal patches) rather than a simple counter that increments per video block.
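
A hedged illustration of the alternative this comment points toward, deriving the temporal extent from the video grid entry rather than a running counter (variable names follow the quoted diff; using index 0 as the temporal component is an assumption):

    # Illustrative only: take t, h, w from the video grid entry itself.
    t, h, w = (
        video_grid_thw[mm_data_idx][0],
        video_grid_thw[mm_data_idx][1],
        video_grid_thw[mm_data_idx][2],
    )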

Comment on lines +1257 to +1258
mrope_position_delta = (llm_positions.max() + 1 -
len(input_tokens)).item()
Contributor


high

The calculation of mrope_position_delta as (llm_positions.max() + 1 - len(input_tokens)).item() seems specific to how GLM4V handles positions. Could you confirm if this delta calculation is correct and aligns with the model's expected positional encoding behavior, especially considering the 3D position IDs?

Comment on lines +736 to +737
x = self.embeddings(x, seqlens, grid_thw, image_type_ids[:, 0],
image_type_ids[:, 1])
Contributor


high

In Glm4vVisionTransformer.forward, self.embeddings is called with seqlens as the second argument (lengths). However, Glm4vVisionEmbeddings.adapt_position_encoding expects lengths to be a list or tensor of sequence lengths for each image ([batch_size]). seqlens is a list of sequence lengths for the batch of tokens used in attention, which is derived from cu_seqlens. This seems like a type mismatch and potential correctness issue. The lengths argument for position encoding should likely be derived from grid_thw (which has shape (num_images, 3)).
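
A hedged sketch of what deriving per-image lengths from grid_thw could look like; the t * h * w product per row is an assumption based on the (num_images, 3) shape described above, not code from this PR:

    # Per-image token counts from grid_thw rows (t * h * w), as opposed to the
    # attention seqlens derived from cu_seqlens.
    image_lengths = (grid_thw[:, 0] * grid_thw[:, 1] * grid_thw[:, 2]).tolist()
    x = self.embeddings(x, image_lengths, grid_thw, image_type_ids[:, 0],
                        image_type_ids[:, 1])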

Comment on lines +1195 to +1199
fps=2.0,
duration=num_frames / 2.0,
total_num_frames=num_frames,
video_backend="pyav",
))
Contributor


high

In _parse_and_validate_video_input, the video_metadata is constructed by hardcoding fps=2.0 and video_backend="pyav". However, the Glm4vVideoLoader samples the actual FPS and duration from the video file. This hardcoding seems inconsistent with the loader and might lead to incorrect metadata being passed if the actual video properties differ.
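
One possible direction, sketched only as an illustration (sampled_fps and total_num_frames are placeholder names, not variables from this PR): populate the metadata from values the loader actually sampled instead of constants.

    # Illustrative only: build VideoMetadata from sampled values rather than
    # hardcoded constants.
    video_metadata = VideoMetadata(
        fps=sampled_fps,
        duration=total_num_frames / sampled_fps,
        total_num_frames=total_num_frames,
        video_backend="opencv",
    )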

Collaborator


Do we really need to pass VideoMetadata to Glm4vVideoPixelInputs? It doesn't seem to be used by the model's forward pass.

"max_pixels": 1280 * 28 * 28,
"fps": 1,
},
limit_mm_per_prompt={"image": 1},
Contributor


medium

The limit_mm_per_prompt is set to {"image": 1} here. However, the run_glm4_1v function is designed to handle both image and video modalities. If video input is used with this function, this limit might be incorrect or misleading. Should this limit be configurable based on the modality used, or should it allow for both image and video if the model supports both?
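
If both modalities should be accepted, a minimal hedged alternative would be:

    # Assumption: allow one item of either modality per prompt.
    limit_mm_per_prompt={"image": 1, "video": 1},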

Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
@zRzRzRzRzRzRzR
Contributor Author

Regarding video loading: the existing OpenCV-based reader does not return the frame rate, but our model requires FPS as a sampling parameter, so I implemented a separate Glm4vVideoLoader. I'm not sure whether this is a good approach, or whether you plan to modify OpenCVVideoBackend in future updates to expose the additional information that transformers needs, which is provided via from transformers.video_utils import VideoMetadata in the latest transformers source.
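
For context, a hedged sketch of the kind of OpenCV-based sampling with metadata described here; the function name and exact return values are illustrative rather than the Glm4vVideoLoader code in this PR, and the VideoMetadata fields mirror those shown in the earlier review comment.

    # Hedged sketch: uniform frame sampling with OpenCV plus FPS/duration metadata.
    import cv2
    import numpy as np
    from transformers.video_utils import VideoMetadata

    def sample_frames_with_metadata(path: str, num_frames: int):
        cap = cv2.VideoCapture(path)
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        fps = cap.get(cv2.CAP_PROP_FPS) or 1.0
        # Uniformly spaced frame indices across the whole clip.
        indices = np.linspace(0, max(total - 1, 0), num_frames, dtype=int)
        frames = []
        for idx in indices:
            cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
            ok, frame = cap.read()
            if ok:
                frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        cap.release()
        metadata = VideoMetadata(
            fps=fps,
            duration=total / fps,
            total_num_frames=total,
            video_backend="opencv",
        )
        return np.stack(frames), metadata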

Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
Collaborator

@Isotr0py Isotr0py left a comment


Added some initial comments, PTAL :)

Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
@@ -427,6 +427,9 @@ def _parse_video_data(
        if self._is_embeddings(data):
            return VideoEmbeddingItems(data)

        if isinstance(data, tuple) and len(data) == 2:
            frames, metadata = data
            data = frames
Member

@DarkLight1337 DarkLight1337 Jun 9, 2025


The metadata should be passed to VideoProcessorItems as well so it can be used by the HF processor (preferably via get_processor_data)

Contributor Author


Updated like this?

Member


Yes, but make sure the metadata is passed to the HF processor correctly as well.

Contributor Author

@zRzRzRzRzRzRzR zRzRzRzRzRzRzR Jun 9, 2025


For now it is working; I verified that the metadata is passed to the HF processor correctly. However, there is an issue: this feature is only supported in the current transformers source code, meaning it will only ship in the next transformers release. I contacted the transformers team, and they said the feature is still being refined. I also know this class may not be supported by some older models.

Contributor Author


GLM-4.1V will also be supported in the next transformers release, and the transformers team is expected to land the processor improvements this week.

Member


I see. Perhaps we should add a transformers version check to get_processor_data to maintain back-compatibility then.

Contributor Author


Yes, but I happened to hit their update mid-implementation. The GLM-4.1V adaptation targets the new transformers interface directly, so I'm not sure how this change affects video-understanding models on earlier transformers versions such as 4.52 and below.

Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
@zRzRzRzRzRzRzR
Contributor Author

zRzRzRzRzRzRzR commented Jun 9, 2025

I suggest merging this PR after the transformers team has confirmed their merge, to ensure that the main branch of transformers is stable (expected this week).

@zRzRzRzRzRzRzR
Contributor Author

In the latest version of transformers, images and videos are processed separately, with dedicated image and video processors. Is there a related implementation in vLLM I can use as a reference?

In the current implementation, I've noticed that the video processor is never called. The GLM-4.1V model processes videos by sampling them and converting them into multiple image frames. However, only one frame is passed to vLLM each time, while transformers works normally. Do you know which files contain the video processing logic?

@DarkLight1337
Member

DarkLight1337 commented Jun 23, 2025

We use the parent processor that is constructed from AutoProcessor in vLLM. So if only the image processor is being called, I suggest inspecting the inputs to AutoProcessor.__call__ to figure out why.
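
A hedged way to do that inspection inside the model's multimodal processor (the method name appears later in this thread; the exact signature here is an assumption and may differ across vLLM versions):

    # Hypothetical debugging aid: log what vLLM hands to the HF processor so it
    # can be diffed against a direct transformers call on the same video.
    def _call_hf_processor(self, prompt, mm_data, mm_kwargs):
        for key, value in mm_data.items():
            print(key, type(value).__name__, getattr(value, "shape", None))
        return super()._call_hf_processor(prompt, mm_data, mm_kwargs)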

@zRzRzRzRzRzRzR
Contributor Author

zRzRzRzRzRzRzR commented Jun 23, 2025

I observed that the HF video processor was called, but in the return from the HF processor:

return BatchFeature(data=data, tensor_type=return_tensors)
__all__ = ["Glm4vVideoProcessor"]

both pixel_values_videos and video_grid_thw do not align with the kwargs expected in vLLM.
Below is the print output from transformers right before the return:

pixel_values_videos: tensor([[ 1.4632,  1.4632,  1.4632,  ...,  2.0890,  2.0890,  2.0890],
        [ 1.4778,  1.4778,  1.4778,  ...,  2.0606,  2.0606,  2.0606],
        [ 1.5070,  1.5070,  1.5070,  ...,  2.0890,  2.0890,  2.0890],
        ...,
        [ 0.6457,  0.6749,  0.6749,  ..., -0.6412, -0.6412, -0.6412],
        [ 0.6457,  0.6311,  0.6311,  ..., -0.7977, -0.7692, -0.7692],
        [ 0.7041,  0.7479,  0.7041,  ..., -0.6697, -0.6981, -0.7123]]) with shape torch.Size([19136, 1176])
        
video_grid_thw:[[1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46]]

But in vLLM, the output is printed as

tensor([[[ 1.4609,  1.4609,  1.4609,  ...,  2.0938,  2.0938,  2.0938],
         [ 1.4766,  1.4766,  1.4766,  ...,  2.0625,  2.0625,  2.0625],
         [ 1.5078,  1.5078,  1.5078,  ...,  2.0938,  2.0938,  2.0938],
         ...,
         [ 0.7930,  0.7930,  0.8359,  ...,  0.9648,  0.9648,  0.9531],
         [-0.1572,  0.0033,  0.2520,  ..., -0.4141, -0.3711, -0.4004],
         [ 0.3398,  0.3105,  0.2656,  ..., -0.5000, -0.5000, -0.3574]]],
        device='cuda:0', dtype=torch.bfloat16) with shape torch.Size([299, 1176])
tensor([[[ 1, 26, 46]]], device='cuda:0')

@DarkLight1337
Member

DarkLight1337 commented Jun 23, 2025

Are the inputs to __call__ the same?

@zRzRzRzRzRzRzR
Contributor Author

They are the same. I added prints in

class Glm4vMultiModalProcessor(BaseMultiModalProcessor[Glm4vProcessingInfo]):

    def _call_hf_processor(

and the returned result is the same as HF's.

@DarkLight1337
Member

Can you try disabling the multimodal preprocessing cache and see if the results are still different? Perhaps HF does some padding based on the other inputs in the same batch.

@zRzRzRzRzRzRzR
Contributor Author

How do I do that? Also, the batch size is currently set to 1. GLM's video grid differs from Qwen's: it is a list containing multiple image grids.
Additionally, I suspect the issue is not in glm4_1v.py itself, but in inherited functions that are not explicitly overridden in that file.

@DarkLight1337
Member

You can set --disable-mm-preprocessor-cache
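
For offline inference, the rough equivalent would be (assuming the CLI flag maps to an engine argument of the same name; verify against your vLLM version):

    from vllm import LLM

    # Assumption: the flag corresponds to this engine argument.
    llm = LLM(
        model="THUDM/GLM-4.1V-9B",
        disable_mm_preprocessor_cache=True,
    )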

@zRzRzRzRzRzRzR
Contributor Author

Still different.
