Description
Anything you want to discuss about vllm.
I have a custom model that uses something very similar to MoE, but instead of the routing being determined by the model itself, the routing is determined by which part of the sequence the tokens come from.
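Roughly, the layer looks like this (a simplified sketch to show the idea, not my actual implementation; names and shapes are illustrative):

```python
import torch
import torch.nn as nn


class PositionRoutedMoE(nn.Module):
    """MoE-like layer where each token's expert is picked from a
    precomputed, position-dependent routing map instead of a learned router."""

    def __init__(self, hidden_size: int, num_experts: int):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(hidden_size, hidden_size) for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor, routing_map: torch.Tensor) -> torch.Tensor:
        # x: [num_tokens, hidden_size]
        # routing_map: [num_tokens], expert index per token, derived from the
        # token's position in its sequence (not from a router network).
        out = torch.empty_like(x)
        for expert_id, expert in enumerate(self.experts):
            mask = routing_map == expert_id
            if mask.any():
                out[mask] = expert(x[mask])
        return out
```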
I would now like to run inference on my model with vLLM, and because it is a non-standard model I assume I will need to modify some parts of vLLM to make it work. It would be great if someone more familiar with the project could give me feedback on my implementation thoughts and point me to the right places in the code base.
- I would really like batching support where each batch item can have a different routing map (see the sketch after this list).
- I will most likely need a custom sampler that can determine the routing pattern / set the next expert based on the sequence.
- I assume I would need to capture a separate CUDA graph for each routing pattern / combination, which sounds like a lot. How is this handled for other MoE models? Most likely I want to disable this at first (see below).
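To make the first point concrete: as far as I understand, vLLM flattens the tokens of all sequences in a batch into one dimension, so I imagine building a per-token expert index for that flattened batch, something like the following (`build_flat_routing_map` and `routing_fn` are made-up names, just for illustration):

```python
import torch


def build_flat_routing_map(seq_lens, routing_fn, device="cuda"):
    """Build a per-token expert index for a flattened batch.

    seq_lens: sequence length of each request in the batch.
    routing_fn: maps (request index, position in sequence) -> expert id,
                so each request can carry its own routing pattern.
    """
    expert_ids = [
        routing_fn(req_idx, pos)
        for req_idx, seq_len in enumerate(seq_lens)
        for pos in range(seq_len)
    ]
    return torch.tensor(expert_ids, dtype=torch.long, device=device)


def routing_fn(req_idx: int, pos: int) -> int:
    # Illustrative rule: request 0 sends the first half of its tokens to
    # expert 0 and the rest to expert 1; request 1 uses expert 2 throughout.
    if req_idx == 0:
        return 0 if pos < 4 else 1
    return 2


flat_map = build_flat_routing_map(seq_lens=[8, 5], routing_fn=routing_fn, device="cpu")
print(flat_map)  # tensor([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2])
```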
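On the CUDA graph point: from what I can tell, existing MoE models compute routing inside the fused MoE kernel at runtime, so the token-to-expert dispatch is data-dependent and a single captured graph should cover all routing patterns; if that is right, per-pattern graphs may not be needed at all. Either way, for prototyping I could disable CUDA graph capture entirely via `enforce_eager` (this flag is standard vLLM API; the model path is a placeholder):

```python
from vllm import LLM

# enforce_eager=True skips CUDA graph capture, so no per-routing-pattern
# graphs are needed while prototyping (model path is a placeholder).
llm = LLM(model="path/to/my-custom-model", enforce_eager=True)
```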
Thank you in advance for your help, and best regards,
XMaster96