OpenAI compatible server: tokenizer arg causes issues with pooling resources amongst models #7815
-
Hi, I am also interested in managing multiple LLM models using the Triton-vLLM Docker image. This can already be done for non-LLM models with the nvcr.io/nvidia/tritonserver:<yy.mm>-py3 series of Docker images, where models are loaded and unloaded with requests like the sketch below. Did you find any solution for this? Any reply from the Triton team?
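For reference, a minimal sketch of those load/unload requests against Triton's model repository API, assuming the server runs in explicit model-control mode (`--model-control-mode=explicit`) on localhost:8000; the host, port, and model name are placeholders:

```python
import requests

TRITON = "http://localhost:8000"
MODEL = "my_model"  # placeholder model name

# Load (or reload) the model from the model repository
requests.post(f"{TRITON}/v2/repository/models/{MODEL}/load").raise_for_status()

# Unload it later to free GPU memory for another model
requests.post(f"{TRITON}/v2/repository/models/{MODEL}/unload").raise_for_status()
```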
-
@rmccorm4 - is the tokenizer optional or required for vLLM models?
-
I had no reply from the Triton team. We're still using vLLM without Triton in production for our use cases, but I'm hopeful we can switch over in the long term if they prioritize things like this.
-
In general, we might stick with vLLM or TGI from HuggingFace. Triton doesn't seem to be a priority for Nvidia, and it doesn't have the community backing that vLLM has.
-
Hi @thealmightygrant, thanks for sharing the feedback. Currently, the tokenizer arg on the OpenAI frontend is only used for applying the chat template. Assuming you want to use chat completions, then yes, you are spot on: there is only support for a single --tokenizer arg at startup time, which is not ideal for serving multiple models concurrently. I agree we can improve here to support a mapping between models and tokenizers rather than one static tokenizer (roughly like the sketch below), and further improve by simplifying/automating detection of the tokenizer from the respective model config when available. CC @richardhuo-nv for viz - this may be similar to your recent work on the Triton CLI attempting something similar, but adding the detection logic into the OpenAI frontend directly, which should improve the UX. Also @thealmightygrant, if you're interested in contributing any improvements, please let us know.
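To make the idea concrete, here is a rough, hypothetical sketch of a per-model tokenizer mapping for chat-template rendering; TOKENIZER_MAP, resolve_tokenizer, and the model/tokenizer names below are illustrative and not part of the current OpenAI frontend:

```python
from functools import lru_cache

from transformers import AutoTokenizer

# Illustrative mapping from served model name -> HF tokenizer id
TOKENIZER_MAP = {
    "llama3_vllm": "meta-llama/Meta-Llama-3-8B-Instruct",
    "mistral_vllm": "mistralai/Mistral-7B-Instruct-v0.3",
}

@lru_cache(maxsize=None)
def resolve_tokenizer(model_name: str):
    """Look up (and cache) the tokenizer for a given served model."""
    return AutoTokenizer.from_pretrained(TOKENIZER_MAP[model_name])

def render_chat_prompt(model_name: str, messages: list[dict]) -> str:
    """Apply the model-specific chat template to an OpenAI-style message list."""
    tokenizer = resolve_tokenizer(model_name)
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
```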
-
Hi y'all, I was testing out the OpenAI-compatible server, and I saw that we can set the tokenizer directly in the vllm_backend via model.json (rough sketch below). Could we remove it as a required argument for the OpenAI server?
My thought here is that this opens up serving multiple models from the same server, even if those models use different tokenizers.
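For context, here is roughly the model.json I mean; the field names map to vLLM's AsyncEngineArgs (which accept a tokenizer field), and the model IDs and paths below are placeholders:

```python
import json

# Sketch of a per-model model.json for the vllm_backend; values are placeholders.
engine_args = {
    "model": "mistralai/Mistral-7B-Instruct-v0.3",      # weights to serve
    "tokenizer": "mistralai/Mistral-7B-Instruct-v0.3",  # tokenizer pinned to this model
    "gpu_memory_utilization": 0.4,                      # leave headroom for other models
    "disable_log_requests": True,
}

# model.json sits in the model's version directory of the Triton model repository
with open("model_repository/mistral_vllm/1/model.json", "w") as f:
    json.dump(engine_args, f, indent=2)
```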