OpenAI compatible server: tokenizer arg causes issues with pooling resources amongst models #7815
-
Hi, I am also interested in managing multiple LLM models using the Triton-vLLM Docker image. This can already be done for non-LLM models with the nvcr.io/nvidia/tritonserver:<yy.mm>-py3 series of Docker images, where models are loaded and unloaded with requests like the sketch below. Did you find any solution for this? Any reply from the Triton team?
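For reference, a minimal sketch of those load/unload requests against Triton's model repository API, assuming the server runs in explicit model-control mode (`--model-control-mode=explicit`) on localhost:8000; the host, port, and model name are placeholders:

```python
import requests

TRITON = "http://localhost:8000"
MODEL = "my_model"  # placeholder model name

# Load (or reload) the model from the model repository
requests.post(f"{TRITON}/v2/repository/models/{MODEL}/load").raise_for_status()

# Unload it later to free GPU memory for another model
requests.post(f"{TRITON}/v2/repository/models/{MODEL}/unload").raise_for_status()
```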
-
@rmccorm4 - is the tokenizer optional or required for vLLM models?
-
I had no reply from the Triton team. We're still using vLLM without Triton in production for our use cases, but I'm hopeful we can switch over in the long term if they prioritize things like this.
-
In general, we might stick with vLLM or TGI from HuggingFace. Triton doesn't seem to be a priority for Nvidia, and it doesn't have the community backing that vLLM has.
-
Hi @thealmightygrant, thanks for sharing the feedback. Currently, the tokenizer arg on the OpenAI frontend is only used for applying the chat template. Assuming you want to use chat completions, then yes, you are spot on: there is only support for a single --tokenizer arg at startup time, which is not ideal for serving multiple models concurrently. I agree we can improve here to support a mapping between models and tokenizers rather than one static tokenizer (roughly like the sketch below), and further improve by simplifying/automating detection of the tokenizer from the respective model config when available. CC @richardhuo-nv for viz - this may be similar to your recent work on the Triton CLI attempting something similar, but adding the detection logic into the OpenAI frontend directly, which should improve the UX. Also @thealmightygrant, if you're interested in contributing any improvements, please let us know.
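To make the idea concrete, here is a rough, hypothetical sketch of a per-model tokenizer mapping for chat-template rendering; TOKENIZER_MAP, resolve_tokenizer, and the model/tokenizer names below are illustrative and not part of the current OpenAI frontend:

```python
from functools import lru_cache

from transformers import AutoTokenizer

# Illustrative mapping from served model name -> HF tokenizer id
TOKENIZER_MAP = {
    "llama3_vllm": "meta-llama/Meta-Llama-3-8B-Instruct",
    "mistral_vllm": "mistralai/Mistral-7B-Instruct-v0.3",
}

@lru_cache(maxsize=None)
def resolve_tokenizer(model_name: str):
    """Look up (and cache) the tokenizer for a given served model."""
    return AutoTokenizer.from_pretrained(TOKENIZER_MAP[model_name])

def render_chat_prompt(model_name: str, messages: list[dict]) -> str:
    """Apply the model-specific chat template to an OpenAI-style message list."""
    tokenizer = resolve_tokenizer(model_name)
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
```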
-
Hi y'all, I was testing out the OpenAI-compatible server, and I saw that we can set the tokenizer directly in the vllm_backend via model.json (rough sketch below). Could we remove it as a required argument for the OpenAI server?
My thought here is that this opens up serving multiple models from the same server, even if those models use different tokenizers.
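For context, here is roughly the model.json I mean; the field names map to vLLM's AsyncEngineArgs (which accept a tokenizer field), and the model IDs and paths below are placeholders:

```python
import json

# Sketch of a per-model model.json for the vllm_backend; values are placeholders.
engine_args = {
    "model": "mistralai/Mistral-7B-Instruct-v0.3",      # weights to serve
    "tokenizer": "mistralai/Mistral-7B-Instruct-v0.3",  # tokenizer pinned to this model
    "gpu_memory_utilization": 0.4,                      # leave headroom for other models
    "disable_log_requests": True,
}

# model.json sits in the model's version directory of the Triton model repository
with open("model_repository/mistral_vllm/1/model.json", "w") as f:
    json.dump(engine_args, f, indent=2)
```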