
vllm does not support batching inference #1925

@bstr9

Description


System Info

xinference v0.13.2
The bundled vLLM backend does not support batching inference: sending batched prompts through the OpenAI client returns a 500 error.

Why not follow the implementation in
https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/serving_completion.py
https://github.com/vllm-project/vllm/blob/461089a21a5b00d6c6712e3bf371ce2d9cfa0860/vllm/entrypoints/openai/serving_completion.py#L110
to support batching?
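
For reference, below is a minimal sketch (not Xinference's or vLLM's actual code; generate_one is a hypothetical single-prompt backend call) of the fan-out that serving_completion.py performs: accept prompt as either a string or a list of strings, generate one completion per prompt, and return a single response whose choices map back to prompts by index.

# Minimal sketch of server-side fan-out for batched prompts.
# `generate_one` is a hypothetical callable standing in for a single-prompt
# backend generation; the response layout mirrors the OpenAI completions API.
from typing import Callable, Dict, List, Union


def complete_batched(
    prompt: Union[str, List[str]],
    generate_one: Callable[[str], str],
) -> Dict:
    # The OpenAI completions API allows `prompt` to be a string or a list of strings.
    prompts = [prompt] if isinstance(prompt, str) else list(prompt)

    choices = []
    for i, p in enumerate(prompts):
        text = generate_one(p)  # one generation per prompt
        choices.append({"index": i, "text": text, "finish_reason": "stop"})

    # One response object covering all prompts, with choices indexed like OpenAI's API.
    return {"object": "text_completion", "choices": choices}


if __name__ == "__main__":
    # Toy backend so the sketch runs standalone.
    fake_backend = lambda p: p + " ... (generated)"
    resp = complete_batched(["Once upon a time,", "In a galaxy far away,"], fake_backend)
    for c in resp["choices"]:
        print(c["index"], c["text"])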

Running Xinference with Docker?

  • docker
  • pip install
  • installation from source

Version info

xinference v0.13.2

The command used to start Xinference

/opt/conda/bin/python /opt/conda/bin/xinference-worker --metrics-exporter-port 9998 -e http://10.6.208.95:9997 -H 10.6.208.95

Reproduction

from openai import OpenAI

# Point the client at the Xinference OpenAI-compatible endpoint
# (the supervisor address from the startup command above); any api_key works.
client = OpenAI(base_url="http://10.6.208.95:9997/v1", api_key="not-needed")
 
num_stories = 10
prompts = ["Once upon a time,"] * num_stories
 
# batched example, with 10 story completions per request
response = client.completions.create(
    model="qwen2",
    prompt=prompts,
    max_tokens=20,
)
 
# match completions to prompts by index
stories = [""] * len(prompts)
for choice in response.choices:
    stories[choice.index] = prompts[choice.index] + choice.text
 
# print stories
for story in stories:
    print(story)
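
Until batched prompts are supported server-side, a possible client-side workaround is to send one request per prompt and reassemble the results by position. This is a sketch; the base_url and api_key values are assumptions based on the worker command above.

# Workaround sketch: one request per prompt instead of a batched prompt list.
from openai import OpenAI

client = OpenAI(base_url="http://10.6.208.95:9997/v1", api_key="not-needed")

prompts = ["Once upon a time,"] * 10
stories = []
for p in prompts:
    # Single-prompt request; results stay aligned with `prompts` by position.
    r = client.completions.create(model="qwen2", prompt=p, max_tokens=20)
    stories.append(p + r.choices[0].text)

for story in stories:
    print(story)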

Expected behavior

As described at https://platform.openai.com/docs/guides/rate-limits/error-mitigation: a single completions request with a list of prompts should return one choice per prompt, matched by index.
