Description
System Info
xinference v0.13.2
The vLLM backend does not support batched inference: sending OpenAI-style batched prompts returns a 500 error.
Why not follow the implementation in
https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/serving_completion.py
https://github.com/vllm-project/vllm/blob/461089a21a5b00d6c6712e3bf371ce2d9cfa0860/vllm/entrypoints/openai/serving_completion.py#L110
to support batching? A short sketch of the idea follows.
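For illustration only, here is a minimal sketch (not the actual xinference or vLLM code) of how an OpenAI-compatible completions handler could fan a list of prompts out to the backend and merge the results into index-ordered choices; generate_one is a hypothetical placeholder for the backend's single-prompt generation call:

import asyncio
from typing import List, Union

async def generate_one(prompt: str, max_tokens: int) -> str:
    # Hypothetical placeholder: a real implementation would call the
    # backend engine (e.g. vLLM's async engine) and return its output text.
    return " ..."

async def create_completion(prompt: Union[str, List[str]], max_tokens: int) -> dict:
    # The OpenAI completions API accepts either a single prompt or a list of prompts.
    prompts = [prompt] if isinstance(prompt, str) else prompt

    # Fan out one generation per prompt and wait for all of them.
    texts = await asyncio.gather(*(generate_one(p, max_tokens) for p in prompts))

    # Each choice carries the index of the prompt it answers, so the
    # client can match completions back to prompts by index.
    return {
        "object": "text_completion",
        "choices": [
            {"index": i, "text": text, "finish_reason": "length"}
            for i, text in enumerate(texts)
        ],
    }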
Running Xinference with Docker?
- docker
- pip install
- installation from source
Version info
xinference v0.13.2
The command used to start Xinference
/opt/conda/bin/python /opt/conda/bin/xinference-worker --metrics-exporter-port 9998 -e http://10.6.208.95:9997 -H 10.6.208.95
Reproduction
from openai import OpenAI

client = OpenAI()

num_stories = 10
prompts = ["Once upon a time,"] * num_stories

# batched example, with 10 story completions per request
response = client.completions.create(
    model="qwen2",
    prompt=prompts,
    max_tokens=20,
)

# match completions to prompts by index
stories = [""] * len(prompts)
for choice in response.choices:
    stories[choice.index] = prompts[choice.index] + choice.text

# print stories
for story in stories:
    print(story)

Expected behavior
Batched prompts should be accepted and return one choice per prompt, as described at https://platform.openai.com/docs/guides/rate-limits/error-mitigation.
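Until batched prompts are supported by the vLLM backend, one possible client-side workaround (a sketch only; the endpoint URL is taken from the startup command above and may differ in other deployments) is to send one request per prompt:

from openai import OpenAI

# Point the client at the xinference OpenAI-compatible endpoint
# (host/port taken from the startup command above; adjust as needed).
client = OpenAI(base_url="http://10.6.208.95:9997/v1", api_key="not-needed")

prompts = ["Once upon a time,"] * 10
stories = []
for prompt in prompts:
    # One prompt per request avoids the 500 error on batched prompts.
    response = client.completions.create(
        model="qwen2",
        prompt=prompt,
        max_tokens=20,
    )
    stories.append(prompt + response.choices[0].text)

for story in stories:
    print(story)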