
How to enable batch inference with an embedding/rerank model? #1779

@SunLemuria

Description


Problem description

Embedding/rerank models launched from the UI expose no concurrency-related settings.
When the client sends requests via asyncio or concurrent.futures, it is actually slower than a synchronous for loop (the test pattern is sketched below).
How can I get the model to run inference concurrently?
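For reference, the client-side patterns being compared look roughly like this (a minimal sketch, assuming the xinference Python client; the server URL, worker count, and test texts are placeholders, and the model UID "bge-m3" is taken from the langchain snippet further down):

import time
import asyncio
from concurrent.futures import ThreadPoolExecutor

from xinference.client import Client

client = Client("http://localhost:9997")  # placeholder URL
model = client.get_model("bge-m3")
texts = ["an example sentence to embed"] * 100  # synthetic test data

# Synchronous baseline: one request at a time.
start = time.perf_counter()
for t in texts:
    model.create_embedding(t)
print("for loop:", time.perf_counter() - start)

# concurrent.futures variant: in my tests this comes out *slower* than the loop.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(model.create_embedding, texts))
print("thread pool:", time.perf_counter() - start)

# asyncio variant: the client is synchronous, so calls are pushed onto threads.
async def embed_all():
    return await asyncio.gather(
        *(asyncio.to_thread(model.create_embedding, t) for t in texts)
    )

start = time.perf_counter()
asyncio.run(embed_all())
print("asyncio:", time.perf_counter() - start)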

Models launched on the xinference side

embedding: (screenshot of the launch settings)
rerank: (screenshot of the launch settings)

Test results

Embedding API test: model.create_embedding(text)

Results:
(screenshot of timing results)
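Separately, since the endpoint is OpenAI-compatible, create_embedding may also accept a list of texts; if so, a single batched call avoids per-request overhead entirely. This is an assumption I have not verified here:

# One request carrying the whole batch instead of N single-text requests.
# Whether the deployed version accepts a list input is an assumption.
batch = ["first sentence", "second sentence", "third sentence"]
embeddings = model.create_embedding(batch)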

Rerank API test: model.rerank(corpus, query)

(screenshot of timing results)
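For completeness, the rerank timing pattern (a sketch; the server URL and the model UID "rerank-model-uid" are placeholders for the values shown in the screenshots, and the data is synthetic):

import time
from concurrent.futures import ThreadPoolExecutor

from xinference.client import Client

client = Client("http://localhost:9997")  # placeholder URL
rerank_model = client.get_model("rerank-model-uid")  # placeholder UID

queries = ["short query"] * 20
corpus = ["a long document " * 200] * 50  # short query, long corpus, as in the app

# Synchronous loop over queries.
start = time.perf_counter()
for q in queries:
    rerank_model.rerank(corpus, q)
print("for loop:", time.perf_counter() - start)

# Concurrent variant, which again came out slower than the loop.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(lambda q: rerank_model.rerank(corpus, q), queries))
print("thread pool:", time.perf_counter() - start)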

Test via langchain's XinferenceEmbeddings interface

Code:

from langchain_community.embeddings import XinferenceEmbeddings
from langchain_community.vectorstores import FAISS

# Embeddings are served remotely by xinference.
xinference = XinferenceEmbeddings(server_url=xinference_url, model_uid="bge-m3")

# Load a locally persisted FAISS index built with the same model.
local_kb_xin = FAISS.load_local(
    "../data/vector_store/vector-bge-m3",
    embeddings=xinference,
    allow_dangerous_deserialization=True,
)
local_kb_xin.similarity_search(query=query, include_metadata=True, k=30)
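Note that each similarity_search call first embeds the query through XinferenceEmbeddings, i.e. one HTTP round trip to the xinference server per query, and only then searches the local FAISS index, so these timings include the same per-request overhead as the direct create_embedding tests above.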

Results:
(screenshot of timing results)

Additional notes

When testing the embedding API, the for loop is always the fastest, whether the texts are long or short.
Rerank is strongly affected by input length. The results above all use a very short query with a long corpus (the actual situation in my application). Using the example from https://inference.readthedocs.io/zh-cn/latest/user_guide/client_api.html#rerank instead, the results are as follows:
(screenshot of timing results for the short-corpus docs example)
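For concreteness, a short-corpus rerank call in the style of the linked docs page, reusing rerank_model from the sketch above (the sentences below follow the common sentence-transformers example; the exact strings on that page may differ):

query = "A man is eating pasta."
corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A woman is playing violin.",
]
# With documents this short, a single rerank call returns almost instantly.
print(rerank_model.rerank(corpus, query))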
