
How to enable batch inference with an embedding/rerank model? #1779

@SunLemuria

Description


Problem description

Embedding/rerank models launched from the UI expose no concurrency-related settings.
When the client sends requests via asyncio or concurrent.futures, it is actually slower than a synchronous for loop (the test pattern is sketched below).
How can I get the model to run inference concurrently?
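For reference, the client-side patterns being compared look roughly like this (a minimal sketch, assuming the xinference Python client; the server URL, worker count, and test texts are placeholders, and the model UID "bge-m3" is taken from the langchain snippet further down):

import time
import asyncio
from concurrent.futures import ThreadPoolExecutor

from xinference.client import Client

client = Client("http://localhost:9997")  # placeholder URL
model = client.get_model("bge-m3")
texts = ["an example sentence to embed"] * 100  # synthetic test data

# Synchronous baseline: one request at a time.
start = time.perf_counter()
for t in texts:
    model.create_embedding(t)
print("for loop:", time.perf_counter() - start)

# concurrent.futures variant: in my tests this comes out *slower* than the loop.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(model.create_embedding, texts))
print("thread pool:", time.perf_counter() - start)

# asyncio variant: the client is synchronous, so calls are pushed onto threads.
async def embed_all():
    return await asyncio.gather(
        *(asyncio.to_thread(model.create_embedding, t) for t in texts)
    )

start = time.perf_counter()
asyncio.run(embed_all())
print("asyncio:", time.perf_counter() - start)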

Models launched on the xinference side

embedding: (screenshot of the launch settings)
rerank: (screenshot of the launch settings)

Test results

Embedding API test: model.create_embedding(text)

Results:
(screenshot of timing results)
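Separately, since the endpoint is OpenAI-compatible, create_embedding may also accept a list of texts; if so, a single batched call avoids per-request overhead entirely. This is an assumption I have not verified here:

# One request carrying the whole batch instead of N single-text requests.
# Whether the deployed version accepts a list input is an assumption.
batch = ["first sentence", "second sentence", "third sentence"]
embeddings = model.create_embedding(batch)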

Rerank API test: model.rerank(corpus, query)

(screenshot of timing results)
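For completeness, the rerank timing pattern (a sketch; the server URL and the model UID "rerank-model-uid" are placeholders for the values shown in the screenshots, and the data is synthetic):

import time
from concurrent.futures import ThreadPoolExecutor

from xinference.client import Client

client = Client("http://localhost:9997")  # placeholder URL
rerank_model = client.get_model("rerank-model-uid")  # placeholder UID

queries = ["short query"] * 20
corpus = ["a long document " * 200] * 50  # short query, long corpus, as in the app

# Synchronous loop over queries.
start = time.perf_counter()
for q in queries:
    rerank_model.rerank(corpus, q)
print("for loop:", time.perf_counter() - start)

# Concurrent variant, which again came out slower than the loop.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(lambda q: rerank_model.rerank(corpus, q), queries))
print("thread pool:", time.perf_counter() - start)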

Test via langchain's XinferenceEmbeddings interface

Code:

from langchain_community.embeddings import XinferenceEmbeddings
from langchain_community.vectorstores import FAISS

# Embeddings are served remotely by xinference.
xinference = XinferenceEmbeddings(server_url=xinference_url, model_uid="bge-m3")

# Load a locally persisted FAISS index built with the same model.
local_kb_xin = FAISS.load_local(
    "../data/vector_store/vector-bge-m3",
    embeddings=xinference,
    allow_dangerous_deserialization=True,
)
local_kb_xin.similarity_search(query=query, include_metadata=True, k=30)
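Note that each similarity_search call first embeds the query through XinferenceEmbeddings, i.e. one HTTP round trip to the xinference server per query, and only then searches the local FAISS index, so these timings include the same per-request overhead as the direct create_embedding tests above.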

Results:
(screenshot of timing results)

Additional notes

When testing the embedding API, the for loop is always the fastest, whether the texts are long or short.
Rerank is strongly affected by input length. The results above all use a very short query with a long corpus (the actual situation in my application). Using the example from https://inference.readthedocs.io/zh-cn/latest/user_guide/client_api.html#rerank instead, the results are as follows:
(screenshot of timing results for the short-corpus docs example)
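For concreteness, a short-corpus rerank call in the style of the linked docs page, reusing rerank_model from the sketch above (the sentences below follow the common sentence-transformers example; the exact strings on that page may differ):

query = "A man is eating pasta."
corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A woman is playing violin.",
]
# With documents this short, a single rerank call returns almost instantly.
print(rerank_model.rerank(corpus, query))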
