✨ reasoning parser support #295
Conversation
Hey @Orion-zhen, thanks for PRing! I see from the issue that Deepseek does extend the OAI API in this way, so it should be ok to add this into Tabby as this looks like a standard. Can you please send a test request to make sure the PR works? I assume I'd use a deepseek distill to test this out?
Sure! Here are test requests for the non-streaming and streaming cases.

Non-streaming:

```python
from openai import OpenAI

client = OpenAI(api_key="<whatever-key>", base_url="<local-url>")
messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]

response = client.chat.completions.create(
    model="<model-name>",
    messages=messages
)

reasoning_content = response.choices[0].message.reasoning_content
content = response.choices[0].message.content

print(reasoning_content)
print(content)
```

Streaming:

```python
from openai import OpenAI

client = OpenAI(api_key="<whatever-key>", base_url="<local-url>")
messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]

response = client.chat.completions.create(
    model="<model-name>",
    messages=messages,
    stream=True
)

reasoning_content = ""
content = ""

for chunk in response:
    if chunk.choices[0].delta.reasoning_content:
        reasoning_content += chunk.choices[0].delta.reasoning_content
        print(reasoning_content)
    else:
        content += chunk.choices[0].delta.content
        print(content)
```
I added a reasoning start token handler, since the latest released model, QwQ-32B, can explicitly output the reasoning start token.
👋 Hi there, any update?
I would also really like for this to be merged. I tested the author's branch and it works beautifully with QwQ.
Any reason these settings are under the network section of the config file? Should they not be model specific?
Hi, @ladrua. As far as I can tell, it's an API-level configuration: the standard API response can contain a `reasoning_content` field alongside the usual `content`.
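For reference, here is a rough, illustrative sketch (not taken from any provider's docs) of the response shape the test scripts above rely on, with `reasoning_content` returned next to `content`:

```python
# Illustrative only: a trimmed-down chat completion response body in the
# DeepSeek-style extension, where the assistant message carries an extra
# "reasoning_content" field alongside the usual "content".
example_response = {
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "reasoning_content": "First compare the decimal parts ...",
                "content": "9.11 is smaller than 9.8, so 9.8 is greater.",
            },
            "finish_reason": "stop",
        }
    ]
}

print(example_response["choices"][0]["message"]["reasoning_content"])
print(example_response["choices"][0]["message"]["content"])
```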
Hi :) I see. I just noticed that some models use different thinking tokens than others.
Thank you for testing my PR! I think splitting configuration for the same topic (reasoning) across two different sections would make it inconvenient for users to set up and change configs. As for model switching, Tabby provides inline switching and switching via the config file. With inline switching, the config file is not changed, which means that if the newly loaded model uses another reasoning token, you would have to change it in the config file anyway. For the config file, I think it's better to keep the reasoning options together.
Backreading this conversation, I agree with @ladrua here. This pattern has already been established for vision in the model block.

The network block is specifically for options pertaining to the server: IPs, ports, auth, etc. Reasoning doesn't seem to be a good fit there, as it depends on the model. If users really want reasoning support from a non-reasoning model, they can add it in the model block or inline config.
Sure, I will change it then.
@Orion-zhen
Hi @Originalimoc. Although current reasoning models such as R1, QwQ-32B, etc. have
@Orion-zhen 2/ That is why this should be put into the per-model config ("tabby_config.yml"), not the global one. You cannot check the chat template this way, so just set it manually in that config file.
I'm not familiar with this per-model config feature; TBH, this is the first time I've heard of it. I think the reasoning feature is comparable to the vision feature, so putting it alongside the vision option makes sense.
It's just missing some fields when merging; after merging, it's processed by the same config code. I actually configure almost nothing in the global config.yml, because I use inline loading all the time and each model requires entirely different ctx length / cache mode / draft model / vision, etc.
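A minimal sketch of the idea under discussion, assuming a simple dict-based merge; the key names mirror the `config.model.reasoning*` attributes used in the streaming code below, the `</think>` default is an assumption, and this is not TabbyAPI's actual config loader:

```python
# Sketch only (not TabbyAPI's loader): per-model settings, when present,
# override the global config.yml values, so reasoning options can live next
# to other model-specific options like ctx length or vision.
def merge_model_config(global_model_cfg: dict, per_model_cfg: dict) -> dict:
    """Return the global model config with per-model keys layered on top."""
    merged = dict(global_model_cfg)
    merged.update({k: v for k, v in per_model_cfg.items() if v is not None})
    return merged


global_model_cfg = {
    "reasoning": False,
    "reasoning_start_token": "<think>",   # assumed default
    "reasoning_end_token": "</think>",    # assumed default
}

# e.g. loaded from a per-model tabby_config.yml or passed via inline loading
per_model_cfg = {"reasoning": True}

print(merge_model_config(global_model_cfg, per_model_cfg))
# {'reasoning': True, 'reasoning_start_token': '<think>', 'reasoning_end_token': '</think>'}
```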
@Orion-zhen Maybe we should check for
In non-streaming cases, we can simply split the complete generated text on the reasoning tokens once generation finishes, as in the sketch below.
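A minimal sketch of that non-streaming split, assuming `<think>`/`</think>` delimiters and no TabbyAPI internals; the function name and defaults are illustrative:

```python
# Sketch of non-streaming reasoning extraction: split the full generated text
# into (reasoning_content, content). The token strings are assumptions; in
# practice they would come from the (per-)model config.
def split_reasoning(text: str,
                    start_token: str = "<think>",
                    end_token: str = "</think>") -> tuple[str, str]:
    """Return (reasoning_content, content) parsed from the full output."""
    if end_token not in text:
        # No closed reasoning block; treat everything as regular content.
        return "", text
    reasoning, _, content = text.partition(end_token)
    # Some chat templates inject the start token, others let the model emit it.
    reasoning = reasoning.removeprefix(start_token).strip()
    return reasoning, content.strip()


print(split_reasoning("<think>Compare 9.11 and 9.8 ...</think>9.8 is greater."))
# ('Compare 9.11 and 9.8 ...', '9.8 is greater.')
```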
@ladrua Try adding \n to the chat template after "<think>", or enabling token healing?
After this PR gets merged, I'll create a new PR fixing the per-model config loading.
Something like this would work for the streaming bit:

```python
async def stream_generate_chat_completion(
    prompt: str,
    embeddings: MultimodalEmbeddingWrapper,
    data: ChatCompletionRequest,
    request: Request,
    model_path: pathlib.Path,
):
    """Generator for the generation process."""

    abort_event = asyncio.Event()
    gen_queue = asyncio.Queue()
    gen_tasks: List[asyncio.Task] = []
    disconnect_task = asyncio.create_task(request_disconnect_loop(request))

    try:
        logger.info(f"Received chat completion streaming request {request.state.id}")

        for n in range(0, data.n):
            task_gen_params = data.model_copy(deep=True)

            gen_task = asyncio.create_task(
                _stream_collector(
                    n,
                    gen_queue,
                    prompt,
                    request.state.id,
                    abort_event,
                    embeddings=embeddings,
                    **task_gen_params.model_dump(exclude={"prompt"}),
                )
            )

            gen_tasks.append(gen_task)

        # We need to keep track of the text generated so we can resume the tool calls
        current_generation_text = ""

        is_reasoning_chunk = config.model.reasoning

        # Add a flag to track if this is the first token
        is_first_content_chunk = True
        is_first_reasoning_chunk = True

        # Consumer loop
        while True:
            if disconnect_task.done():
                abort_event.set()
                handle_request_disconnect(
                    f"Chat completion generation {request.state.id} cancelled by user."
                )

            generation = await gen_queue.get()

            # lets only append the text if we need it for tool calls later
            if data.tool_call_start and "text" in generation:
                current_generation_text += generation["text"]

            # check if we are running a tool model, and that we are at stop
            if data.tool_call_start and "stop_str" in generation:
                generations = await generate_tool_calls(
                    data,
                    [generation],
                    request,
                    current_generations=current_generation_text,
                )
                generation = generations[0]  # We only have one generation in this case

            # Stream collector will push an exception to the queue if it fails
            if isinstance(generation, Exception):
                raise generation

            if (
                unwrap(generation.get("text"), "") == config.model.reasoning_start_token
                and config.model.reasoning
            ):
                # Update reasoning chunk flag
                is_reasoning_chunk = True

                # And skip this token
                continue

            if (
                unwrap(generation.get("text"), "") == config.model.reasoning_end_token
                and config.model.reasoning
            ):
                # Update reasoning chunk flag
                is_reasoning_chunk = False

                # And skip this token
                continue

            # Strip whitespace from the first token if it contains text
            if (
                is_first_content_chunk
                and "text" in generation
                and not is_reasoning_chunk
            ):
                # Apply lstrip() to the first token
                generation["text"] = generation["text"].lstrip()
                is_first_content_chunk = False

            if is_first_reasoning_chunk and "text" in generation and is_reasoning_chunk:
                # Apply lstrip() to the first token
                generation["text"] = generation["text"].lstrip()
                is_first_reasoning_chunk = False

            response = _create_stream_chunk(
                request.state.id,
                generation,
                model_path.name,
                is_reasoning_chunk=is_reasoning_chunk,
            )
            yield response.model_dump_json()

            # Check if all tasks are completed
            if all(task.done() for task in gen_tasks) and gen_queue.empty():
                # Send a usage chunk
                if data.stream_options and data.stream_options.include_usage:
                    usage_chunk = _create_stream_chunk(
                        request.state.id,
                        generation,
                        model_path.name,
                        is_usage_chunk=True,
                    )
                    yield usage_chunk.model_dump_json()

                logger.info(
                    f"Finished chat completion streaming request {request.state.id}"
                )

                yield "[DONE]"
                break
    except CancelledError:
        # Get out if the request gets disconnected
        if not disconnect_task.done():
            abort_event.set()
            handle_request_disconnect("Chat completion generation cancelled by user.")
    except Exception:
        yield get_generator_error(
            "Chat completion aborted. Please check the server console."
        )
```

Edit: Improved it a bit.
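For context, a simplified, dict-based illustration (not the PR's actual `_create_stream_chunk`) of how the `is_reasoning_chunk` flag could route text into the delta fields that the client scripts above consume:

```python
# Simplified illustration only: place a generated text piece into either
# "reasoning_content" or "content" on the streamed delta, depending on the
# is_reasoning_chunk flag maintained by the consumer loop above.
def build_delta(text: str, is_reasoning_chunk: bool) -> dict:
    if is_reasoning_chunk:
        return {"role": "assistant", "reasoning_content": text, "content": None}
    return {"role": "assistant", "reasoning_content": None, "content": text}


print(build_delta("9.8 is greater.", is_reasoning_chunk=False))
# {'role': 'assistant', 'reasoning_content': None, 'content': '9.8 is greater.'}
```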
This PR solves issue #294.
It implements a reasoning parser and a streaming reasoning parser, and was tested on DeepSeek-R1-Distill-Llama-70B with frontends such as Page Assist and Cherry Studio. In my tests, it worked well.
I added configuration entry points to config.yml and placed them under the ~~network~~ model section.