Conversation

Orion-zhen

@Orion-zhen Orion-zhen commented Mar 1, 2025

This PR solves issue #294.

This PR implements a reasoning parser and a streaming reasoning parser. It was tested on DeepSeek-R1-Distill-Llama-70B with frontends such as Page Assist and Cherry Studio, and in my tests it worked well.

I added configuration entries to config.yml and placed them under the network section.
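
For reference, here is a minimal sketch of the non-streaming parsing idea (illustrative only, not the actual implementation in this PR; the helper and the default token names are just examples, and in the PR they come from the new config entries):

def split_reasoning(
    text: str,
    start_token: str = "<think>",
    end_token: str = "</think>",
):
    """Return (reasoning_content, content) for a finished generation."""
    if end_token not in text:
        # No closing token: treat the whole output as normal content
        return None, text

    head, _, tail = text.partition(end_token)
    # The model may or may not emit the start token itself (it can also come
    # from the chat template), so strip it only if present
    reasoning = head.replace(start_token, "", 1).strip()
    return reasoning, tail.lstrip()

The streaming parser has to do the same split incrementally, since the start and end tokens arrive as separate chunks.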

@kingbri1
Member

kingbri1 commented Mar 5, 2025

Hey @Orion-zhen, thanks for PRing! I see from the issue that DeepSeek does extend the OAI API in this way, so it should be OK to add this to Tabby since it looks like a standard.

Can you please send a test request to make sure the PR works? I assume I'd use a DeepSeek distill to test this out?

@Orion-zhen
Author

Orion-zhen commented Mar 6, 2025

Sure! Here are test requests for non-streaming and streaming:

non-streaming:

from openai import OpenAI
client = OpenAI(api_key="<whatever-key>", base_url="<local-url>")

messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]
response = client.chat.completions.create(
    model="<model-name>",
    messages=messages
)

reasoning_content = response.choices[0].message.reasoning_content
content = response.choices[0].message.content

print(reasoning_content)
print(content)

streaming:

from openai import OpenAI
client = OpenAI(api_key="<whatever-key>", base_url="<local-url>")

messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]
response = client.chat.completions.create(
    model="<model-name>",
    messages=messages,
    stream=True
)

reasoning_content = ""
content = ""

for chunk in response:
    if not chunk.choices:
        continue  # skip usage-only chunks
    delta = chunk.choices[0].delta
    # reasoning_content is an extension field, so fall back to None if absent
    if getattr(delta, "reasoning_content", None):
        reasoning_content += delta.reasoning_content
        print(reasoning_content)
    elif delta.content:
        content += delta.content
        print(content)

@Orion-zhen
Author

I added a reasoning start token handler, because the newly released QwQ-32B can explicitly output the <think> token.

@Orion-zhen
Author

👋 Hi there, any update?

@InfernalDread

I would also really like this to be merged; I tested the author's branch and it works beautifully with QwQ.

@ladrua

ladrua commented Mar 18, 2025

Any reason these settings are under the network section of the config file? Should they not be model specific?

@Orion-zhen
Author

Any reason these settings are under the network section of the config file? Should they not be model specific?

Hi, @ladrua. In my view, it's an API-level configuration. The standard API can return reasoning_content even if the model is not a reasoning model. For example, you can use a system prompt to force the model to think between <think> and </think> tokens, and that output should be regarded as reasoning_content as well.
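
For example, a hypothetical request along these lines (reusing the client setup from the test snippets above; the system prompt wording is just illustrative):

from openai import OpenAI

client = OpenAI(api_key="<whatever-key>", base_url="<local-url>")

# Hypothetical system prompt that steers a non-reasoning model to "think"
# between <think> and </think>; with reasoning enabled on the server, that
# span comes back as reasoning_content instead of being mixed into content
messages = [
    {
        "role": "system",
        "content": "Before answering, reason step by step between <think> and "
        "</think>, then give only the final answer.",
    },
    {"role": "user", "content": "9.11 and 9.8, which is greater?"},
]

response = client.chat.completions.create(model="<model-name>", messages=messages)
print(response.choices[0].message.reasoning_content)
print(response.choices[0].message.content)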

@ladrua

ladrua commented Mar 18, 2025

Hi :) I see. I just noticed that some models use different thinking tokens: some use thinking (plural), some use reasoning. Since tabby supports model switching, would it not make sense to at least have the thinking tokens defined on a per-model basis? Anyway, I have just tested your PR and it's working great with QwQ-32B. Thank you.

@Orion-zhen
Author

Hi :) I see. I just noticed that some models use different thinking tokens: some use thinking (plural), some use reasoning. Since tabby supports model switching, would it not make sense to at least have the thinking tokens defined on a per-model basis? Anyway, I have just tested your PR and it's working great with QwQ-32B. Thank you.

Thank you for testing my PR! I think splitting configuration for the same topic (reasoning) across two different sections would make it inconvenient for users to set up and change configs. As for model switching, tabby provides inline switching and switching via the config file. With inline switching, the config file is not changed, which means that if the newly loaded model uses a different reasoning token, you would have to change it in the config file anyway. For config files, I think it's better to keep all the reasoning-related options together.

@kingbri1
Member

Backreading this conversation, I agree with @ladrua here. This pattern has already been established for vision with the vision: true flag in config.

The network block is specifically for options pertaining to the server regarding IPs, ports, auth, etc. Reasoning doesn't seem to be a good fit here as that depends on the model. If users really want reasoning support from a non-reasoning model, then add it in the model block or inline config.

@Orion-zhen
Author

Sure, I will change it then.

@Orion-zhen
Author

@ladrua and @kingbri1, thank you for your insights.

@Originalimoc

Originalimoc commented Mar 19, 2025

@Orion-zhen
Can this now be configured in tabby_config.yml? That is where the per-model config resides.
"reasoning" could be set globally, but "reasoning_start_token" and "reasoning_end_token" should be in the model config; heck, put all three into the per-model config.
And there should be a check of the chat template for whether <reasoning_start_token> is really there, to prevent undesired adding/parsing. Prepending the token (or nothing if there is none) to the full response and then parsing would be a better idea.

@Orion-zhen
Author

@Orion-zhen Can this now be configured in tabby_config.yml? That is where the per-model config resides. "reasoning" could be set globally, but "reasoning_start_token" and "reasoning_end_token" should be in the model config; heck, put all three into the per-model config. And there should be a check of the chat template for whether <reasoning_start_token> is really there, to prevent undesired adding/parsing. Prepending the token (or nothing if there is none) to the full response and then parsing would be a better idea.

Hi @Originalimoc.

  1. This config can now be set in config.yml under the model section, just like the vision flag.
  2. reasoning, reasoning_start_token, and reasoning_end_token are all placed in the per-model config.
  3. I'm currently not considering checking whether reasoning_start_token is in the chat template. I have warned users in config_sample.yml not to enable reasoning if the model is not a reasoning model. Checking the chat template would further increase code complexity, and similar implementations such as vLLM don't check the chat template either.

@Orion-zhen
Author

Although current reasoning models such as R1, QwQ-32B, etc. have a reasoning_start_token in their chat templates, it's uncertain whether future models will think naturally without a reasoning_start_token being explicitly placed in the chat template. Detecting reasoning_start_token in the chat template might therefore not be a good criterion.

@Originalimoc

Originalimoc commented Mar 19, 2025

@Orion-zhen
1/ I mean in model_folder (like Qwen2-VL-7B)/tabby_config.yml, which looks like this:
[screenshot: example tabby_config.yml inside a model folder]

2/ That is why this should be put into the per-model config tabby_config.yml, not the global one; you can't check the chat template this way, so just set it manually in that config file.
3/ QwQ usually starts outputting its <think> right after <assistant>. Some sizes of the R1 distills do that too. But it might also just output the answer without <think></think>; the chat template is only a steering method.

@Originalimoc

Originalimoc commented Mar 19, 2025

The current per-model config is actually buggy, and I have some fixes for it that aren't pushed yet:
[screenshots: diff of the per-model config loading fixes]
If your tabby_config.yml doesn't work, it's probably because of this. You can merge the diff above (minus some debugging prints) into this PR and try again.

@Orion-zhen
Author

I'm not familiar with this per-model config feature; to be honest, this is the first time I've heard of it. I think the reasoning feature is comparable to the vision feature, so putting it alongside the vision flag should make sense. As you noted, the vision flag should arguably be per-model config too, but it still appears in the global config.yml. And since, according to your comment, the per-model config is somewhat buggy, I'd rather keep this config in config.yml until that feature becomes stable and keep this PR focused on the reasoning parser.

@Originalimoc

It's just missing some fields when merging; after merging, it's processed by the same config code. I actually configure almost nothing in the global config.yml because I use inline loading all the time and each model requires an entirely different context length, cache mode, draft model, vision setting, etc.

@ladrua

ladrua commented Mar 20, 2025

@Orion-zhen Maybe we should check for \n or whitespace at the beginning of the splits? When I use this with QwQ 32B it always starts with an empty new line, both for the reasoning tokens and the normal content tokens.

@Orion-zhen
Author

@Orion-zhen Maybe we should check for \n or whitespace at the beginning of the splits? When I use this with QwQ 32B it always starts with an empty new line, both for the reasoning tokens and the normal content tokens.

In non-streaming cases, we can simply strip the contents. But in streaming cases, checking for whitespace is trickier: the model generates token by token, so we would have to track the reasoning start/end token as well as the token(s) that follow. The whole streaming logic is implemented in a loop, so I would have to add more variables outside the loop to record state. I'm not sure it's worth adding that complexity for this feature, because in my use cases, chatbot apps like Cherry Studio and Chatbox display messages normally even with the leading whitespace. Do you have any good ideas? 🤔

@Originalimoc

@ladrua Try adding \n to the chat template after "<think>", or enabling token healing?

@Originalimoc

After this PR gets merged, I'll create a new PR fixing the per-model config loading.

@ladrua

ladrua commented Mar 21, 2025

Do you have any good ideas? 🤔

Something like this would work for the streaming bit:

async def stream_generate_chat_completion(
    prompt: str,
    embeddings: MultimodalEmbeddingWrapper,
    data: ChatCompletionRequest,
    request: Request,
    model_path: pathlib.Path,
):
    """Generator for the generation process."""
    abort_event = asyncio.Event()
    gen_queue = asyncio.Queue()
    gen_tasks: List[asyncio.Task] = []
    disconnect_task = asyncio.create_task(request_disconnect_loop(request))

    try:
        logger.info(f"Received chat completion streaming request {request.state.id}")

        for n in range(0, data.n):
            task_gen_params = data.model_copy(deep=True)

            gen_task = asyncio.create_task(
                _stream_collector(
                    n,
                    gen_queue,
                    prompt,
                    request.state.id,
                    abort_event,
                    embeddings=embeddings,
                    **task_gen_params.model_dump(exclude={"prompt"}),
                )
            )

            gen_tasks.append(gen_task)

        # We need to keep track of the text generated so we can resume the tool calls
        current_generation_text = ""

        is_reasoning_chunk = config.model.reasoning
        # Flags to track the first content/reasoning chunks so their leading
        # whitespace can be stripped
        is_first_content_chunk = True
        is_first_reasoning_chunk = True

        # Consumer loop
        while True:
            if disconnect_task.done():
                abort_event.set()
                handle_request_disconnect(
                    f"Chat completion generation {request.state.id} cancelled by user."
                )

            generation = await gen_queue.get()
            # Only append the text if we need it for tool calls later
            if data.tool_call_start and "text" in generation:
                current_generation_text += generation["text"]

            # check if we are running a tool model, and that we are at stop
            if data.tool_call_start and "stop_str" in generation:
                generations = await generate_tool_calls(
                    data,
                    [generation],
                    request,
                    current_generations=current_generation_text,
                )
                generation = generations[0]  # We only have one generation in this case

            # Stream collector will push an exception to the queue if it fails
            if isinstance(generation, Exception):
                raise generation

            if (
                unwrap(generation.get("text"), "") == config.model.reasoning_start_token
                and config.model.reasoning
            ):
                # Update reasoning chunk flag
                is_reasoning_chunk = True
                # And skip this token
                continue
            if (
                unwrap(generation.get("text"), "") == config.model.reasoning_end_token
                and config.model.reasoning
            ):
                # Update reasoning chunk flag
                is_reasoning_chunk = False
                # And skip this token
                continue

            # Strip whitespace from the first token if it contains text
            if (
                is_first_content_chunk
                and "text" in generation
                and not is_reasoning_chunk
            ):
                # Apply lstrip() to the first token
                generation["text"] = generation["text"].lstrip()
                is_first_content_chunk = False

            if is_first_reasoning_chunk and "text" in generation and is_reasoning_chunk:
                # Apply lstrip() to the first token
                generation["text"] = generation["text"].lstrip()
                is_first_reasoning_chunk = False

            response = _create_stream_chunk(
                request.state.id,
                generation,
                model_path.name,
                is_reasoning_chunk=is_reasoning_chunk,
            )
            yield response.model_dump_json()

            # Check if all tasks are completed
            if all(task.done() for task in gen_tasks) and gen_queue.empty():
                # Send a usage chunk
                if data.stream_options and data.stream_options.include_usage:
                    usage_chunk = _create_stream_chunk(
                        request.state.id,
                        generation,
                        model_path.name,
                        is_usage_chunk=True,
                    )
                    yield usage_chunk.model_dump_json()

                logger.info(
                    f"Finished chat completion streaming request {request.state.id}"
                )

                yield "[DONE]"
                break
    except CancelledError:
        # Get out if the request gets disconnected

        if not disconnect_task.done():
            abort_event.set()
            handle_request_disconnect("Chat completion generation cancelled by user.")
    except Exception:
        yield get_generator_error(
            "Chat completion aborted. Please check the server console."
        )

Edit: Improved it a bit.
