Conversation

Orion-zhen

@Orion-zhen Orion-zhen commented Mar 1, 2025

This PR solves issue #294.

This PR implements a reasoning parser and a streaming reasoning parser. It was tested on DeepSeek-R1-Distill-Llama-70B with frontends such as Page Assist and Cherry Studio, and in my tests it worked well.

I added configuration entries to config.yml and placed them under the network section.
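
For reference, here is a minimal sketch of the non-streaming parsing idea (illustrative only, not the actual implementation in this PR; the helper and the default token names are just examples, and in the PR they come from the new config entries):

def split_reasoning(
    text: str,
    start_token: str = "<think>",
    end_token: str = "</think>",
):
    """Return (reasoning_content, content) for a finished generation."""
    if end_token not in text:
        # No closing token: treat the whole output as normal content
        return None, text

    head, _, tail = text.partition(end_token)
    # The model may or may not emit the start token itself (it can also come
    # from the chat template), so strip it only if present
    reasoning = head.replace(start_token, "", 1).strip()
    return reasoning, tail.lstrip()

The streaming parser has to do the same split incrementally, since the start and end tokens arrive as separate chunks.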

@kingbri1
Member

kingbri1 commented Mar 5, 2025

Hey @Orion-zhen, thanks for PRing! I see from the issue that DeepSeek does extend the OAI API in this way, so it should be OK to add this to Tabby since it looks like a standard.

Can you please send a test request to make sure the PR works? I assume I'd use a DeepSeek distill to test this out?

@Orion-zhen
Author

Orion-zhen commented Mar 6, 2025

Sure! Here are test requests for non-streaming and streaming:

non-streaming:

from openai import OpenAI
client = OpenAI(api_key="<whatever-key>", base_url="<local-url>")

messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]
response = client.chat.completions.create(
    model="<model-name>",
    messages=messages
)

reasoning_content = response.choices[0].message.reasoning_content
content = response.choices[0].message.content

print(reasoning_content)
print(content)

streaming:

from openai import OpenAI
client = OpenAI(api_key="<whatever-key>", base_url="<local-url>")

messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]
response = client.chat.completions.create(
    model="<model-name>",
    messages=messages,
    stream=True
)

reasoning_content = ""
content = ""

for chunk in response:
    if not chunk.choices:
        continue  # skip usage-only chunks
    delta = chunk.choices[0].delta
    # reasoning_content is an extension field, so fall back to None if absent
    if getattr(delta, "reasoning_content", None):
        reasoning_content += delta.reasoning_content
        print(reasoning_content)
    elif delta.content:
        content += delta.content
        print(content)

@Orion-zhen
Author

I added a reasoning start token handler, because the newly released QwQ-32B can explicitly output the <think> token.

@Orion-zhen
Author

👋 Hi there, any update?

@InfernalDread

I would also really like this to be merged; I tested the author's branch and it works beautifully with QwQ.

@ladrua

ladrua commented Mar 18, 2025

Any reason these settings are under the network section of the config file? Should they not be model specific?

@Orion-zhen
Author

Any reason these settings are under the network section of the config file? Should they not be model specific?

Hi, @ladrua. In my view, it's an API-level configuration. The standard API can return reasoning_content even if the model is not a reasoning model. For example, you can use a system prompt to force the model to think between <think> and </think> tokens, and that output should be regarded as reasoning_content as well.
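
For example, a hypothetical request along these lines (reusing the client setup from the test snippets above; the system prompt wording is just illustrative):

from openai import OpenAI

client = OpenAI(api_key="<whatever-key>", base_url="<local-url>")

# Hypothetical system prompt that steers a non-reasoning model to "think"
# between <think> and </think>; with reasoning enabled on the server, that
# span comes back as reasoning_content instead of being mixed into content
messages = [
    {
        "role": "system",
        "content": "Before answering, reason step by step between <think> and "
        "</think>, then give only the final answer.",
    },
    {"role": "user", "content": "9.11 and 9.8, which is greater?"},
]

response = client.chat.completions.create(model="<model-name>", messages=messages)
print(response.choices[0].message.reasoning_content)
print(response.choices[0].message.content)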

@ladrua

ladrua commented Mar 18, 2025

Hi :) I see. I just noticed that some models use different thinking tokens: some use thinking (plural), some use reasoning. Since tabby supports model switching, would it not make sense to at least have the thinking tokens defined on a per-model basis? Anyway, I have just tested your PR and it's working great with QwQ-32B. Thank you.

@Orion-zhen
Author

Hi :) I see. I just noticed that some models use different thinking tokens: some use thinking (plural), some use reasoning. Since tabby supports model switching, would it not make sense to at least have the thinking tokens defined on a per-model basis? Anyway, I have just tested your PR and it's working great with QwQ-32B. Thank you.

Thank you for testing my PR! I think splitting configuration for the same topic (reasoning) across two different sections would make it inconvenient for users to set up and change configs. As for model switching, tabby provides inline switching and switching via the config file. With inline switching, the config file is not changed, which means that if the newly loaded model uses a different reasoning token, you would have to change it in the config file anyway. For config files, I think it's better to keep all the reasoning-related options together.

@kingbri1
Member

Backreading this conversation, I agree with @ladrua here. This pattern has already been established for vision with the vision: true flag in config.

The network block is specifically for options pertaining to the server regarding IPs, ports, auth, etc. Reasoning doesn't seem to be a good fit here as that depends on the model. If users really want reasoning support from a non-reasoning model, then add it in the model block or inline config.

@Orion-zhen
Author

Sure, I will change it then.

@Orion-zhen
Author

@ladrua and @kingbri1, thank you for your insights.

@Originalimoc

Originalimoc commented Mar 19, 2025

@Orion-zhen
Can this now be configured in tabby_config.yml? That is where the per-model config resides.
"reasoning" could be set globally, but "reasoning_start_token" and "reasoning_end_token" should be in the model config; heck, put all three into the per-model config.
And there should be a check of the chat template for whether <reasoning_start_token> is really there, to prevent undesired adding/parsing. Prepending the token (or nothing if there is none) to the full response and then parsing would be a better idea.

@Orion-zhen
Author

@Orion-zhen Can this now be configured in tabby_config.yml? That is where the per-model config resides. "reasoning" could be set globally, but "reasoning_start_token" and "reasoning_end_token" should be in the model config; heck, put all three into the per-model config. And there should be a check of the chat template for whether <reasoning_start_token> is really there, to prevent undesired adding/parsing. Prepending the token (or nothing if there is none) to the full response and then parsing would be a better idea.

Hi @Originalimoc.

  1. This config can now be set in config.yml under the model section, just like the vision flag.
  2. reasoning, reasoning_start_token, and reasoning_end_token are all placed in the per-model config.
  3. I'm currently not considering checking whether reasoning_start_token is in the chat template. I have warned users in config_sample.yml not to enable reasoning if the model is not a reasoning model. Checking the chat template would further increase code complexity, and similar implementations such as vLLM don't check the chat template either.

@Orion-zhen
Author

Although current reasoning models such as R1, QwQ-32B, etc. have a reasoning_start_token in their chat templates, it's uncertain whether future models will think naturally without a reasoning_start_token being explicitly placed in the chat template. Detecting reasoning_start_token in the chat template might therefore not be a good criterion.

@Originalimoc

Originalimoc commented Mar 19, 2025

@Orion-zhen
1/ I mean in model_folder (like Qwen2-VL-7B)/tabby_config.yml, which looks like this:
[screenshot: example tabby_config.yml inside a model folder]

2/ That is why this should be put into the per-model config tabby_config.yml, not the global one; you can't check the chat template this way, so just set it manually in that config file.
3/ QwQ usually starts outputting its <think> right after <assistant>. Some sizes of the R1 distills do that too. But it might also just output the answer without <think></think>; the chat template is only a steering method.

@Originalimoc

Originalimoc commented Mar 19, 2025

The current per-model config is actually buggy, and I have some fixes for it that aren't pushed yet:
[screenshots: diff of the per-model config loading fixes]
If your tabby_config.yml doesn't work, it's probably because of this. You can merge the diff above (minus some debugging prints) into this PR and try again.

@Orion-zhen
Author

I'm not familiar with this per-model config feature; to be honest, this is the first time I've heard of it. I think the reasoning feature is comparable to the vision feature, so putting it alongside the vision flag should make sense. As you noted, the vision flag should arguably be per-model config too, but it still appears in the global config.yml. And since, according to your comment, the per-model config is somewhat buggy, I'd rather keep this config in config.yml until that feature becomes stable and keep this PR focused on the reasoning parser.

@Originalimoc

It's just missing some fields when merging; after merging, it's processed by the same config code. I actually configure almost nothing in the global config.yml because I use inline loading all the time and each model requires an entirely different context length, cache mode, draft model, vision setting, etc.

@ladrua

ladrua commented Mar 20, 2025

@Orion-zhen Maybe we should check for \n or whitespace at the beginning of the splits? When I use this with QwQ 32B it always starts with an empty new line, both for the reasoning tokens and the normal content tokens.

@Orion-zhen
Author

@Orion-zhen Maybe we should check for \n or whitespace at the beginning of the splits? When I use this with QwQ 32B it always starts with an empty new line, both for the reasoning tokens and the normal content tokens.

In non-streaming cases, we can simply strip the contents. But in streaming cases, checking for whitespace is trickier: the model generates token by token, so we would have to track the reasoning start/end token as well as the token(s) that follow. The whole streaming logic is implemented in a loop, so I would have to add more variables outside the loop to record state. I'm not sure it's worth adding that complexity for this feature, because in my use cases, chatbot apps like Cherry Studio and Chatbox display messages normally even with the leading whitespace. Do you have any good ideas? 🤔

@Originalimoc

@ladrua Try adding \n to the chat template after "<think>", or enabling token healing?

@Originalimoc

After this PR gets merged, I'll create a new PR fixing the per-model config loading.

@ladrua

ladrua commented Mar 21, 2025

Do you have any good ideas? 🤔

Something like this would work for the streaming bit:

async def stream_generate_chat_completion(
    prompt: str,
    embeddings: MultimodalEmbeddingWrapper,
    data: ChatCompletionRequest,
    request: Request,
    model_path: pathlib.Path,
):
    """Generator for the generation process."""
    abort_event = asyncio.Event()
    gen_queue = asyncio.Queue()
    gen_tasks: List[asyncio.Task] = []
    disconnect_task = asyncio.create_task(request_disconnect_loop(request))

    try:
        logger.info(f"Received chat completion streaming request {request.state.id}")

        for n in range(0, data.n):
            task_gen_params = data.model_copy(deep=True)

            gen_task = asyncio.create_task(
                _stream_collector(
                    n,
                    gen_queue,
                    prompt,
                    request.state.id,
                    abort_event,
                    embeddings=embeddings,
                    **task_gen_params.model_dump(exclude={"prompt"}),
                )
            )

            gen_tasks.append(gen_task)

        # We need to keep track of the text generated so we can resume the tool calls
        current_generation_text = ""

        is_reasoning_chunk = config.model.reasoning
        # Flags to track the first content/reasoning chunks so their leading
        # whitespace can be stripped
        is_first_content_chunk = True
        is_first_reasoning_chunk = True

        # Consumer loop
        while True:
            if disconnect_task.done():
                abort_event.set()
                handle_request_disconnect(
                    f"Chat completion generation {request.state.id} cancelled by user."
                )

            generation = await gen_queue.get()
            # Only append the text if we need it for tool calls later
            if data.tool_call_start and "text" in generation:
                current_generation_text += generation["text"]

            # check if we are running a tool model, and that we are at stop
            if data.tool_call_start and "stop_str" in generation:
                generations = await generate_tool_calls(
                    data,
                    [generation],
                    request,
                    current_generations=current_generation_text,
                )
                generation = generations[0]  # We only have one generation in this case

            # Stream collector will push an exception to the queue if it fails
            if isinstance(generation, Exception):
                raise generation

            if (
                unwrap(generation.get("text"), "") == config.model.reasoning_start_token
                and config.model.reasoning
            ):
                # Update reasoning chunk flag
                is_reasoning_chunk = True
                # And skip this token
                continue
            if (
                unwrap(generation.get("text"), "") == config.model.reasoning_end_token
                and config.model.reasoning
            ):
                # Update reasoning chunk flag
                is_reasoning_chunk = False
                # And skip this token
                continue

            # Strip whitespace from the first token if it contains text
            if (
                is_first_content_chunk
                and "text" in generation
                and not is_reasoning_chunk
            ):
                # Apply lstrip() to the first token
                generation["text"] = generation["text"].lstrip()
                is_first_content_chunk = False

            if is_first_reasoning_chunk and "text" in generation and is_reasoning_chunk:
                # Apply lstrip() to the first token
                generation["text"] = generation["text"].lstrip()
                is_first_reasoning_chunk = False

            response = _create_stream_chunk(
                request.state.id,
                generation,
                model_path.name,
                is_reasoning_chunk=is_reasoning_chunk,
            )
            yield response.model_dump_json()

            # Check if all tasks are completed
            if all(task.done() for task in gen_tasks) and gen_queue.empty():
                # Send a usage chunk
                if data.stream_options and data.stream_options.include_usage:
                    usage_chunk = _create_stream_chunk(
                        request.state.id,
                        generation,
                        model_path.name,
                        is_usage_chunk=True,
                    )
                    yield usage_chunk.model_dump_json()

                logger.info(
                    f"Finished chat completion streaming request {request.state.id}"
                )

                yield "[DONE]"
                break
    except CancelledError:
        # Get out if the request gets disconnected

        if not disconnect_task.done():
            abort_event.set()
            handle_request_disconnect("Chat completion generation cancelled by user.")
    except Exception:
        yield get_generator_error(
            "Chat completion aborted. Please check the server console."
        )

Edit: Improved it a bit.
