
LiteLLM streaming not truly asynchronous - returns batch processed chunks #1306

@DennySORA

Description


Describe the bug
The streaming implementation in lite_llm.py contains an async/await usage error that causes a "fake streaming" phenomenon. The current implementation waits for all content to be fully returned before chunking it into Events, rather than providing true real-time streaming.

To Reproduce
Steps to reproduce the behavior:

  1. Install google-adk==1.2.1
  2. Use LiteLLM model with streaming calls
  3. Observe the response behavior
  4. Notice that streaming is not real-time but rather batch processing disguised as streaming

Expected behavior
Should implement true real-time streaming where each chunk is returned immediately as it's generated, rather than waiting for all content to complete before batch processing.

Root Cause Analysis
The issue occurs in lite_llm.py#L682-L683:

Current incorrect implementation:

for part in self.llm_client.completion(**completion_args):
    for chunk, finish_reason in model_response_to_chunk(part):
        ...  # chunks are only processed after the blocking call completes

Correct implementation should be:

async for part in await self.llm_client.acompletion(**completion_args):
    for chunk, finish_reason in model_response_to_chunk(part):
        ...  # each chunk is handled as soon as it arrives

Technical Explanation:

  1. Synchronous Issue: The completion() call is synchronous; iterating it from async code blocks the event loop in streaming mode, so iteration cannot begin until the data has been received
  2. Fake Streaming Phenomenon: Because of this blocking, the chunks are buffered and processed in one burst, creating the illusion of streaming over what is effectively batch processing
  3. Proper Async Implementation: Using acompletion() with async for enables true non-blocking streaming
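The difference between the two call patterns can be sketched with a stand-in client. This is a minimal illustration, not ADK's actual code: FakeLLMClient and its method bodies are hypothetical, modeled loosely on litellm's completion()/acompletion() pair.

```python
import asyncio


class FakeLLMClient:
    """Hypothetical stand-in for the litellm client; names are illustrative."""

    def completion(self, **kwargs):
        # Synchronous path: the full response is materialized before the
        # caller can iterate it -- the "fake streaming" behavior.
        return [f"part{i}" for i in range(3)]

    async def acompletion(self, **kwargs):
        # Async path: returns an async iterator; each chunk becomes
        # available only after an await, without blocking the event loop.
        async def gen():
            for i in range(3):
                await asyncio.sleep(0)  # chunk arrives asynchronously
                yield f"part{i}"
        return gen()


async def stream(client):
    parts = []
    # The proposed pattern: await acompletion(), then async-iterate it.
    async for part in await client.acompletion(stream=True):
        parts.append(part)
    return parts


parts = asyncio.run(stream(FakeLLMClient()))
```

With the synchronous completion() path, every "chunk" already exists in memory before the first loop iteration runs; with acompletion(), the consumer sees each chunk at the moment it is produced.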

Python Async Principles:

  • async for allows execution to pause at each iteration, giving other coroutines a chance to execute
  • await ensures proper waiting for async functions without blocking the event loop
  • True streaming requires non-blocking iterators, which can only be achieved through async for
  • A synchronous for loop never yields control to the event loop, so concurrent coroutines stall until it finishes, losing the real-time nature of streaming
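The principles above can be demonstrated without any LLM at all. The sketch below (all names are invented for illustration) runs a simulated token stream next to a concurrent "heartbeat" coroutine; the heartbeat can only make progress because async for returns control to the event loop between chunks.

```python
import asyncio


async def fake_token_stream(n, delay=0.01):
    # Simulated provider stream: awaits between tokens, like a real socket.
    for i in range(n):
        await asyncio.sleep(delay)
        yield f"tok{i}"


async def heartbeat(ticks):
    # A concurrent coroutine; it can only run if the stream consumer
    # yields control back to the event loop between chunks.
    for _ in range(5):
        await asyncio.sleep(0.005)
        ticks.append("tick")


async def main():
    ticks = []
    hb = asyncio.create_task(heartbeat(ticks))
    received = []
    async for tok in fake_token_stream(5):
        # Record how far the heartbeat has progressed at each chunk:
        # a nonzero count proves the consuming loop was not blocking.
        received.append((tok, len(ticks)))
    await hb
    return received, ticks


received, ticks = asyncio.run(main())
```

If the consuming loop were a blocking synchronous iteration instead, the heartbeat counts recorded alongside each token would all be zero, because the event loop would never get a chance to run the other task until the stream finished.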

Desktop (please complete the following information):

  • OS: macOS
  • Python version: 3.13
  • ADK version: 1.2.1

Additional context
Reference to LiteLLM's official documentation for correct async streaming implementation:
https://docs.litellm.ai/docs/completion/stream#async-streaming

Video demonstration: https://youtu.be/nF1G4leSbSI

This issue affects all applications that rely on true streaming responses, particularly real-time interactive chat applications.

Proposed Fix
The fix requires changing the synchronous iteration to asynchronous iteration, and every method up the call stack that invokes this code must itself be marked async and awaited (or async-iterated) appropriately.
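The call-stack requirement can be sketched as follows. This is not ADK's real internals: provider_stream() is invented, and generate_content_async is used here only as an illustrative name for an intermediate frame.

```python
import asyncio
from typing import AsyncIterator


async def provider_stream() -> AsyncIterator[str]:
    # Hypothetical provider-side stream standing in for acompletion().
    for word in ["hello", "streaming", "world"]:
        await asyncio.sleep(0)
        yield word


async def generate_content_async() -> AsyncIterator[str]:
    # Every frame between the provider call and the end consumer must be
    # an async generator (or an awaited coroutine). A single synchronous
    # frame in the stack re-buffers the stream and reintroduces batching.
    async for chunk in provider_stream():
        yield chunk  # forward each chunk immediately, no buffering


async def consume() -> list:
    return [chunk async for chunk in generate_content_async()]


result = asyncio.run(consume())
```

The key design point is that the intermediate frame forwards each chunk with yield inside async for rather than collecting results into a list first; collecting first would be exactly the batch-then-emit behavior this issue describes.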
