
Conversation


@asimurka asimurka commented Nov 10, 2025

Description

This PR adds exception handling for LLM quota limit exceedances and introduces a unified custom response model for all quota-related errors, including model-level and internal quota exceedances.

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change
  • Unit tests improvement
  • Integration tests improvement
  • End to end tests improvement

Related Tickets & Documents

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • New Features

    • Query endpoints (v1, v2, streaming) now return standardized quota-exceeded errors (HTTP 429) with structured detail and examples; added public QuotaExceededResponse.
  • Documentation

    • OpenAPI updated to document 429 quota responses and include example quota-failure scenarios across relevant endpoints.
  • Tests

    • Added unit tests confirming 429 quota-exceeded behavior for query and streaming endpoints.


coderabbitai bot commented Nov 10, 2025

Walkthrough

Adds a new public QuotaExceededResponse schema and wires HTTP 429 quota-exceeded handling into v1/v2 query and streaming endpoints by catching RateLimitError and returning QuotaExceededResponse; updates OpenAPI docs and tests; moves litellm to top-level dependencies.
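The error translation described in the walkthrough can be sketched as follows. This is a hedged illustration, not the PR's actual code: `RateLimitError` here is a local stand-in for `litellm.exceptions.RateLimitError`, and the real endpoints raise FastAPI's `HTTPException` with this payload rather than printing it.

```python
# Minimal sketch of the RateLimitError -> HTTP 429 translation wired into the
# query endpoints. The stand-in exception mirrors the attributes the handlers
# read (model, llm_provider); the payload shape follows the PR's tests.

class RateLimitError(Exception):
    """Stand-in for litellm.exceptions.RateLimitError."""

    def __init__(self, model: str, llm_provider: str, message: str = "") -> None:
        super().__init__(message)
        self.model = model
        self.llm_provider = llm_provider


def quota_exceeded_detail(exc: RateLimitError) -> dict:
    """Build the 429 detail payload the endpoints return."""
    # getattr guards against provider exceptions that lack a .model attribute.
    model_name = getattr(exc, "model", "unknown")
    return {
        "response": "Model quota exceeded",
        "cause": f"The token quota for model {model_name} has been exceeded.",
    }


try:
    raise RateLimitError(model="gpt-4-turbo", llm_provider="openai")
except RateLimitError as e:
    detail = quota_exceeded_detail(e)
    # A real handler would: raise HTTPException(status_code=429, detail=detail)
    print(detail["cause"])
```

In the actual endpoints this `except` block wraps the model invocation, so a provider rate limit surfaces to clients as a structured 429 instead of a 500.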

Changes

Cohort / File(s) Change Summary
OpenAPI Schema & Docs
docs/openapi.json
Added public QuotaExceededResponse schema (with examples) and added 429 responses referencing it to /v1/query, /v1/streaming_query, and /v2/query; updated response mappings accordingly.
Query Endpoints
src/app/endpoints/query.py, src/app/endpoints/query_v2.py, src/app/endpoints/streaming_query.py
Imported RateLimitError and QuotaExceededResponse; added 429 entries to endpoint response definitions; catch RateLimitError in handlers/generators and translate to HTTP 429 with a QuotaExceededResponse payload.
Response Models
src/models/responses.py
Added QuotaExceededResponse(AbstractErrorResponse) with constructor def __init__(self, user_id: str, model_name: str, limit: int) and model_config examples documenting quota-exceeded scenarios.
Unit Tests
tests/unit/app/endpoints/test_query.py, tests/unit/app/endpoints/test_query_v2.py, tests/unit/app/endpoints/test_streaming_query.py
Added tests test_query_endpoint_quota_exceeded that mock RateLimitError to assert HTTP 429 and quota detail/cause; added RateLimitError imports. Note: test_query_v2.py contains a duplicated test and redundant import.
Project Config
pyproject.toml
Moved litellm>=1.75.5.post1 from the llslibdev group to top-level dependencies.

Sequence Diagram(s)

sequenceDiagram
    participant C as Client
    participant E as Query Endpoint (v1/v2/streaming)
    participant A as Handler/Agent
    participant L as LLM Provider
    participant H as Error Handler

    C->>E: POST /v1/query or /v2/query or start streaming
    E->>A: process request / create turn
    A->>L: invoke model
    alt RateLimitError raised
        L-->>A: raises RateLimitError
        A-->>E: propagate RateLimitError
        E->>H: catch RateLimitError
        H-->>E: build QuotaExceededResponse(user_id, model_name, limit)
        E-->>C: HTTP 429 {"detail": "...", "cause": "...model_name..."}
    else Success
        L-->>A: model response
        A-->>E: normal response
        E-->>C: HTTP 200 / stream
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Extra attention:
    • src/app/endpoints/query.py — ensure both catch sites consistently translate RateLimitError to 429 and payload shape matches QuotaExceededResponse.
    • src/app/endpoints/streaming_query.py — verify generator-level exception handling and streaming teardown on 429.
    • docs/openapi.json — confirm schema references and examples are valid JSON Schema and referenced paths.
    • tests/unit/app/endpoints/test_query_v2.py — remove duplicated test and redundant imports; ensure mocks produce the same QuotaExceededResponse shape.

Possibly related PRs

Suggested labels

ok-to-test

Poem

🐰 I hopped into code at break of day,
Found tokens trimmed and models kept at bay,
I caught RateLimit with a gentle bite,
Returned a friendly 429 light,
Now quotas sleep beneath moonlight.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title directly and concisely summarizes the main change: adding model quota limit exception handling across the API endpoints.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.



openshift-ci bot commented Nov 10, 2025

Hi @asimurka. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@asimurka asimurka force-pushed the exceeded_quota_error_handling branch from 2b65716 to c9f1325 Compare November 10, 2025 09:46

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (3)
src/models/responses.py (1)

1145-1207: New QuotaExceededResponse: unify message text; remove trailing whitespace

  • Unify the detail.response wording with your examples and endpoint responses. Pick one canonical phrase and use it everywhere. For model-specific cases, consider a helper like QuotaExceededResponse.model_limit(model) to format cause consistently.
  • Strip trailing whitespace at Line 1159 to fix linter.
-                response="Quota exceeded",
+                response="The quota has been exceeded",
src/app/endpoints/streaming_query.py (1)

930-937: Translate RateLimitError to HTTP 429 — consider message consistency and mid‑stream handling

  • The response uses “Model quota exceeded”; schemas and models use variants. Consider standardizing.
  • Optional: if RateLimitError can occur after streaming starts, catch it inside response_generator and emit an OLS “error” SSE to avoid abrupt stream termination.
tests/unit/app/endpoints/test_query.py (1)

2267-2308: Unify patch target with other tests; remove trailing whitespace

All other tests in the file patch client.AsyncLlamaStackClientHolder.get_client (lines 191, 1445, 1474, 1526, 1585), but this test uses app.endpoints.query.AsyncLlamaStackClientHolder.get_client. Update to match the consistent pattern. Trailing whitespace confirmed on line 2298 (blank line) and lines 2301–2302 (after commas).

Apply:

-    mocker.patch(
-        "app.endpoints.query.AsyncLlamaStackClientHolder.get_client",
-        return_value=mock_client
-    )
-    mocker.patch(
-        "app.endpoints.query.handle_mcp_headers_with_toolgroups",
-        return_value={}
-    )
-    
-    with pytest.raises(HTTPException) as exc_info:
-        await query_endpoint_handler(
-            dummy_request, 
-            query_request=query_request, 
-            auth=MOCK_AUTH
+    mocker.patch(
+        "client.AsyncLlamaStackClientHolder.get_client",
+        return_value=mock_client,
+    )
+    mocker.patch(
+        "app.endpoints.query.handle_mcp_headers_with_toolgroups",
+        return_value={},
+    )
+
+    with pytest.raises(HTTPException) as exc_info:
+        await query_endpoint_handler(
+            dummy_request,
+            query_request=query_request,
+            auth=MOCK_AUTH
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1a5255a and c9f1325.

📒 Files selected for processing (8)
  • docs/openapi.json (15 hunks)
  • src/app/endpoints/query.py (4 hunks)
  • src/app/endpoints/query_v2.py (2 hunks)
  • src/app/endpoints/streaming_query.py (4 hunks)
  • src/models/responses.py (1 hunks)
  • tests/unit/app/endpoints/test_query.py (2 hunks)
  • tests/unit/app/endpoints/test_query_v2.py (3 hunks)
  • tests/unit/app/endpoints/test_streaming_query.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
tests/unit/app/endpoints/test_streaming_query.py (4)
tests/unit/app/endpoints/test_query.py (1)
  • test_query_endpoint_quota_exceeded (2268-2308)
tests/unit/app/endpoints/test_query_v2.py (1)
  • test_query_endpoint_quota_exceeded (448-484)
src/models/requests.py (1)
  • QueryRequest (73-225)
src/app/endpoints/streaming_query.py (1)
  • streaming_query_endpoint_handler (705-952)
src/app/endpoints/query_v2.py (1)
src/models/responses.py (1)
  • QuotaExceededResponse (1145-1207)
src/app/endpoints/query.py (1)
src/models/responses.py (1)
  • QuotaExceededResponse (1145-1207)
tests/unit/app/endpoints/test_query.py (4)
tests/unit/app/endpoints/test_query_v2.py (2)
  • test_query_endpoint_quota_exceeded (448-484)
  • dummy_request (33-36)
tests/unit/app/endpoints/test_streaming_query.py (1)
  • test_query_endpoint_quota_exceeded (1800-1841)
src/models/requests.py (1)
  • QueryRequest (73-225)
src/app/endpoints/query.py (1)
  • query_endpoint_handler (442-464)
src/app/endpoints/streaming_query.py (1)
src/models/responses.py (1)
  • QuotaExceededResponse (1145-1207)
tests/unit/app/endpoints/test_query_v2.py (3)
tests/unit/app/endpoints/test_query.py (2)
  • test_query_endpoint_quota_exceeded (2268-2308)
  • dummy_request (60-69)
src/models/requests.py (1)
  • QueryRequest (73-225)
src/app/endpoints/query_v2.py (1)
  • query_endpoint_handler_v2 (284-306)
🪛 GitHub Actions: Black
tests/unit/app/endpoints/test_streaming_query.py

[error] Black formatting check failed. 7 files would be reformatted by 'black'. Step: uv tool run black --check .

src/app/endpoints/query_v2.py

[error] Black formatting check failed. 7 files would be reformatted by 'black'. Step: uv tool run black --check .

src/models/responses.py

[error] Black formatting check failed. 7 files would be reformatted by 'black'. Step: uv tool run black --check .

src/app/endpoints/query.py

[error] Black formatting check failed. 7 files would be reformatted by 'black'. Step: uv tool run black --check .

tests/unit/app/endpoints/test_query.py

[error] Black formatting check failed. 7 files would be reformatted by 'black'. Step: uv tool run black --check .

src/app/endpoints/streaming_query.py

[error] Black formatting check failed. 7 files would be reformatted by 'black'. Step: uv tool run black --check .

tests/unit/app/endpoints/test_query_v2.py

[error] Black formatting check failed. 7 files would be reformatted by 'black'. Step: uv tool run black --check .

🪛 GitHub Actions: Integration tests
src/app/endpoints/query.py

[error] 11-11: ModuleNotFoundError: No module named 'litellm'. The OpenAPI/test fixtures import litellm; ensure the dependency is installed (e.g., add to requirements and reinstall).

🪛 GitHub Actions: Python linter
tests/unit/app/endpoints/test_streaming_query.py

[error] 1831-1831: Trailing whitespace (C0303).


[error] 1834-1834: Trailing whitespace (C0303).


[error] 1835-1835: Trailing whitespace (C0303).

src/models/responses.py

[error] 1159-1159: Trailing whitespace (C0303) in module.

tests/unit/app/endpoints/test_query.py

[error] 2298-2298: Trailing whitespace (C0303).


[error] 2301-2301: Trailing whitespace (C0303).


[error] 2302-2302: Trailing whitespace (C0303).

tests/unit/app/endpoints/test_query_v2.py

[error] 474-474: Trailing whitespace (C0303).


[error] 477-477: Trailing whitespace (C0303).


[error] 478-478: Trailing whitespace (C0303).


[error] 11-11: Reimport 'MockerFixture' (reimported). (W0404)


[error] 11-11: Imports from package pytest_mock are not grouped (C0412).

🪛 GitHub Actions: Ruff
tests/unit/app/endpoints/test_query_v2.py

[error] 7-11: ruff check failed: F811 redefinition of unused 'MockerFixture'. The name 'MockerFixture' is defined previously in this file. Fixable with '--fix'. Command: 'uv tool run ruff check . --per-file-ignores=tests/:S101 --per-file-ignores=scripts/:S101'

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: e2e_tests (ci)
  • GitHub Check: e2e_tests (azure)
🔇 Additional comments (6)
tests/unit/app/endpoints/test_query.py (1)

13-13: Import OK

Using litellm.exceptions.RateLimitError in tests is appropriate for simulating provider rate limits.

tests/unit/app/endpoints/test_query_v2.py (2)

23-29: MOCK_AUTH constant: OK

Consistent with other tests; good reuse.


447-484: Revert patch target recommendation; current path is correct

The patch target in the test is already correct. Since app.endpoints.query imports the class directly (from client import AsyncLlamaStackClientHolder), you must patch it where it's used:

mocker.patch(
    "app.endpoints.query.AsyncLlamaStackClientHolder.get_client",
    return_value=mock_client
)

Patching client.AsyncLlamaStackClientHolder.get_client instead would not affect the imported reference in the query module and the mock would not be applied. Do not change the patch target.

Trim trailing whitespace on lines 474, 477, 478 as flagged by CI.

Likely an incorrect or invalid review comment.

tests/unit/app/endpoints/test_streaming_query.py (1)

9-9: Import OK

Using RateLimitError to simulate provider throttling is correct.

src/app/endpoints/streaming_query.py (1)

108-111: Expose 429 in OpenAPI — good

Adding 429 with QuotaExceededResponse improves docs and client generation.

src/app/endpoints/query_v2.py (1)

63-66: Inconsistent quota-exceeded wording across code, tests, and docs

Codebase inspection confirms the quota message inconsistencies:

From the grep output:

  • Tests consistently expect: "Model quota exceeded"
  • OpenAPI descriptions use: "The quota has been exceeded"
  • Schema examples show: "Quota exceeded" and "The quota has been exceeded"

The original review comment's concern about message alignment is valid. Note, however, that the OpenAPI description field is a human-readable explanation of the response aimed at developers and end-users; it serves a different purpose than the literal response body, so the two need not match verbatim.

That said, the inconsistency across multiple locations (tests, actual responses, schema examples, and descriptions) warrants developer attention to ensure clarity and maintainability.


429 response mapping added — align wording across code and tests

The OpenAPI 429 mapping is correct. However, there's inconsistency in quota error messages across the codebase: tests expect "Model quota exceeded", but OpenAPI description and schema examples use "The quota has been exceeded". Consider standardizing the message text across endpoints, tests, and examples to match the actual response field value returned to clients, ensuring clarity and consistency in error documentation.
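One way to act on the standardization suggestion above is to keep the canonical strings in a single module and reuse them from the endpoints, the schema examples, and the tests. This is a hypothetical sketch; the module and function names are not from the PR.

```python
# Hypothetical constants module for the quota error wording, so endpoints,
# OpenAPI examples, and tests all import one canonical phrase instead of
# re-typing variants ("Quota exceeded", "The quota has been exceeded", ...).

QUOTA_EXCEEDED_RESPONSE = "Model quota exceeded"


def quota_exceeded_cause(model_name: str) -> str:
    """Format the cause string for a model-level quota exceedance."""
    return f"The token quota for model {model_name} has been exceeded."
```

Tests would then assert against `QUOTA_EXCEEDED_RESPONSE` rather than a string literal, so any future rewording happens in exactly one place.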

@asimurka asimurka force-pushed the exceeded_quota_error_handling branch from c9f1325 to 8b933b4 Compare November 10, 2025 10:00

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/app/endpoints/streaming_query.py (1)

808-833: Handle RateLimitError inside the streaming generator

If RateLimitError is raised during iteration, it escapes after the 200 response has started, resulting in a broken stream instead of a structured quota error. Catch it inside the generator and emit an SSE error event, then end the stream.

Apply:

-            async for chunk in turn_response:
-                if chunk.event is None:
-                    continue
-                p = chunk.event.payload
-                if p.event_type == "turn_complete":
-                    summary.llm_response = interleaved_content_as_str(
-                        p.turn.output_message.content
-                    )
-                    latest_turn = p.turn
-                    system_prompt = get_system_prompt(query_request, configuration)
-                    try:
-                        update_llm_token_count_from_turn(
-                            p.turn, model_id, provider_id, system_prompt
-                        )
-                    except Exception:  # pylint: disable=broad-except
-                        logger.exception("Failed to update token usage metrics")
-                elif p.event_type == "step_complete":
-                    if p.step_details.step_type == "tool_execution":
-                        summary.append_tool_calls_from_llama(p.step_details)
-
-                for event in stream_build_event(
-                    chunk, chunk_id, metadata_map, media_type, conversation_id
-                ):
-                    chunk_id += 1
-                    yield event
+            try:
+                async for chunk in turn_response:
+                    if chunk.event is None:
+                        continue
+                    p = chunk.event.payload
+                    if p.event_type == "turn_complete":
+                        summary.llm_response = interleaved_content_as_str(
+                            p.turn.output_message.content
+                        )
+                        latest_turn = p.turn
+                        system_prompt = get_system_prompt(query_request, configuration)
+                        try:
+                            update_llm_token_count_from_turn(
+                                p.turn, model_id, provider_id, system_prompt
+                            )
+                        except Exception:  # pylint: disable=broad-except
+                            logger.exception("Failed to update token usage metrics")
+                    elif p.event_type == "step_complete":
+                        if p.step_details.step_type == "tool_execution":
+                            summary.append_tool_calls_from_llama(p.step_details)
+
+                    for event in stream_build_event(
+                        chunk, chunk_id, metadata_map, media_type, conversation_id
+                    ):
+                        chunk_id += 1
+                        yield event
+            except RateLimitError as e:
+                # Cannot change HTTP status mid-stream; emit SSE error and stop.
+                model_name = getattr(e, "model", "unknown")
+                logger.warning("Rate limited while streaming for model %s", model_name)
+                if media_type == MEDIA_TYPE_TEXT:
+                    yield f"Error: Model quota exceeded: The token quota for model {model_name} has been exceeded."
+                else:
+                    yield format_stream_data(
+                        {
+                            "event": "error",
+                            "data": {
+                                "status_code": 429,
+                                "response": "Model quota exceeded",
+                                "cause": f"The token quota for model {model_name} has been exceeded.",
+                            },
+                        }
+                    )
+                return
docs/openapi.json (1)

971-987: Type mismatch: last_message_timestamp example vs schema

ConversationData defines last_message_timestamp as number, but the example here uses an ISO string. Switch the example to a numeric epoch (e.g., 1704067200.0) or change the schema type.

-                                "last_message_timestamp": "2024-01-01T00:00:00Z"
+                                "last_message_timestamp": 1704067200.0
🧹 Nitpick comments (4)
src/app/endpoints/streaming_query.py (1)

930-937: Harden RateLimitError handling and align message

  • Use getattr for e.model to avoid attribute errors across versions.
  • Consider aligning the response text with schema/examples ("Model quota exceeded").
-    except RateLimitError as e:
-        raise HTTPException(
-            status_code=status.HTTP_429_TOO_MANY_REQUESTS,
-            detail={
-                "response": "Model quota exceeded",
-                "cause": f"The token quota for model {e.model} has been exceeded.",
-            },
-        ) from e
+    except RateLimitError as e:
+        model_name = getattr(e, "model", "unknown")
+        raise HTTPException(
+            status_code=status.HTTP_429_TOO_MANY_REQUESTS,
+            detail={
+                "response": "Model quota exceeded",
+                "cause": f"The token quota for model {model_name} has been exceeded.",
+            },
+        ) from e
docs/openapi.json (2)

474-483: v1 streaming_query: clarify 429 semantics

429 schema is correct for pre‑stream rejections. If rate limiting occurs mid‑stream, the server will emit an SSE "error" event with HTTP 200 (after adopting generator handling). Consider adding a short note to avoid client confusion.


3620-3676: Align QuotaExceededResponse examples with server wording

Examples use "The quota has been exceeded"/"The model quota has been exceeded" while code returns "Model quota exceeded". Unify to a single phrasing for clients.

-                            "response": "The quota has been exceeded"
+                            "response": "Quota exceeded"
@@
-                            "response": "The model quota has been exceeded"
+                            "response": "Model quota exceeded"
tests/unit/app/endpoints/test_query_v2.py (1)

22-28: Consider sharing MOCK_AUTH across test modules.

This constant is duplicated in tests/unit/app/endpoints/test_query.py (lines 51-56). Consider moving it to a shared test fixture or conftest to avoid duplication.
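The deduplication suggested above would typically live in a `conftest.py` next to the test modules. A minimal sketch, assuming the auth object is a simple tuple (the field values below are placeholders, not the real constant from the PR):

```python
# Hypothetical tests/unit/app/endpoints/conftest.py: hoist the duplicated
# MOCK_AUTH constant into a shared fixture so both test_query.py and
# test_query_v2.py stop redefining it.
import pytest

# Placeholder values; mirror whatever shape the real tests use.
MOCK_AUTH = ("mock_user_id", "mock_username", False, "mock_token")


@pytest.fixture
def mock_auth() -> tuple:
    """Shared auth tuple for endpoint tests."""
    return MOCK_AUTH
```

Tests can then take `mock_auth` as a parameter, and pytest resolves the fixture automatically from the shared `conftest.py`.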

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c9f1325 and 8b933b4.

📒 Files selected for processing (8)
  • docs/openapi.json (15 hunks)
  • src/app/endpoints/query.py (4 hunks)
  • src/app/endpoints/query_v2.py (2 hunks)
  • src/app/endpoints/streaming_query.py (4 hunks)
  • src/models/responses.py (1 hunks)
  • tests/unit/app/endpoints/test_query.py (2 hunks)
  • tests/unit/app/endpoints/test_query_v2.py (3 hunks)
  • tests/unit/app/endpoints/test_streaming_query.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/unit/app/endpoints/test_streaming_query.py
  • src/app/endpoints/query.py
  • src/app/endpoints/query_v2.py
🧰 Additional context used
🧬 Code graph analysis (3)
tests/unit/app/endpoints/test_query_v2.py (3)
tests/unit/app/endpoints/test_query.py (2)
  • test_query_endpoint_quota_exceeded (2268-2308)
  • dummy_request (60-69)
src/models/requests.py (1)
  • QueryRequest (73-225)
src/app/endpoints/query_v2.py (1)
  • query_endpoint_handler_v2 (284-306)
src/app/endpoints/streaming_query.py (1)
src/models/responses.py (3)
  • ForbiddenResponse (1120-1142)
  • UnauthorizedResponse (1094-1117)
  • QuotaExceededResponse (1145-1207)
tests/unit/app/endpoints/test_query.py (3)
tests/unit/app/endpoints/test_query_v2.py (2)
  • test_query_endpoint_quota_exceeded (447-483)
  • dummy_request (32-35)
tests/unit/app/endpoints/test_streaming_query.py (1)
  • test_query_endpoint_quota_exceeded (1800-1841)
src/app/endpoints/query.py (1)
  • query_endpoint_handler (442-464)
🔇 Additional comments (4)
src/app/endpoints/streaming_query.py (1)

108-111: Good: documented 429 with QuotaExceededResponse

429 mapping added to the OpenAPI responses for streaming. LGTM.

docs/openapi.json (2)

378-387: v1 query: 429 quota response added

The 429 path referencing QuotaExceededResponse is correct and matches the server behavior. LGTM.


1232-1317: Implementation verified and aligned with spec

All verification points confirmed:

  • Handler query_endpoint_handler_v2 exists and is routed via @router.post("/query", responses=query_v2_response) in src/app/endpoints/query_v2.py
  • QuotaExceededResponse is imported and wired to the 429 response code in the same file
  • Test coverage includes test_query_endpoint_quota_exceeded which validates the 429 status and error response body

The implementation mirrors v1 and docs are consistent with the codebase.

tests/unit/app/endpoints/test_query.py (1)

2267-2308: LGTM! Quota-exceeded test for v1 endpoint is well-structured.

The test correctly simulates a RateLimitError from the agent and verifies the HTTP 429 response with appropriate error details.

@asimurka asimurka force-pushed the exceeded_quota_error_handling branch 4 times, most recently from d4ed9b2 to adc06b4 Compare November 10, 2025 11:55

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (2)
src/models/responses.py (1)

1145-1211: Address wording inconsistency and unused constructor parameters.

The class has several design issues:

  1. Inconsistent response wording: The constructor uses "The quota has been exceeded" (line 1157), but endpoint code uses "Model quota exceeded" (streaming_query.py line 939), and examples include both variants (lines 1168, 1204).

  2. Unused constructor parameters: model_name and limit are marked as unused (lines 1151-1152), suggesting the constructor signature doesn't match the intended use cases.

  3. TODO indicates incomplete implementation: Line 1161 mentions factories are needed, confirming this is a work-in-progress.

Given the TODO comment and the fact that endpoints are manually constructing details rather than using this class, consider either:

Option 1: Update the constructor to support model-level quota errors:

     def __init__(
         self,
         user_id: str,
-        model_name: str,  # pylint: disable=unused-argument
-        limit: int,  # pylint: disable=unused-argument
+        model_name: str | None = None,
+        limit: int | None = None,
     ):
         """Initialize a QuotaExceededResponse."""
+        if model_name:
+            response_msg = "Model quota exceeded"
+            cause_msg = f"The token quota for model {model_name} has been exceeded."
+        else:
+            response_msg = "Quota exceeded"
+            cause_msg = f"User {user_id} has no available tokens."
         super().__init__(
             detail=DetailModel(
-                response="The quota has been exceeded",
-                cause=(f"User {user_id} has no available tokens."),
+                response=response_msg,
+                cause=cause_msg,
             )
         )

Option 2: Add a factory method as suggested in past reviews:

    @classmethod
    def for_model(cls, model_name: str) -> "QuotaExceededResponse":
        """Create a quota response for model-level quota exceedance."""
        obj = cls.__new__(cls)
        obj.detail = DetailModel(
            response="Model quota exceeded",
            cause=f"The token quota for model {model_name} has been exceeded.",
        )
        return obj

Then endpoints can use: raise HTTPException(status_code=429, detail=QuotaExceededResponse.for_model(used_model).dump_detail())

tests/unit/app/endpoints/test_query_v2.py (1)

446-479: Fix incomplete mock setup to match successful test pattern.

The test has several missing mocks compared to the successful test at lines 363-413, which will likely cause it to fail or not test the intended code path:

  1. Missing configuration mock (compare lines 368-371): The handler expects configuration to be loaded
  2. Missing quota function mocks (compare lines 397-399): Quota checks are performed before the RateLimitError is raised
  3. Missing request state (compare line 402): The handler accesses request.state.authorized_actions
  4. Missing database mock (compare line 394): The handler accesses the database session
  5. Missing other required mocks: Model hints, conversation ownership, etc.

Apply this diff to fix the mock setup:

 @pytest.mark.asyncio
 async def test_query_endpoint_quota_exceeded(
     mocker: MockerFixture, dummy_request: Request
 ) -> None:
     """Test that query endpoint raises HTTP 429 when model quota is exceeded."""
+    # Mock configuration
+    mock_config = mocker.Mock()
+    mock_config.llama_stack_configuration = mocker.Mock()
+    mock_config.quota_limiters = []
+    mocker.patch("app.endpoints.query_v2.configuration", mock_config)
+    
     query_request = QueryRequest(
         query="What is OpenStack?",
         provider="openai",
         model="gpt-4-turbo",
         attachments=[],
     )  # type: ignore
     mock_client = mocker.AsyncMock()
     mock_client.responses.create.side_effect = RateLimitError(
         model="gpt-4-turbo", llm_provider="openai", message=""
     )
+    mock_client.models.list = mocker.AsyncMock(return_value=[mocker.Mock()])
+    
+    mocker.patch("app.endpoints.query.evaluate_model_hints", return_value=(None, None))
     mocker.patch(
         "app.endpoints.query.select_model_and_provider_id",
         return_value=("openai/gpt-4-turbo", "gpt-4-turbo", "openai"),
     )
     mocker.patch("app.endpoints.query.validate_model_provider_override")
     mocker.patch(
         "client.AsyncLlamaStackClientHolder.get_client",
         return_value=mock_client,
     )
+    
+    # Mock quota functions
+    mocker.patch("utils.quota.check_tokens_available")
+    mocker.patch("utils.quota.consume_tokens")
+    mocker.patch("utils.quota.get_available_quotas", return_value={})
+    
+    # Mock database and other functions
+    mocker.patch("app.endpoints.query.get_session")
+    mocker.patch("app.endpoints.query.is_transcripts_enabled", return_value=False)
+    mocker.patch("app.endpoints.query.persist_user_conversation_details")
+    mocker.patch("utils.endpoints.store_conversation_into_cache")
+    
+    # Mock request state
+    dummy_request.state.authorized_actions = []

     with pytest.raises(HTTPException) as exc_info:
         await query_endpoint_handler_v2(
             dummy_request, query_request=query_request, auth=MOCK_AUTH
         )
     assert exc_info.value.status_code == status.HTTP_429_TOO_MANY_REQUESTS
     detail = exc_info.value.detail
     assert isinstance(detail, dict)
     assert detail["response"] == "Model quota exceeded"  # type: ignore
     assert "gpt-4-turbo" in detail["cause"]  # type: ignore
🧹 Nitpick comments (1)
src/app/endpoints/streaming_query.py (1)

934-942: Consider using QuotaExceededResponse for consistency.

The exception handler manually constructs the detail dictionary, but the codebase has introduced QuotaExceededResponse specifically for this purpose. While the current implementation works, using the response model would ensure consistency with the OpenAPI schema and make the response message match the documented examples.

Consider this alternative approach:

     except RateLimitError as e:
         used_model = getattr(e, "model", "unknown")
+        # Note: QuotaExceededResponse expects user_id but we're handling model quotas
+        # The manual detail construction is acceptable here for model-level quotas
         raise HTTPException(
             status_code=status.HTTP_429_TOO_MANY_REQUESTS,
             detail={
                 "response": "Model quota exceeded",
                 "cause": f"The token quota for model {used_model} has been exceeded.",
             },
         ) from e

Alternatively, if you want to use the response model consistently, you could extend QuotaExceededResponse with a factory method as mentioned in past review comments (see src/models/responses.py).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8b933b4 and adc06b4.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (9)
  • docs/openapi.json (15 hunks)
  • pyproject.toml (1 hunks)
  • src/app/endpoints/query.py (4 hunks)
  • src/app/endpoints/query_v2.py (2 hunks)
  • src/app/endpoints/streaming_query.py (4 hunks)
  • src/models/responses.py (1 hunks)
  • tests/unit/app/endpoints/test_query.py (2 hunks)
  • tests/unit/app/endpoints/test_query_v2.py (3 hunks)
  • tests/unit/app/endpoints/test_streaming_query.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/unit/app/endpoints/test_query.py
  • src/app/endpoints/query_v2.py
  • src/app/endpoints/query.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-18T10:58:14.951Z
Learnt from: matysek
Repo: lightspeed-core/lightspeed-stack PR: 292
File: pyproject.toml:47-47
Timestamp: 2025-08-18T10:58:14.951Z
Learning: psycopg2-binary is required by some llama-stack providers in the lightspeed-stack project, so it cannot be replaced with psycopg v3 or moved to optional dependencies without breaking llama-stack functionality.

Applied to files:

  • pyproject.toml
🧬 Code graph analysis (3)
src/app/endpoints/streaming_query.py (1)
src/models/responses.py (1)
  • QuotaExceededResponse (1145-1210)
tests/unit/app/endpoints/test_streaming_query.py (3)
tests/unit/app/endpoints/test_query.py (1)
  • test_query_endpoint_quota_exceeded (2268-2307)
tests/unit/app/endpoints/test_query_v2.py (1)
  • test_query_endpoint_quota_exceeded (447-479)
src/app/endpoints/streaming_query.py (1)
  • streaming_query_endpoint_handler (709-957)
tests/unit/app/endpoints/test_query_v2.py (3)
tests/unit/app/endpoints/test_query.py (2)
  • test_query_endpoint_quota_exceeded (2268-2307)
  • dummy_request (60-69)
src/models/requests.py (1)
  • QueryRequest (73-225)
src/app/endpoints/query_v2.py (1)
  • query_endpoint_handler_v2 (284-306)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-pr
  • GitHub Check: e2e_tests (azure)
  • GitHub Check: e2e_tests (ci)
🔇 Additional comments (11)
tests/unit/app/endpoints/test_streaming_query.py (2)

9-9: LGTM!

The import of RateLimitError from litellm.exceptions is correctly placed and necessary for the quota-exceeded test scenarios.


1799-1841: Test correctly matches implementation—no issues found.

The verification confirms the test structure and mocks accurately reflect the actual code path:

  • RateLimitError is properly caught at line 934 of streaming_query.py
  • The model is extracted via getattr(e, "model", "unknown") and will correctly retrieve "gpt-4-turbo" from the mock
  • The HTTPException detail structure matches the test assertions exactly
  • All assertions (status_code, detail["response"], and detail["cause"] containing "gpt-4-turbo") will pass correctly
src/app/endpoints/streaming_query.py (3)

11-11: LGTM!

The import of RateLimitError is correctly placed and necessary for quota exception handling.


52-56: LGTM!

The imports for response models are correctly structured and include the new QuotaExceededResponse for quota handling.


112-115: LGTM!

The 429 response mapping is correctly added with appropriate description and model reference.

tests/unit/app/endpoints/test_query_v2.py (2)

5-5: LGTM!

The import of RateLimitError from litellm.exceptions is correctly placed and necessary for the quota-exceeded test.


22-28: LGTM!

The MOCK_AUTH constant is well-defined and properly uses a valid UUID format for the user ID.

docs/openapi.json (4)

1232-1318: New v2/query endpoint documentation is complete and properly structured.

The /v2/query endpoint is well-documented with operation ID, proper request/response schemas, comprehensive error responses (400/403/429/500/422), and example payloads. The endpoint mirrors v1/query with the 429 quota-exceeded response properly integrated.


693-722: Added authentication error responses (400/401/403) to PUT /v1/feedback/status.

The PUT endpoint now documents 400 (missing/invalid credentials), 401 (invalid bearer token), and 403 (permission denied) error responses. Verify these error cases are actually caught and handled in the implementation, or clarify why feedback status updates require these authentication checks when the GET endpoint (line 651) does not.


1576-1643: Consolidated authentication/authorization configuration schemas verified as correctly implemented.

Python model definitions in src/models/config.py use the consolidated schema names (AuthenticationConfiguration, AuthorizationConfiguration, JwtConfiguration, JwkConfiguration) without any lingering -Input or -Output variants. All references throughout the codebase consistently use these consolidated names. The OpenAPI schema and Python models are aligned.


378-387: Quota-exceeded error handling is properly implemented across all three query endpoints.

Verification confirms:

  • v1/query (query.py): Directly catches RateLimitError and raises HTTP 429 ✓
  • v1/streaming_query (streaming_query.py): Directly catches RateLimitError and raises HTTP 429 ✓
  • v2/query (query_v2.py): Delegates to query_endpoint_handler_base (query.py), which catches RateLimitError and raises HTTP 429 ✓

All three endpoints consistently map quota errors to HTTP 429 responses with proper QuotaExceededResponse schema, matching the OpenAPI documentation. Unit tests exist for all three endpoints validating the 429 response.

@asimurka asimurka force-pushed the exceeded_quota_error_handling branch 2 times, most recently from 258c023 to ba36edc on November 10, 2025 at 12:14
@asimurka asimurka force-pushed the exceeded_quota_error_handling branch from ba36edc to 3ed6095 on November 11, 2025 at 08:12
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
tests/unit/app/endpoints/test_query_v2.py (1)

446-479: Fix v2 quota-exceeded test: missing patches and arg cause false failures

The test omits required setup used by the passing v2 test (lines 368-405): configuration, quotas, models.list, evaluate_model_hints, request.state.authorized_actions, and get_session. It also doesn’t pass mcp_headers, so Depends won’t resolve in direct calls.

Apply this diff:

 @pytest.mark.asyncio
 async def test_query_endpoint_quota_exceeded(
     mocker: MockerFixture, dummy_request: Request
 ) -> None:
     """Test that query endpoint raises HTTP 429 when model quota is exceeded."""
+    # Mock configuration (avoid "configuration not loaded")
+    mock_config = mocker.Mock()
+    mock_config.llama_stack_configuration = mocker.Mock()
+    mock_config.quota_limiters = []
+    mocker.patch("app.endpoints.query_v2.configuration", mock_config)
+
     query_request = QueryRequest(
         query="What is OpenStack?",
         provider="openai",
         model="gpt-4-turbo",
-        attachments=[],
     )  # type: ignore
     mock_client = mocker.AsyncMock()
+    # models.list is awaited by the handler before select_model_and_provider_id
+    mock_client.models.list = mocker.AsyncMock(return_value=[mocker.Mock()])
     mock_client.responses.create.side_effect = RateLimitError(
         model="gpt-4-turbo", llm_provider="openai", message=""
     )
+    # Keep model/provider selection deterministic
+    mocker.patch("app.endpoints.query.evaluate_model_hints", return_value=(None, None))
     mocker.patch(
         "app.endpoints.query.select_model_and_provider_id",
         return_value=("openai/gpt-4-turbo", "gpt-4-turbo", "openai"),
     )
     mocker.patch("app.endpoints.query.validate_model_provider_override")
     mocker.patch(
         "client.AsyncLlamaStackClientHolder.get_client",
         return_value=mock_client,
     )
+    # Quota and DB dependencies used by the base handler
+    mocker.patch("utils.quota.check_tokens_available")
+    mocker.patch("utils.quota.consume_tokens")
+    mocker.patch("utils.quota.get_available_quotas", return_value={})
+    mocker.patch("app.endpoints.query.get_session")
+
+    # Request state required by validate_model_provider_override
+    dummy_request.state.authorized_actions = []
 
     with pytest.raises(HTTPException) as exc_info:
         await query_endpoint_handler_v2(
-            dummy_request, query_request=query_request, auth=MOCK_AUTH
+            dummy_request,
+            query_request=query_request,
+            auth=MOCK_AUTH,
+            mcp_headers={},  # bypass Depends in direct call
         )
     assert exc_info.value.status_code == status.HTTP_429_TOO_MANY_REQUESTS
     detail = exc_info.value.detail
     assert isinstance(detail, dict)
     assert detail["response"] == "Model quota exceeded"  # type: ignore
     assert "gpt-4-turbo" in detail["cause"]  # type: ignore
🧹 Nitpick comments (4)
tests/unit/app/endpoints/test_query_v2.py (1)

22-28: Local MOCK_AUTH is fine; consider a shared fixture to avoid duplication

If multiple test modules define MOCK_AUTH, move it to a conftest.py fixture/constants module.

src/app/endpoints/streaming_query.py (2)

112-115: 429 response wired into OpenAPI

Good addition. Consider aligning the “response” message with runtime text (“Model quota exceeded” vs “The quota has been exceeded”) for consistency across endpoints and docs.


934-942: Improve 429 handling: add metric and optional Retry-After header

Increment failure metric on rate limits and include Retry-After when available from the exception. Keep payload unchanged to satisfy existing tests.

-    except RateLimitError as e:
-        used_model = getattr(e, "model", "unknown")
-        raise HTTPException(
-            status_code=status.HTTP_429_TOO_MANY_REQUESTS,
-            detail={
-                "response": "Model quota exceeded",
-                "cause": f"The token quota for model {used_model} has been exceeded.",
-            },
-        ) from e
+    except RateLimitError as e:
+        used_model = getattr(e, "model", "unknown")
+        # Record as failure for observability
+        metrics.llm_calls_failures_total.inc()
+        # Some providers expose retry interval; attach when present
+        retry_after = getattr(e, "retry_after", None)
+        headers = {"Retry-After": str(retry_after)} if retry_after else None
+        raise HTTPException(
+            status_code=status.HTTP_429_TOO_MANY_REQUESTS,
+            detail={
+                "response": "Model quota exceeded",
+                "cause": f"The token quota for model {used_model} has been exceeded.",
+            },
+            headers=headers,
+        ) from e
docs/openapi.json (1)

3610-3666: QuotaExceededResponse schema is clear; unify example phrasing

Examples mix “The quota has been exceeded” and “The model quota has been exceeded”. Pick one canonical phrasing or document both intentionally, then align handlers/tests.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between adc06b4 and 3ed6095.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (9)
  • docs/openapi.json (4 hunks)
  • pyproject.toml (1 hunks)
  • src/app/endpoints/query.py (4 hunks)
  • src/app/endpoints/query_v2.py (2 hunks)
  • src/app/endpoints/streaming_query.py (4 hunks)
  • src/models/responses.py (1 hunks)
  • tests/unit/app/endpoints/test_query.py (2 hunks)
  • tests/unit/app/endpoints/test_query_v2.py (3 hunks)
  • tests/unit/app/endpoints/test_streaming_query.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
  • tests/unit/app/endpoints/test_query.py
  • src/app/endpoints/query.py
  • src/models/responses.py
  • tests/unit/app/endpoints/test_streaming_query.py
  • src/app/endpoints/query_v2.py
  • pyproject.toml
🧰 Additional context used
🧬 Code graph analysis (2)
src/app/endpoints/streaming_query.py (1)
src/models/responses.py (1)
  • QuotaExceededResponse (1145-1210)
tests/unit/app/endpoints/test_query_v2.py (4)
tests/unit/app/endpoints/test_query.py (2)
  • test_query_endpoint_quota_exceeded (2268-2307)
  • dummy_request (60-69)
tests/unit/app/endpoints/test_streaming_query.py (1)
  • test_query_endpoint_quota_exceeded (1800-1839)
src/models/requests.py (1)
  • QueryRequest (73-225)
src/app/endpoints/query_v2.py (1)
  • query_endpoint_handler_v2 (284-306)
🔇 Additional comments (3)
tests/unit/app/endpoints/test_query_v2.py (1)

5-5: Import looks good

Using litellm.exceptions.RateLimitError for simulation is correct.

src/app/endpoints/streaming_query.py (2)

11-11: LGTM on importing RateLimitError

Matches the new 429 mapping below.


52-56: OpenAPI model import OK

Importing QuotaExceededResponse for responses mapping is correct.

Comment on lines +378 to +387
"429": {
    "description": "The quota has been exceeded",
    "content": {
        "application/json": {
            "schema": {
                "$ref": "#/components/schemas/QuotaExceededResponse"
            }
        }
    }
},
Contributor

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

429 responses added across endpoints

Good. Ensure the documented “description”/examples match the actual response.detail.response used at runtime (“Model quota exceeded” in code/tests vs “The quota has been exceeded” here).

Run this to check for message mismatches:

Also applies to: 474-483, 1279-1288


🏁 Script executed:

#!/bin/bash
# Find runtime 429 messages vs docs wording
rg -n -C1 -S "Model quota exceeded|The quota has been exceeded" --type py --type json

Length of output: 6321


429 response messages in docs/openapi.json do not match actual runtime behavior

The OpenAPI documentation documents "The quota has been exceeded" but the actual endpoints return "Model quota exceeded". Update the description and examples in docs/openapi.json at lines 378-387, 474-483, and 1279-1288 to use "Model quota exceeded" to match the actual response sent by src/app/endpoints/streaming_query.py:939 and src/app/endpoints/query.py:435, as confirmed by tests.

🤖 Prompt for AI Agents
In docs/openapi.json around lines 378-387, 474-483, and 1279-1288, the 429
response descriptions and examples currently read "The quota has been exceeded"
but must match runtime behavior which returns "Model quota exceeded"; update the
description fields and any example values at those line ranges to use the exact
text "Model quota exceeded" (and adjust any surrounding punctuation/formatting
to remain valid JSON), ensuring the schema references remain unchanged.
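The replacement described above can be sketched as a one-off script; the recursive traversal and the exact OLD/NEW strings are assumptions about how the `description` fields in docs/openapi.json are laid out:

```python
import json

# Outdated wording documented in the OpenAPI file vs. the message the
# endpoints actually return (per src/app/endpoints/query.py and tests).
OLD = "The quota has been exceeded"
NEW = "Model quota exceeded"


def fix_429_descriptions(node) -> int:
    """Recursively replace OLD with NEW in 'description' fields; return count."""
    changed = 0
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "description" and value == OLD:
                node[key] = NEW
                changed += 1
            else:
                changed += fix_429_descriptions(value)
    elif isinstance(node, list):
        for item in node:
            changed += fix_429_descriptions(item)
    return changed


# Usage against the real file would be:
#   doc = json.load(open("docs/openapi.json"))
#   fix_429_descriptions(doc)
#   json.dump(doc, open("docs/openapi.json", "w"), indent=4)
sample = {"responses": {"429": {"description": OLD}}}
print(fix_429_descriptions(sample))  # → 1
```

Whether to keep "The quota has been exceeded" for user-level quotas and "Model quota exceeded" for model-level quotas, or to collapse to one phrasing, is the consistency question raised in the nitpick on the QuotaExceededResponse schema.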

Contributor

@tisnik tisnik left a comment

LGTM, sounds ok

@tisnik
Contributor

tisnik commented Nov 11, 2025

/ok-to-test

@tisnik
Contributor

tisnik commented Nov 11, 2025

/ok-to-test
