
Conversation


@asimurka asimurka commented Nov 10, 2025

Description

This PR adds exception handling for LLM quota limit exceedances and introduces a unified custom response model for all quota-related errors, including model-level and internal quota exceedances.

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change
  • Unit tests improvement
  • Integration tests improvement
  • End to end tests improvement

Related Tickets & Documents

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • New Features

    • Query endpoints (v1, v2, streaming) now return standardized quota-exceeded errors (HTTP 429) with structured detail and examples; added public QuotaExceededResponse.
  • Documentation

    • OpenAPI updated to document 429 quota responses and include example quota-failure scenarios across relevant endpoints.
  • Tests

    • Added unit tests confirming 429 quota-exceeded behavior for query and streaming endpoints.


coderabbitai bot commented Nov 10, 2025

Walkthrough

Adds a new public QuotaExceededResponse schema and wires HTTP 429 quota-exceeded handling into v1/v2 query and streaming endpoints by catching RateLimitError and returning QuotaExceededResponse; updates OpenAPI docs and tests; moves litellm to top-level dependencies.
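The error translation described in the walkthrough can be sketched as follows. This is a hedged illustration, not the PR's actual code: `RateLimitError` here is a local stand-in for `litellm.exceptions.RateLimitError`, and the real endpoints raise FastAPI's `HTTPException` with this payload rather than printing it.

```python
# Minimal sketch of the RateLimitError -> HTTP 429 translation wired into the
# query endpoints. The stand-in exception mirrors the attributes the handlers
# read (model, llm_provider); the payload shape follows the PR's tests.

class RateLimitError(Exception):
    """Stand-in for litellm.exceptions.RateLimitError."""

    def __init__(self, model: str, llm_provider: str, message: str = "") -> None:
        super().__init__(message)
        self.model = model
        self.llm_provider = llm_provider


def quota_exceeded_detail(exc: RateLimitError) -> dict:
    """Build the 429 detail payload the endpoints return."""
    # getattr guards against provider exceptions that lack a .model attribute.
    model_name = getattr(exc, "model", "unknown")
    return {
        "response": "Model quota exceeded",
        "cause": f"The token quota for model {model_name} has been exceeded.",
    }


try:
    raise RateLimitError(model="gpt-4-turbo", llm_provider="openai")
except RateLimitError as e:
    detail = quota_exceeded_detail(e)
    # A real handler would: raise HTTPException(status_code=429, detail=detail)
    print(detail["cause"])
```

In the actual endpoints this `except` block wraps the model invocation, so a provider rate limit surfaces to clients as a structured 429 instead of a 500.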

Changes

Cohort / File(s) Change Summary
OpenAPI Schema & Docs
docs/openapi.json
Added public QuotaExceededResponse schema (with examples) and added 429 responses referencing it to /v1/query, /v1/streaming_query, and /v2/query; updated response mappings accordingly.
Query Endpoints
src/app/endpoints/query.py, src/app/endpoints/query_v2.py, src/app/endpoints/streaming_query.py
Imported RateLimitError and QuotaExceededResponse; added 429 entries to endpoint response definitions; catch RateLimitError in handlers/generators and translate to HTTP 429 with a QuotaExceededResponse payload.
Response Models
src/models/responses.py
Added QuotaExceededResponse(AbstractErrorResponse) with constructor def __init__(self, user_id: str, model_name: str, limit: int) and model_config examples documenting quota-exceeded scenarios.
Unit Tests
tests/unit/app/endpoints/test_query.py, tests/unit/app/endpoints/test_query_v2.py, tests/unit/app/endpoints/test_streaming_query.py
Added tests test_query_endpoint_quota_exceeded that mock RateLimitError to assert HTTP 429 and quota detail/cause; added RateLimitError imports. Note: test_query_v2.py contains a duplicated test and redundant import.
Project Config
pyproject.toml
Moved litellm>=1.75.5.post1 from the llslibdev group to top-level dependencies.

Sequence Diagram(s)

sequenceDiagram
    participant C as Client
    participant E as Query Endpoint (v1/v2/streaming)
    participant A as Handler/Agent
    participant L as LLM Provider
    participant H as Error Handler

    C->>E: POST /v1/query or /v2/query or start streaming
    E->>A: process request / create turn
    A->>L: invoke model
    alt RateLimitError raised
        L-->>A: raises RateLimitError
        A-->>E: propagate RateLimitError
        E->>H: catch RateLimitError
        H-->>E: build QuotaExceededResponse(user_id, model_name, limit)
        E-->>C: HTTP 429 {"detail": "...", "cause": "...model_name..."}
    else Success
        L-->>A: model response
        A-->>E: normal response
        E-->>C: HTTP 200 / stream
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Extra attention:
    • src/app/endpoints/query.py — ensure both catch sites consistently translate RateLimitError to 429 and payload shape matches QuotaExceededResponse.
    • src/app/endpoints/streaming_query.py — verify generator-level exception handling and streaming teardown on 429.
    • docs/openapi.json — confirm schema references and examples are valid JSON Schema and referenced paths.
    • tests/unit/app/endpoints/test_query_v2.py — remove duplicated test and redundant imports; ensure mocks produce the same QuotaExceededResponse shape.

Possibly related PRs

Suggested labels

ok-to-test

Poem

🐰 I hopped into code at break of day,
Found tokens trimmed and models kept at bay,
I caught RateLimit with a gentle bite,
Returned a friendly 429 light,
Now quotas sleep beneath moonlight.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title directly and concisely summarizes the main change: adding model quota limit exception handling across the API endpoints.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.



openshift-ci bot commented Nov 10, 2025

Hi @asimurka. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@asimurka asimurka force-pushed the exceeded_quota_error_handling branch from 2b65716 to c9f1325 Compare November 10, 2025 09:46

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (3)
src/models/responses.py (1)

1145-1207: New QuotaExceededResponse: unify message text; remove trailing whitespace

  • Unify the detail.response wording with your examples and endpoint responses. Pick one canonical phrase and use it everywhere. For model-specific cases, consider a helper like QuotaExceededResponse.model_limit(model) to format cause consistently.
  • Strip trailing whitespace at Line 1159 to fix linter.
-                response="Quota exceeded",
+                response="The quota has been exceeded",
src/app/endpoints/streaming_query.py (1)

930-937: Translate RateLimitError to HTTP 429 — consider message consistency and mid‑stream handling

  • The response uses “Model quota exceeded”; schemas and models use variants. Consider standardizing.
  • Optional: if RateLimitError can occur after streaming starts, catch it inside response_generator and emit an OLS “error” SSE to avoid abrupt stream termination.
tests/unit/app/endpoints/test_query.py (1)

2267-2308: Unify patch target with other tests; remove trailing whitespace

All other tests in the file patch client.AsyncLlamaStackClientHolder.get_client (lines 191, 1445, 1474, 1526, 1585), but this test uses app.endpoints.query.AsyncLlamaStackClientHolder.get_client. Update to match the consistent pattern. Trailing whitespace confirmed on line 2298 (blank line) and lines 2301–2302 (after commas).

Apply:

-    mocker.patch(
-        "app.endpoints.query.AsyncLlamaStackClientHolder.get_client",
-        return_value=mock_client
-    )
-    mocker.patch(
-        "app.endpoints.query.handle_mcp_headers_with_toolgroups",
-        return_value={}
-    )
-    
-    with pytest.raises(HTTPException) as exc_info:
-        await query_endpoint_handler(
-            dummy_request, 
-            query_request=query_request, 
-            auth=MOCK_AUTH
+    mocker.patch(
+        "client.AsyncLlamaStackClientHolder.get_client",
+        return_value=mock_client,
+    )
+    mocker.patch(
+        "app.endpoints.query.handle_mcp_headers_with_toolgroups",
+        return_value={},
+    )
+
+    with pytest.raises(HTTPException) as exc_info:
+        await query_endpoint_handler(
+            dummy_request,
+            query_request=query_request,
+            auth=MOCK_AUTH
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1a5255a and c9f1325.

📒 Files selected for processing (8)
  • docs/openapi.json (15 hunks)
  • src/app/endpoints/query.py (4 hunks)
  • src/app/endpoints/query_v2.py (2 hunks)
  • src/app/endpoints/streaming_query.py (4 hunks)
  • src/models/responses.py (1 hunks)
  • tests/unit/app/endpoints/test_query.py (2 hunks)
  • tests/unit/app/endpoints/test_query_v2.py (3 hunks)
  • tests/unit/app/endpoints/test_streaming_query.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
tests/unit/app/endpoints/test_streaming_query.py (4)
tests/unit/app/endpoints/test_query.py (1)
  • test_query_endpoint_quota_exceeded (2268-2308)
tests/unit/app/endpoints/test_query_v2.py (1)
  • test_query_endpoint_quota_exceeded (448-484)
src/models/requests.py (1)
  • QueryRequest (73-225)
src/app/endpoints/streaming_query.py (1)
  • streaming_query_endpoint_handler (705-952)
src/app/endpoints/query_v2.py (1)
src/models/responses.py (1)
  • QuotaExceededResponse (1145-1207)
src/app/endpoints/query.py (1)
src/models/responses.py (1)
  • QuotaExceededResponse (1145-1207)
tests/unit/app/endpoints/test_query.py (4)
tests/unit/app/endpoints/test_query_v2.py (2)
  • test_query_endpoint_quota_exceeded (448-484)
  • dummy_request (33-36)
tests/unit/app/endpoints/test_streaming_query.py (1)
  • test_query_endpoint_quota_exceeded (1800-1841)
src/models/requests.py (1)
  • QueryRequest (73-225)
src/app/endpoints/query.py (1)
  • query_endpoint_handler (442-464)
src/app/endpoints/streaming_query.py (1)
src/models/responses.py (1)
  • QuotaExceededResponse (1145-1207)
tests/unit/app/endpoints/test_query_v2.py (3)
tests/unit/app/endpoints/test_query.py (2)
  • test_query_endpoint_quota_exceeded (2268-2308)
  • dummy_request (60-69)
src/models/requests.py (1)
  • QueryRequest (73-225)
src/app/endpoints/query_v2.py (1)
  • query_endpoint_handler_v2 (284-306)
🪛 GitHub Actions: Black
tests/unit/app/endpoints/test_streaming_query.py

[error] Black formatting check failed. 7 files would be reformatted by 'black'. Step: uv tool run black --check .

src/app/endpoints/query_v2.py

[error] Black formatting check failed. 7 files would be reformatted by 'black'. Step: uv tool run black --check .

src/models/responses.py

[error] Black formatting check failed. 7 files would be reformatted by 'black'. Step: uv tool run black --check .

src/app/endpoints/query.py

[error] Black formatting check failed. 7 files would be reformatted by 'black'. Step: uv tool run black --check .

tests/unit/app/endpoints/test_query.py

[error] Black formatting check failed. 7 files would be reformatted by 'black'. Step: uv tool run black --check .

src/app/endpoints/streaming_query.py

[error] Black formatting check failed. 7 files would be reformatted by 'black'. Step: uv tool run black --check .

tests/unit/app/endpoints/test_query_v2.py

[error] Black formatting check failed. 7 files would be reformatted by 'black'. Step: uv tool run black --check .

🪛 GitHub Actions: Integration tests
src/app/endpoints/query.py

[error] 11-11: ModuleNotFoundError: No module named 'litellm'. The OpenAPI/test fixtures import litellm; ensure the dependency is installed (e.g., add to requirements and reinstall).

🪛 GitHub Actions: Python linter
tests/unit/app/endpoints/test_streaming_query.py

[error] 1831-1831: Trailing whitespace (C0303).


[error] 1834-1834: Trailing whitespace (C0303).


[error] 1835-1835: Trailing whitespace (C0303).

src/models/responses.py

[error] 1159-1159: Trailing whitespace (C0303) in module.

tests/unit/app/endpoints/test_query.py

[error] 2298-2298: Trailing whitespace (C0303).


[error] 2301-2301: Trailing whitespace (C0303).


[error] 2302-2302: Trailing whitespace (C0303).

tests/unit/app/endpoints/test_query_v2.py

[error] 474-474: Trailing whitespace (C0303).


[error] 477-477: Trailing whitespace (C0303).


[error] 478-478: Trailing whitespace (C0303).


[error] 11-11: Reimport 'MockerFixture' (reimported). (W0404)


[error] 11-11: Imports from package pytest_mock are not grouped (C0412).

🪛 GitHub Actions: Ruff
tests/unit/app/endpoints/test_query_v2.py

[error] 7-11: ruff check failed: F811 redefinition of unused 'MockerFixture'. The name 'MockerFixture' is defined previously in this file. Fixable with '--fix'. Command: 'uv tool run ruff check . --per-file-ignores=tests/:S101 --per-file-ignores=scripts/:S101'

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: e2e_tests (ci)
  • GitHub Check: e2e_tests (azure)
🔇 Additional comments (6)
tests/unit/app/endpoints/test_query.py (1)

13-13: Import OK

Using litellm.exceptions.RateLimitError in tests is appropriate for simulating provider rate limits.

tests/unit/app/endpoints/test_query_v2.py (2)

23-29: MOCK_AUTH constant: OK

Consistent with other tests; good reuse.


447-484: Revert patch target recommendation; current path is correct

The patch target in the test is already correct. Since app.endpoints.query imports the class directly (from client import AsyncLlamaStackClientHolder), you must patch it where it's used:

mocker.patch(
    "app.endpoints.query.AsyncLlamaStackClientHolder.get_client",
    return_value=mock_client
)

Patching client.AsyncLlamaStackClientHolder.get_client instead would not affect the imported reference in the query module and the mock would not be applied. Do not change the patch target.

Trim trailing whitespace on lines 474, 477, 478 as flagged by CI.

Likely an incorrect or invalid review comment.

tests/unit/app/endpoints/test_streaming_query.py (1)

9-9: Import OK

Using RateLimitError to simulate provider throttling is correct.

src/app/endpoints/streaming_query.py (1)

108-111: Expose 429 in OpenAPI — good

Adding 429 with QuotaExceededResponse improves docs and client generation.

src/app/endpoints/query_v2.py (1)

63-66: Inconsistent quota-exceeded wording across code, tests, and docs

Codebase inspection confirms the quota message inconsistencies:

From the grep output:

  • Tests consistently expect: "Model quota exceeded"
  • OpenAPI descriptions use: "The quota has been exceeded"
  • Schema examples show: "Quota exceeded" and "The quota has been exceeded"

The original review comment's concern about message alignment is valid. Note, however, that the OpenAPI description field is a human-readable explanation of the response aimed at developers and end-users; it serves a different purpose than the literal response body, so the two need not match verbatim.

That said, the inconsistency across multiple locations (tests, actual responses, schema examples, and descriptions) warrants developer attention to ensure clarity and maintainability.


429 response mapping added — align wording across code and tests

The OpenAPI 429 mapping is correct. However, there's inconsistency in quota error messages across the codebase: tests expect "Model quota exceeded", but OpenAPI description and schema examples use "The quota has been exceeded". Consider standardizing the message text across endpoints, tests, and examples to match the actual response field value returned to clients, ensuring clarity and consistency in error documentation.
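One way to act on the standardization suggestion above is to keep the canonical strings in a single module and reuse them from the endpoints, the schema examples, and the tests. This is a hypothetical sketch; the module and function names are not from the PR.

```python
# Hypothetical constants module for the quota error wording, so endpoints,
# OpenAPI examples, and tests all import one canonical phrase instead of
# re-typing variants ("Quota exceeded", "The quota has been exceeded", ...).

QUOTA_EXCEEDED_RESPONSE = "Model quota exceeded"


def quota_exceeded_cause(model_name: str) -> str:
    """Format the cause string for a model-level quota exceedance."""
    return f"The token quota for model {model_name} has been exceeded."
```

Tests would then assert against `QUOTA_EXCEEDED_RESPONSE` rather than a string literal, so any future rewording happens in exactly one place.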

@asimurka asimurka force-pushed the exceeded_quota_error_handling branch from c9f1325 to 8b933b4 Compare November 10, 2025 10:00

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/app/endpoints/streaming_query.py (1)

808-833: Handle RateLimitError inside the streaming generator

If RateLimitError is raised during iteration, it escapes after the 200 response has started, resulting in a broken stream instead of a structured quota error. Catch it inside the generator and emit an SSE error event, then end the stream.

Apply:

-            async for chunk in turn_response:
-                if chunk.event is None:
-                    continue
-                p = chunk.event.payload
-                if p.event_type == "turn_complete":
-                    summary.llm_response = interleaved_content_as_str(
-                        p.turn.output_message.content
-                    )
-                    latest_turn = p.turn
-                    system_prompt = get_system_prompt(query_request, configuration)
-                    try:
-                        update_llm_token_count_from_turn(
-                            p.turn, model_id, provider_id, system_prompt
-                        )
-                    except Exception:  # pylint: disable=broad-except
-                        logger.exception("Failed to update token usage metrics")
-                elif p.event_type == "step_complete":
-                    if p.step_details.step_type == "tool_execution":
-                        summary.append_tool_calls_from_llama(p.step_details)
-
-                for event in stream_build_event(
-                    chunk, chunk_id, metadata_map, media_type, conversation_id
-                ):
-                    chunk_id += 1
-                    yield event
+            try:
+                async for chunk in turn_response:
+                    if chunk.event is None:
+                        continue
+                    p = chunk.event.payload
+                    if p.event_type == "turn_complete":
+                        summary.llm_response = interleaved_content_as_str(
+                            p.turn.output_message.content
+                        )
+                        latest_turn = p.turn
+                        system_prompt = get_system_prompt(query_request, configuration)
+                        try:
+                            update_llm_token_count_from_turn(
+                                p.turn, model_id, provider_id, system_prompt
+                            )
+                        except Exception:  # pylint: disable=broad-except
+                            logger.exception("Failed to update token usage metrics")
+                    elif p.event_type == "step_complete":
+                        if p.step_details.step_type == "tool_execution":
+                            summary.append_tool_calls_from_llama(p.step_details)
+
+                    for event in stream_build_event(
+                        chunk, chunk_id, metadata_map, media_type, conversation_id
+                    ):
+                        chunk_id += 1
+                        yield event
+            except RateLimitError as e:
+                # Cannot change HTTP status mid-stream; emit SSE error and stop.
+                model_name = getattr(e, "model", "unknown")
+                logger.warning("Rate limited while streaming for model %s", model_name)
+                if media_type == MEDIA_TYPE_TEXT:
+                    yield f"Error: Model quota exceeded: The token quota for model {model_name} has been exceeded."
+                else:
+                    yield format_stream_data(
+                        {
+                            "event": "error",
+                            "data": {
+                                "status_code": 429,
+                                "response": "Model quota exceeded",
+                                "cause": f"The token quota for model {model_name} has been exceeded.",
+                            },
+                        }
+                    )
+                return
docs/openapi.json (1)

971-987: Type mismatch: last_message_timestamp example vs schema

ConversationData defines last_message_timestamp as number, but the example here uses an ISO string. Switch the example to a numeric epoch (e.g., 1704067200.0) or change the schema type.

-                                "last_message_timestamp": "2024-01-01T00:00:00Z"
+                                "last_message_timestamp": 1704067200.0
🧹 Nitpick comments (4)
src/app/endpoints/streaming_query.py (1)

930-937: Harden RateLimitError handling and align message

  • Use getattr for e.model to avoid attribute errors across versions.
  • Consider aligning the response text with schema/examples ("Model quota exceeded").
-    except RateLimitError as e:
-        raise HTTPException(
-            status_code=status.HTTP_429_TOO_MANY_REQUESTS,
-            detail={
-                "response": "Model quota exceeded",
-                "cause": f"The token quota for model {e.model} has been exceeded.",
-            },
-        ) from e
+    except RateLimitError as e:
+        model_name = getattr(e, "model", "unknown")
+        raise HTTPException(
+            status_code=status.HTTP_429_TOO_MANY_REQUESTS,
+            detail={
+                "response": "Model quota exceeded",
+                "cause": f"The token quota for model {model_name} has been exceeded.",
+            },
+        ) from e
docs/openapi.json (2)

474-483: v1 streaming_query: clarify 429 semantics

429 schema is correct for pre‑stream rejections. If rate limiting occurs mid‑stream, the server will emit an SSE "error" event with HTTP 200 (after adopting generator handling). Consider adding a short note to avoid client confusion.


3620-3676: Align QuotaExceededResponse examples with server wording

Examples use "The quota has been exceeded"/"The model quota has been exceeded" while code returns "Model quota exceeded". Unify to a single phrasing for clients.

-                            "response": "The quota has been exceeded"
+                            "response": "Quota exceeded"
@@
-                            "response": "The model quota has been exceeded"
+                            "response": "Model quota exceeded"
tests/unit/app/endpoints/test_query_v2.py (1)

22-28: Consider sharing MOCK_AUTH across test modules.

This constant is duplicated in tests/unit/app/endpoints/test_query.py (lines 51-56). Consider moving it to a shared test fixture or conftest to avoid duplication.
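The deduplication suggested above would typically live in a `conftest.py` next to the test modules. A minimal sketch, assuming the auth object is a simple tuple (the field values below are placeholders, not the real constant from the PR):

```python
# Hypothetical tests/unit/app/endpoints/conftest.py: hoist the duplicated
# MOCK_AUTH constant into a shared fixture so both test_query.py and
# test_query_v2.py stop redefining it.
import pytest

# Placeholder values; mirror whatever shape the real tests use.
MOCK_AUTH = ("mock_user_id", "mock_username", False, "mock_token")


@pytest.fixture
def mock_auth() -> tuple:
    """Shared auth tuple for endpoint tests."""
    return MOCK_AUTH
```

Tests can then take `mock_auth` as a parameter, and pytest resolves the fixture automatically from the shared `conftest.py`.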

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c9f1325 and 8b933b4.

📒 Files selected for processing (8)
  • docs/openapi.json (15 hunks)
  • src/app/endpoints/query.py (4 hunks)
  • src/app/endpoints/query_v2.py (2 hunks)
  • src/app/endpoints/streaming_query.py (4 hunks)
  • src/models/responses.py (1 hunks)
  • tests/unit/app/endpoints/test_query.py (2 hunks)
  • tests/unit/app/endpoints/test_query_v2.py (3 hunks)
  • tests/unit/app/endpoints/test_streaming_query.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/unit/app/endpoints/test_streaming_query.py
  • src/app/endpoints/query.py
  • src/app/endpoints/query_v2.py
🧰 Additional context used
🧬 Code graph analysis (3)
tests/unit/app/endpoints/test_query_v2.py (3)
tests/unit/app/endpoints/test_query.py (2)
  • test_query_endpoint_quota_exceeded (2268-2308)
  • dummy_request (60-69)
src/models/requests.py (1)
  • QueryRequest (73-225)
src/app/endpoints/query_v2.py (1)
  • query_endpoint_handler_v2 (284-306)
src/app/endpoints/streaming_query.py (1)
src/models/responses.py (3)
  • ForbiddenResponse (1120-1142)
  • UnauthorizedResponse (1094-1117)
  • QuotaExceededResponse (1145-1207)
tests/unit/app/endpoints/test_query.py (3)
tests/unit/app/endpoints/test_query_v2.py (2)
  • test_query_endpoint_quota_exceeded (447-483)
  • dummy_request (32-35)
tests/unit/app/endpoints/test_streaming_query.py (1)
  • test_query_endpoint_quota_exceeded (1800-1841)
src/app/endpoints/query.py (1)
  • query_endpoint_handler (442-464)
🔇 Additional comments (4)
src/app/endpoints/streaming_query.py (1)

108-111: Good: documented 429 with QuotaExceededResponse

429 mapping added to the OpenAPI responses for streaming. LGTM.

docs/openapi.json (2)

378-387: v1 query: 429 quota response added

The 429 path referencing QuotaExceededResponse is correct and matches the server behavior. LGTM.


1232-1317: Implementation verified and aligned with spec

All verification points confirmed:

  • Handler query_endpoint_handler_v2 exists and is routed via @router.post("/query", responses=query_v2_response) in src/app/endpoints/query_v2.py
  • QuotaExceededResponse is imported and wired to the 429 response code in the same file
  • Test coverage includes test_query_endpoint_quota_exceeded which validates the 429 status and error response body

The implementation mirrors v1 and docs are consistent with the codebase.

tests/unit/app/endpoints/test_query.py (1)

2267-2308: LGTM! Quota-exceeded test for v1 endpoint is well-structured.

The test correctly simulates a RateLimitError from the agent and verifies the HTTP 429 response with appropriate error details.

@asimurka asimurka force-pushed the exceeded_quota_error_handling branch 4 times, most recently from d4ed9b2 to adc06b4 Compare November 10, 2025 11:55

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (2)
src/models/responses.py (1)

1145-1211: Address wording inconsistency and unused constructor parameters.

The class has several design issues:

  1. Inconsistent response wording: The constructor uses "The quota has been exceeded" (line 1157), but endpoint code uses "Model quota exceeded" (streaming_query.py line 939), and examples include both variants (lines 1168, 1204).

  2. Unused constructor parameters: model_name and limit are marked as unused (lines 1151-1152), suggesting the constructor signature doesn't match the intended use cases.

  3. TODO indicates incomplete implementation: Line 1161 mentions factories are needed, confirming this is a work-in-progress.

Given the TODO comment and the fact that endpoints are manually constructing details rather than using this class, consider either:

Option 1: Update the constructor to support model-level quota errors:

     def __init__(
         self,
         user_id: str,
-        model_name: str,  # pylint: disable=unused-argument
-        limit: int,  # pylint: disable=unused-argument
+        model_name: str | None = None,
+        limit: int | None = None,
     ):
         """Initialize a QuotaExceededResponse."""
+        if model_name:
+            response_msg = "Model quota exceeded"
+            cause_msg = f"The token quota for model {model_name} has been exceeded."
+        else:
+            response_msg = "Quota exceeded"
+            cause_msg = f"User {user_id} has no available tokens."
         super().__init__(
             detail=DetailModel(
-                response="The quota has been exceeded",
-                cause=(f"User {user_id} has no available tokens."),
+                response=response_msg,
+                cause=cause_msg,
             )
         )

Option 2: Add a factory method as suggested in past reviews:

    @classmethod
    def for_model(cls, model_name: str) -> "QuotaExceededResponse":
        """Create a quota response for model-level quota exceedance."""
        obj = cls.__new__(cls)
        obj.detail = DetailModel(
            response="Model quota exceeded",
            cause=f"The token quota for model {model_name} has been exceeded.",
        )
        return obj

Then endpoints can use: raise HTTPException(status_code=429, detail=QuotaExceededResponse.for_model(used_model).dump_detail())

tests/unit/app/endpoints/test_query_v2.py (1)

446-479: Fix incomplete mock setup to match successful test pattern.

The test has several missing mocks compared to the successful test at lines 363-413, which will likely cause it to fail or not test the intended code path:

  1. Missing configuration mock (compare lines 368-371): The handler expects configuration to be loaded
  2. Missing quota function mocks (compare lines 397-399): Quota checks are performed before the RateLimitError is raised
  3. Missing request state (compare line 402): The handler accesses request.state.authorized_actions
  4. Missing database mock (compare line 394): The handler accesses the database session
  5. Missing other required mocks: Model hints, conversation ownership, etc.

Apply this diff to fix the mock setup:

 @pytest.mark.asyncio
 async def test_query_endpoint_quota_exceeded(
     mocker: MockerFixture, dummy_request: Request
 ) -> None:
     """Test that query endpoint raises HTTP 429 when model quota is exceeded."""
+    # Mock configuration
+    mock_config = mocker.Mock()
+    mock_config.llama_stack_configuration = mocker.Mock()
+    mock_config.quota_limiters = []
+    mocker.patch("app.endpoints.query_v2.configuration", mock_config)
+    
     query_request = QueryRequest(
         query="What is OpenStack?",
         provider="openai",
         model="gpt-4-turbo",
         attachments=[],
     )  # type: ignore
     mock_client = mocker.AsyncMock()
     mock_client.responses.create.side_effect = RateLimitError(
         model="gpt-4-turbo", llm_provider="openai", message=""
     )
+    mock_client.models.list = mocker.AsyncMock(return_value=[mocker.Mock()])
+    
+    mocker.patch("app.endpoints.query.evaluate_model_hints", return_value=(None, None))
     mocker.patch(
         "app.endpoints.query.select_model_and_provider_id",
         return_value=("openai/gpt-4-turbo", "gpt-4-turbo", "openai"),
     )
     mocker.patch("app.endpoints.query.validate_model_provider_override")
     mocker.patch(
         "client.AsyncLlamaStackClientHolder.get_client",
         return_value=mock_client,
     )
+    
+    # Mock quota functions
+    mocker.patch("utils.quota.check_tokens_available")
+    mocker.patch("utils.quota.consume_tokens")
+    mocker.patch("utils.quota.get_available_quotas", return_value={})
+    
+    # Mock database and other functions
+    mocker.patch("app.endpoints.query.get_session")
+    mocker.patch("app.endpoints.query.is_transcripts_enabled", return_value=False)
+    mocker.patch("app.endpoints.query.persist_user_conversation_details")
+    mocker.patch("utils.endpoints.store_conversation_into_cache")
+    
+    # Mock request state
+    dummy_request.state.authorized_actions = []

     with pytest.raises(HTTPException) as exc_info:
         await query_endpoint_handler_v2(
             dummy_request, query_request=query_request, auth=MOCK_AUTH
         )
     assert exc_info.value.status_code == status.HTTP_429_TOO_MANY_REQUESTS
     detail = exc_info.value.detail
     assert isinstance(detail, dict)
     assert detail["response"] == "Model quota exceeded"  # type: ignore
     assert "gpt-4-turbo" in detail["cause"]  # type: ignore
🧹 Nitpick comments (1)
src/app/endpoints/streaming_query.py (1)

934-942: Consider using QuotaExceededResponse for consistency.

The exception handler manually constructs the detail dictionary, but the codebase has introduced QuotaExceededResponse specifically for this purpose. While the current implementation works, using the response model would ensure consistency with the OpenAPI schema and make the response message match the documented examples.

Consider this alternative approach:

     except RateLimitError as e:
         used_model = getattr(e, "model", "unknown")
+        # Note: QuotaExceededResponse expects user_id but we're handling model quotas
+        # The manual detail construction is acceptable here for model-level quotas
         raise HTTPException(
             status_code=status.HTTP_429_TOO_MANY_REQUESTS,
             detail={
                 "response": "Model quota exceeded",
                 "cause": f"The token quota for model {used_model} has been exceeded.",
             },
         ) from e

Alternatively, if you want to use the response model consistently, you could extend QuotaExceededResponse with a factory method as mentioned in past review comments (see src/models/responses.py).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8b933b4 and adc06b4.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (9)
  • docs/openapi.json (15 hunks)
  • pyproject.toml (1 hunks)
  • src/app/endpoints/query.py (4 hunks)
  • src/app/endpoints/query_v2.py (2 hunks)
  • src/app/endpoints/streaming_query.py (4 hunks)
  • src/models/responses.py (1 hunks)
  • tests/unit/app/endpoints/test_query.py (2 hunks)
  • tests/unit/app/endpoints/test_query_v2.py (3 hunks)
  • tests/unit/app/endpoints/test_streaming_query.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/unit/app/endpoints/test_query.py
  • src/app/endpoints/query_v2.py
  • src/app/endpoints/query.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-18T10:58:14.951Z
Learnt from: matysek
Repo: lightspeed-core/lightspeed-stack PR: 292
File: pyproject.toml:47-47
Timestamp: 2025-08-18T10:58:14.951Z
Learning: psycopg2-binary is required by some llama-stack providers in the lightspeed-stack project, so it cannot be replaced with psycopg v3 or moved to optional dependencies without breaking llama-stack functionality.

Applied to files:

  • pyproject.toml
🧬 Code graph analysis (3)
src/app/endpoints/streaming_query.py (1)
src/models/responses.py (1)
  • QuotaExceededResponse (1145-1210)
tests/unit/app/endpoints/test_streaming_query.py (3)
tests/unit/app/endpoints/test_query.py (1)
  • test_query_endpoint_quota_exceeded (2268-2307)
tests/unit/app/endpoints/test_query_v2.py (1)
  • test_query_endpoint_quota_exceeded (447-479)
src/app/endpoints/streaming_query.py (1)
  • streaming_query_endpoint_handler (709-957)
tests/unit/app/endpoints/test_query_v2.py (3)
tests/unit/app/endpoints/test_query.py (2)
  • test_query_endpoint_quota_exceeded (2268-2307)
  • dummy_request (60-69)
src/models/requests.py (1)
  • QueryRequest (73-225)
src/app/endpoints/query_v2.py (1)
  • query_endpoint_handler_v2 (284-306)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-pr
  • GitHub Check: e2e_tests (azure)
  • GitHub Check: e2e_tests (ci)
🔇 Additional comments (11)
tests/unit/app/endpoints/test_streaming_query.py (2)

9-9: LGTM!

The import of RateLimitError from litellm.exceptions is correctly placed and necessary for the quota-exceeded test scenarios.


1799-1841: Test correctly matches implementation—no issues found.

The verification confirms the test structure and mocks accurately reflect the actual code path:

  • RateLimitError is properly caught at line 934 of streaming_query.py
  • The model is extracted via getattr(e, "model", "unknown") and will correctly retrieve "gpt-4-turbo" from the mock
  • The HTTPException detail structure matches the test assertions exactly
  • All assertions (status_code, detail["response"], and detail["cause"] containing "gpt-4-turbo") will pass correctly
src/app/endpoints/streaming_query.py (3)

11-11: LGTM!

The import of RateLimitError is correctly placed and necessary for quota exception handling.


52-56: LGTM!

The imports for response models are correctly structured and include the new QuotaExceededResponse for quota handling.


112-115: LGTM!

The 429 response mapping is correctly added with appropriate description and model reference.

tests/unit/app/endpoints/test_query_v2.py (2)

5-5: LGTM!

The import of RateLimitError from litellm.exceptions is correctly placed and necessary for the quota-exceeded test.


22-28: LGTM!

The MOCK_AUTH constant is well-defined and properly uses a valid UUID format for the user ID.

docs/openapi.json (4)

1232-1318: New v2/query endpoint documentation is complete and properly structured.

The /v2/query endpoint is well-documented with operation ID, proper request/response schemas, comprehensive error responses (400/403/429/500/422), and example payloads. The endpoint mirrors v1/query with the 429 quota-exceeded response properly integrated.


693-722: Added authentication error responses (400/401/403) to PUT /v1/feedback/status.

The PUT endpoint now documents 400 (missing/invalid credentials), 401 (invalid bearer token), and 403 (permission denied) error responses. Verify these error cases are actually caught and handled in the implementation, or clarify why feedback status updates require these authentication checks when the GET endpoint (line 651) does not.


1576-1643: Consolidated authentication/authorization configuration schemas verified as correctly implemented.

Python model definitions in src/models/config.py use the consolidated schema names (AuthenticationConfiguration, AuthorizationConfiguration, JwtConfiguration, JwkConfiguration) without any lingering -Input or -Output variants. All references throughout the codebase consistently use these consolidated names. The OpenAPI schema and Python models are aligned.


378-387: Quota-exceeded error handling is properly implemented across all three query endpoints.

Verification confirms:

  • v1/query (query.py): Directly catches RateLimitError and raises HTTP 429 ✓
  • v1/streaming_query (streaming_query.py): Directly catches RateLimitError and raises HTTP 429 ✓
  • v2/query (query_v2.py): Delegates to query_endpoint_handler_base (query.py), which catches RateLimitError and raises HTTP 429 ✓

All three endpoints consistently map quota errors to HTTP 429 responses with proper QuotaExceededResponse schema, matching the OpenAPI documentation. Unit tests exist for all three endpoints validating the 429 response.

@asimurka asimurka force-pushed the exceeded_quota_error_handling branch 2 times, most recently from 258c023 to ba36edc on November 10, 2025 at 12:14
@asimurka asimurka force-pushed the exceeded_quota_error_handling branch from ba36edc to 3ed6095 on November 11, 2025 at 08:12
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
tests/unit/app/endpoints/test_query_v2.py (1)

446-479: Fix v2 quota-exceeded test: missing patches and arg cause false failures

The test omits required setup used by the passing v2 test (lines 368-405): configuration, quotas, models.list, evaluate_model_hints, request.state.authorized_actions, and get_session. It also doesn’t pass mcp_headers, so Depends won’t resolve in direct calls.

Apply this diff:

 @pytest.mark.asyncio
 async def test_query_endpoint_quota_exceeded(
     mocker: MockerFixture, dummy_request: Request
 ) -> None:
     """Test that query endpoint raises HTTP 429 when model quota is exceeded."""
+    # Mock configuration (avoid "configuration not loaded")
+    mock_config = mocker.Mock()
+    mock_config.llama_stack_configuration = mocker.Mock()
+    mock_config.quota_limiters = []
+    mocker.patch("app.endpoints.query_v2.configuration", mock_config)
+
     query_request = QueryRequest(
         query="What is OpenStack?",
         provider="openai",
         model="gpt-4-turbo",
-        attachments=[],
     )  # type: ignore
     mock_client = mocker.AsyncMock()
+    # models.list is awaited by the handler before select_model_and_provider_id
+    mock_client.models.list = mocker.AsyncMock(return_value=[mocker.Mock()])
     mock_client.responses.create.side_effect = RateLimitError(
         model="gpt-4-turbo", llm_provider="openai", message=""
     )
+    # Keep model/provider selection deterministic
+    mocker.patch("app.endpoints.query.evaluate_model_hints", return_value=(None, None))
     mocker.patch(
         "app.endpoints.query.select_model_and_provider_id",
         return_value=("openai/gpt-4-turbo", "gpt-4-turbo", "openai"),
     )
     mocker.patch("app.endpoints.query.validate_model_provider_override")
     mocker.patch(
         "client.AsyncLlamaStackClientHolder.get_client",
         return_value=mock_client,
     )
+    # Quota and DB dependencies used by the base handler
+    mocker.patch("utils.quota.check_tokens_available")
+    mocker.patch("utils.quota.consume_tokens")
+    mocker.patch("utils.quota.get_available_quotas", return_value={})
+    mocker.patch("app.endpoints.query.get_session")
+
+    # Request state required by validate_model_provider_override
+    dummy_request.state.authorized_actions = []
 
     with pytest.raises(HTTPException) as exc_info:
         await query_endpoint_handler_v2(
-            dummy_request, query_request=query_request, auth=MOCK_AUTH
+            dummy_request,
+            query_request=query_request,
+            auth=MOCK_AUTH,
+            mcp_headers={},  # bypass Depends in direct call
         )
     assert exc_info.value.status_code == status.HTTP_429_TOO_MANY_REQUESTS
     detail = exc_info.value.detail
     assert isinstance(detail, dict)
     assert detail["response"] == "Model quota exceeded"  # type: ignore
     assert "gpt-4-turbo" in detail["cause"]  # type: ignore
🧹 Nitpick comments (4)
tests/unit/app/endpoints/test_query_v2.py (1)

22-28: Local MOCK_AUTH is fine; consider a shared fixture to avoid duplication

If multiple test modules define MOCK_AUTH, move it to a conftest.py fixture/constants module.

src/app/endpoints/streaming_query.py (2)

112-115: 429 response wired into OpenAPI

Good addition. Consider aligning the “response” message with runtime text (“Model quota exceeded” vs “The quota has been exceeded”) for consistency across endpoints and docs.


934-942: Improve 429 handling: add metric and optional Retry-After header

Increment failure metric on rate limits and include Retry-After when available from the exception. Keep payload unchanged to satisfy existing tests.

-    except RateLimitError as e:
-        used_model = getattr(e, "model", "unknown")
-        raise HTTPException(
-            status_code=status.HTTP_429_TOO_MANY_REQUESTS,
-            detail={
-                "response": "Model quota exceeded",
-                "cause": f"The token quota for model {used_model} has been exceeded.",
-            },
-        ) from e
+    except RateLimitError as e:
+        used_model = getattr(e, "model", "unknown")
+        # Record as failure for observability
+        metrics.llm_calls_failures_total.inc()
+        # Some providers expose retry interval; attach when present
+        retry_after = getattr(e, "retry_after", None)
+        headers = {"Retry-After": str(retry_after)} if retry_after else None
+        raise HTTPException(
+            status_code=status.HTTP_429_TOO_MANY_REQUESTS,
+            detail={
+                "response": "Model quota exceeded",
+                "cause": f"The token quota for model {used_model} has been exceeded.",
+            },
+            headers=headers,
+        ) from e
docs/openapi.json (1)

3610-3666: QuotaExceededResponse schema is clear; unify example phrasing

Examples mix “The quota has been exceeded” and “The model quota has been exceeded”. Pick one canonical phrasing or document both intentionally, then align handlers/tests.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between adc06b4 and 3ed6095.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (9)
  • docs/openapi.json (4 hunks)
  • pyproject.toml (1 hunks)
  • src/app/endpoints/query.py (4 hunks)
  • src/app/endpoints/query_v2.py (2 hunks)
  • src/app/endpoints/streaming_query.py (4 hunks)
  • src/models/responses.py (1 hunks)
  • tests/unit/app/endpoints/test_query.py (2 hunks)
  • tests/unit/app/endpoints/test_query_v2.py (3 hunks)
  • tests/unit/app/endpoints/test_streaming_query.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
  • tests/unit/app/endpoints/test_query.py
  • src/app/endpoints/query.py
  • src/models/responses.py
  • tests/unit/app/endpoints/test_streaming_query.py
  • src/app/endpoints/query_v2.py
  • pyproject.toml
🧰 Additional context used
🧬 Code graph analysis (2)
src/app/endpoints/streaming_query.py (1)
src/models/responses.py (1)
  • QuotaExceededResponse (1145-1210)
tests/unit/app/endpoints/test_query_v2.py (4)
tests/unit/app/endpoints/test_query.py (2)
  • test_query_endpoint_quota_exceeded (2268-2307)
  • dummy_request (60-69)
tests/unit/app/endpoints/test_streaming_query.py (1)
  • test_query_endpoint_quota_exceeded (1800-1839)
src/models/requests.py (1)
  • QueryRequest (73-225)
src/app/endpoints/query_v2.py (1)
  • query_endpoint_handler_v2 (284-306)
🔇 Additional comments (3)
tests/unit/app/endpoints/test_query_v2.py (1)

5-5: Import looks good

Using litellm.exceptions.RateLimitError for simulation is correct.

src/app/endpoints/streaming_query.py (2)

11-11: LGTM on importing RateLimitError

Matches the new 429 mapping below.


52-56: OpenAPI model import OK

Importing QuotaExceededResponse for responses mapping is correct.

Comment on lines +378 to +387
"429": {
    "description": "The quota has been exceeded",
    "content": {
        "application/json": {
            "schema": {
                "$ref": "#/components/schemas/QuotaExceededResponse"
            }
        }
    }
},
Contributor

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

429 responses added across endpoints

Good. Ensure the documented “description”/examples match the actual response.detail.response used at runtime (“Model quota exceeded” in code/tests vs “The quota has been exceeded” here).

Run this to check for message mismatches:

Also applies to: 474-483, 1279-1288


🏁 Script executed:

#!/bin/bash
# Find runtime 429 messages vs docs wording
rg -n -C1 -S "Model quota exceeded|The quota has been exceeded" --type py --type json

Length of output: 6321


429 response messages in docs/openapi.json do not match actual runtime behavior

The OpenAPI documentation documents "The quota has been exceeded" but the actual endpoints return "Model quota exceeded". Update the description and examples in docs/openapi.json at lines 378-387, 474-483, and 1279-1288 to use "Model quota exceeded" to match the actual response sent by src/app/endpoints/streaming_query.py:939 and src/app/endpoints/query.py:435, as confirmed by tests.

🤖 Prompt for AI Agents
In docs/openapi.json around lines 378-387, 474-483, and 1279-1288, the 429
response descriptions and examples currently read "The quota has been exceeded"
but must match runtime behavior which returns "Model quota exceeded"; update the
description fields and any example values at those line ranges to use the exact
text "Model quota exceeded" (and adjust any surrounding punctuation/formatting
to remain valid JSON), ensuring the schema references remain unchanged.
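The replacement described above can be sketched as a one-off script; the recursive traversal and the exact OLD/NEW strings are assumptions about how the `description` fields in docs/openapi.json are laid out:

```python
import json

# Outdated wording documented in the OpenAPI file vs. the message the
# endpoints actually return (per src/app/endpoints/query.py and tests).
OLD = "The quota has been exceeded"
NEW = "Model quota exceeded"


def fix_429_descriptions(node) -> int:
    """Recursively replace OLD with NEW in 'description' fields; return count."""
    changed = 0
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "description" and value == OLD:
                node[key] = NEW
                changed += 1
            else:
                changed += fix_429_descriptions(value)
    elif isinstance(node, list):
        for item in node:
            changed += fix_429_descriptions(item)
    return changed


# Usage against the real file would be:
#   doc = json.load(open("docs/openapi.json"))
#   fix_429_descriptions(doc)
#   json.dump(doc, open("docs/openapi.json", "w"), indent=4)
sample = {"responses": {"429": {"description": OLD}}}
print(fix_429_descriptions(sample))  # → 1
```

Whether to keep "The quota has been exceeded" for user-level quotas and "Model quota exceeded" for model-level quotas, or to collapse to one phrasing, is the consistency question raised in the nitpick on the QuotaExceededResponse schema.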

Contributor

@tisnik tisnik left a comment

LGTM, sounds ok

@tisnik
Contributor

tisnik commented Nov 11, 2025

/ok-to-test

@tisnik
Contributor

tisnik commented Nov 11, 2025

/ok-to-test
