LCORE-410: Model quota limit exception handling #772
base: main

Conversation
Walkthrough

Adds a new public QuotaExceededResponse schema and wires HTTP 429 quota-exceeded handling into the v1/v2 query and streaming endpoints by catching RateLimitError and returning QuotaExceededResponse; updates the OpenAPI docs and tests; moves litellm to top-level dependencies.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant C as Client
    participant E as Query Endpoint (v1/v2/streaming)
    participant A as Handler/Agent
    participant L as LLM Provider
    participant H as Error Handler
    C->>E: POST /v1/query or /v2/query or start streaming
    E->>A: process request / create turn
    A->>L: invoke model
    alt RateLimitError raised
        L-->>A: raises RateLimitError
        A-->>E: propagate RateLimitError
        E->>H: catch RateLimitError
        H-->>E: build QuotaExceededResponse(user_id, model_name, limit)
        E-->>C: HTTP 429 {"detail": "...", "cause": "...model_name..."}
    else Success
        L-->>A: model response
        A-->>E: normal response
        E-->>C: HTTP 200 / stream
    end
```
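The 429 translation shown in the diagram can be sketched in plain Python. This is a minimal stand-alone illustration: the `RateLimitError` and `HTTPException` classes below are stand-ins for the litellm and FastAPI types, and `invoke_model` is a hypothetical placeholder for the LLM call.

```python
class RateLimitError(Exception):
    """Stand-in for litellm.exceptions.RateLimitError (assumed to carry `model`)."""

    def __init__(self, model: str):
        super().__init__(f"rate limited: {model}")
        self.model = model


class HTTPException(Exception):
    """Stand-in for fastapi.HTTPException."""

    def __init__(self, status_code: int, detail: dict):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def invoke_model(model: str) -> str:
    # Hypothetical provider call; here it always hits the quota.
    raise RateLimitError(model=model)


def query_endpoint(model: str) -> dict:
    """Translate a provider rate-limit error into a structured HTTP 429."""
    try:
        return {"response": invoke_model(model)}
    except RateLimitError as e:
        # getattr guards against provider exceptions lacking the attribute.
        model_name = getattr(e, "model", "unknown")
        raise HTTPException(
            status_code=429,
            detail={
                "response": "Model quota exceeded",
                "cause": f"The token quota for model {model_name} has been exceeded.",
            },
        ) from e
```

Clients then receive the 429 with a machine-readable `detail` body instead of an opaque 500.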
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

Hi @asimurka. Thanks for your PR. I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with the appropriate command; once the patch is verified, the new status will be reflected. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
2b65716 to c9f1325 (Compare)
Actionable comments posted: 4
🧹 Nitpick comments (3)
src/models/responses.py (1)
1145-1207: New QuotaExceededResponse: unify message text; remove trailing whitespace

- Unify the `detail.response` wording with your examples and endpoint responses. Pick one canonical phrase and use it everywhere. For model-specific cases, consider a helper like `QuotaExceededResponse.model_limit(model)` to format the cause consistently.
- Strip trailing whitespace at line 1159 to fix the linter.

```diff
- response="Quota exceeded",
+ response="The quota has been exceeded",
```

src/app/endpoints/streaming_query.py (1)
930-937: Translate RateLimitError to HTTP 429 — consider message consistency and mid‑stream handling
- The response uses “Model quota exceeded”; schemas and models use variants. Consider standardizing.
- Optional: if RateLimitError can occur after streaming starts, catch it inside `response_generator` and emit an OLS "error" SSE to avoid abrupt stream termination.

tests/unit/app/endpoints/test_query.py (1)
2267-2308: Unify patch target with other tests; remove trailing whitespace

All other tests in the file patch `client.AsyncLlamaStackClientHolder.get_client` (lines 191, 1445, 1474, 1526, 1585), but this test uses `app.endpoints.query.AsyncLlamaStackClientHolder.get_client`. Update to match the consistent pattern. Trailing whitespace confirmed on line 2298 (blank line) and lines 2301-2302 (after commas).

Apply:

```diff
-    mocker.patch(
-        "app.endpoints.query.AsyncLlamaStackClientHolder.get_client",
-        return_value=mock_client
-    )
-    mocker.patch(
-        "app.endpoints.query.handle_mcp_headers_with_toolgroups",
-        return_value={}
-    )
-
-    with pytest.raises(HTTPException) as exc_info:
-        await query_endpoint_handler(
-            dummy_request,
-            query_request=query_request,
-            auth=MOCK_AUTH
+    mocker.patch(
+        "client.AsyncLlamaStackClientHolder.get_client",
+        return_value=mock_client,
+    )
+    mocker.patch(
+        "app.endpoints.query.handle_mcp_headers_with_toolgroups",
+        return_value={},
+    )
+
+    with pytest.raises(HTTPException) as exc_info:
+        await query_endpoint_handler(
+            dummy_request,
+            query_request=query_request,
+            auth=MOCK_AUTH
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- docs/openapi.json (15 hunks)
- src/app/endpoints/query.py (4 hunks)
- src/app/endpoints/query_v2.py (2 hunks)
- src/app/endpoints/streaming_query.py (4 hunks)
- src/models/responses.py (1 hunks)
- tests/unit/app/endpoints/test_query.py (2 hunks)
- tests/unit/app/endpoints/test_query_v2.py (3 hunks)
- tests/unit/app/endpoints/test_streaming_query.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
tests/unit/app/endpoints/test_streaming_query.py (4)
- tests/unit/app/endpoints/test_query.py (1): test_query_endpoint_quota_exceeded (2268-2308)
- tests/unit/app/endpoints/test_query_v2.py (1): test_query_endpoint_quota_exceeded (448-484)
- src/models/requests.py (1): QueryRequest (73-225)
- src/app/endpoints/streaming_query.py (1): streaming_query_endpoint_handler (705-952)

src/app/endpoints/query_v2.py (1)
- src/models/responses.py (1): QuotaExceededResponse (1145-1207)

src/app/endpoints/query.py (1)
- src/models/responses.py (1): QuotaExceededResponse (1145-1207)

tests/unit/app/endpoints/test_query.py (4)
- tests/unit/app/endpoints/test_query_v2.py (2): test_query_endpoint_quota_exceeded (448-484), dummy_request (33-36)
- tests/unit/app/endpoints/test_streaming_query.py (1): test_query_endpoint_quota_exceeded (1800-1841)
- src/models/requests.py (1): QueryRequest (73-225)
- src/app/endpoints/query.py (1): query_endpoint_handler (442-464)

src/app/endpoints/streaming_query.py (1)
- src/models/responses.py (1): QuotaExceededResponse (1145-1207)

tests/unit/app/endpoints/test_query_v2.py (3)
- tests/unit/app/endpoints/test_query.py (2): test_query_endpoint_quota_exceeded (2268-2308), dummy_request (60-69)
- src/models/requests.py (1): QueryRequest (73-225)
- src/app/endpoints/query_v2.py (1): query_endpoint_handler_v2 (284-306)
🪛 GitHub Actions: Black

[error] Black formatting check failed: 7 files would be reformatted by 'black' (step: uv tool run black --check .):
- tests/unit/app/endpoints/test_streaming_query.py
- src/app/endpoints/query_v2.py
- src/models/responses.py
- src/app/endpoints/query.py
- tests/unit/app/endpoints/test_query.py
- src/app/endpoints/streaming_query.py
- tests/unit/app/endpoints/test_query_v2.py
🪛 GitHub Actions: Integration tests
src/app/endpoints/query.py
[error] 11-11: ModuleNotFoundError: No module named 'litellm'. The OpenAPI/test fixtures import litellm; ensure the dependency is installed (e.g., add to requirements and reinstall).
🪛 GitHub Actions: Python linter
tests/unit/app/endpoints/test_streaming_query.py
[error] 1831-1831: Trailing whitespace (C0303).
[error] 1834-1834: Trailing whitespace (C0303).
[error] 1835-1835: Trailing whitespace (C0303).
src/models/responses.py
[error] 1159-1159: Trailing whitespace (C0303) in module.
tests/unit/app/endpoints/test_query.py
[error] 2298-2298: Trailing whitespace (C0303).
[error] 2301-2301: Trailing whitespace (C0303).
[error] 2302-2302: Trailing whitespace (C0303).
tests/unit/app/endpoints/test_query_v2.py
[error] 474-474: Trailing whitespace (C0303).
[error] 477-477: Trailing whitespace (C0303).
[error] 478-478: Trailing whitespace (C0303).
[error] 11-11: Reimport 'MockerFixture' (reimported). (W0404)
[error] 11-11: Imports from package pytest_mock are not grouped (C0412).
🪛 GitHub Actions: Ruff
tests/unit/app/endpoints/test_query_v2.py
[error] 7-11: ruff check failed: F811 redefinition of unused 'MockerFixture'. The name 'MockerFixture' is defined previously in this file. Fixable with '--fix'. Command: 'uv tool run ruff check . --per-file-ignores=tests/:S101 --per-file-ignores=scripts/:S101'
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: e2e_tests (ci)
- GitHub Check: e2e_tests (azure)
🔇 Additional comments (6)
tests/unit/app/endpoints/test_query.py (1)
13-13: Import OK

Using `litellm.exceptions.RateLimitError` in tests is appropriate for simulating provider rate limits.
tests/unit/app/endpoints/test_query_v2.py (2)
23-29: MOCK_AUTH constant: OK

Consistent with other tests; good reuse.
447-484: Revert patch target recommendation; current path is correct

The patch target in the test is already correct. Since `app.endpoints.query` imports the class directly (`from client import AsyncLlamaStackClientHolder`), you must patch it where it's used:

```python
mocker.patch(
    "app.endpoints.query.AsyncLlamaStackClientHolder.get_client",
    return_value=mock_client
)
```

Patching `client.AsyncLlamaStackClientHolder.get_client` instead would not affect the imported reference in the query module and the mock would not be applied. Do not change the patch target.

Trim trailing whitespace on lines 474, 477, 478 as flagged by CI.
Likely an incorrect or invalid review comment.
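The mechanics behind this patch-target debate (patch the name in the namespace where the consuming module looks it up, not where it was defined) can be demonstrated with two throwaway stand-in modules; the module names here are illustrative only.

```python
import sys
import types
from unittest.mock import patch

# Stand-in "client" module exposing get_client.
client_mod = types.ModuleType("client_demo")
client_mod.get_client = lambda: "real-client"
sys.modules["client_demo"] = client_mod

# Stand-in "endpoint" module that imported the function directly,
# i.e. the equivalent of `from client import get_client`.
endpoint_mod = types.ModuleType("endpoint_demo")
endpoint_mod.get_client = client_mod.get_client
endpoint_mod.handler = lambda: endpoint_mod.get_client()
sys.modules["endpoint_demo"] = endpoint_mod

# Patching the defining module does NOT affect the imported reference:
with patch("client_demo.get_client", return_value="mock-client"):
    assert endpoint_mod.handler() == "real-client"

# Patching the name where it is used does:
with patch("endpoint_demo.get_client", return_value="mock-client"):
    assert endpoint_mod.handler() == "mock-client"
```

Whether the consuming module uses `from client import ...` (patch the use site) or `import client` plus attribute access (patch the source) is what decides the correct target.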
tests/unit/app/endpoints/test_streaming_query.py (1)
9-9: Import OK

Using `RateLimitError` to simulate provider throttling is correct.
src/app/endpoints/streaming_query.py (1)
108-111: Expose 429 in OpenAPI — good

Adding 429 with `QuotaExceededResponse` improves docs and client generation.
src/app/endpoints/query_v2.py (1)
63-66: Based on an analysis of the codebase, there are quota message inconsistencies:
From the grep output:
- Tests consistently expect: "Model quota exceeded"
- OpenAPI descriptions use: "The quota has been exceeded"
- Schema examples show: "Quota exceeded" and "The quota has been exceeded"

The original review comment's concern about message alignment is valid. However, the OpenAPI description field is a human-readable explanation of the response that can contain details helpful to developers and end-users, and it's good practice to provide clear and concise descriptions. The description field and the actual response body content can differ in purpose.
That said, the inconsistency across multiple locations (tests, actual responses, schema examples, and descriptions) warrants developer attention to ensure clarity and maintainability.
429 response mapping added — align wording across code and tests
The OpenAPI 429 mapping is correct. However, there's inconsistency in quota error messages across the codebase: tests expect "Model quota exceeded", but OpenAPI description and schema examples use "The quota has been exceeded". Consider standardizing the message text across endpoints, tests, and examples to match the actual response field value returned to clients, ensuring clarity and consistency in error documentation.
c9f1325 to 8b933b4 (Compare)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/app/endpoints/streaming_query.py (1)
808-833: Handle RateLimitError inside the streaming generator

If RateLimitError is raised during iteration, it escapes after the 200 response has started, resulting in a broken stream instead of a structured quota error. Catch it inside the generator, emit an SSE error event, then end the stream.
Apply:
```diff
-        async for chunk in turn_response:
-            if chunk.event is None:
-                continue
-            p = chunk.event.payload
-            if p.event_type == "turn_complete":
-                summary.llm_response = interleaved_content_as_str(
-                    p.turn.output_message.content
-                )
-                latest_turn = p.turn
-                system_prompt = get_system_prompt(query_request, configuration)
-                try:
-                    update_llm_token_count_from_turn(
-                        p.turn, model_id, provider_id, system_prompt
-                    )
-                except Exception:  # pylint: disable=broad-except
-                    logger.exception("Failed to update token usage metrics")
-            elif p.event_type == "step_complete":
-                if p.step_details.step_type == "tool_execution":
-                    summary.append_tool_calls_from_llama(p.step_details)
-
-            for event in stream_build_event(
-                chunk, chunk_id, metadata_map, media_type, conversation_id
-            ):
-                chunk_id += 1
-                yield event
+        try:
+            async for chunk in turn_response:
+                if chunk.event is None:
+                    continue
+                p = chunk.event.payload
+                if p.event_type == "turn_complete":
+                    summary.llm_response = interleaved_content_as_str(
+                        p.turn.output_message.content
+                    )
+                    latest_turn = p.turn
+                    system_prompt = get_system_prompt(query_request, configuration)
+                    try:
+                        update_llm_token_count_from_turn(
+                            p.turn, model_id, provider_id, system_prompt
+                        )
+                    except Exception:  # pylint: disable=broad-except
+                        logger.exception("Failed to update token usage metrics")
+                elif p.event_type == "step_complete":
+                    if p.step_details.step_type == "tool_execution":
+                        summary.append_tool_calls_from_llama(p.step_details)
+
+                for event in stream_build_event(
+                    chunk, chunk_id, metadata_map, media_type, conversation_id
+                ):
+                    chunk_id += 1
+                    yield event
+        except RateLimitError as e:
+            # Cannot change HTTP status mid-stream; emit SSE error and stop.
+            model_name = getattr(e, "model", "unknown")
+            logger.warning("Rate limited while streaming for model %s", model_name)
+            if media_type == MEDIA_TYPE_TEXT:
+                yield (
+                    f"Error: Model quota exceeded: The token quota for model "
+                    f"{model_name} has been exceeded."
+                )
+            else:
+                yield format_stream_data(
+                    {
+                        "event": "error",
+                        "data": {
+                            "status_code": 429,
+                            "response": "Model quota exceeded",
+                            "cause": f"The token quota for model {model_name} has been exceeded.",
+                        },
+                    }
+                )
+            return
```

docs/openapi.json (1)
971-987: Type mismatch: last_message_timestamp example vs schema

ConversationData defines last_message_timestamp as number, but the example here uses an ISO string. Switch the example to a numeric epoch (e.g., 1704067200.0) or change the schema type.

```diff
- "last_message_timestamp": "2024-01-01T00:00:00Z"
+ "last_message_timestamp": 1704067200.0
```
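For reference, the numeric epoch equivalent of the ISO example can be checked with the standard library (the `replace("Z", "+00:00")` step keeps this portable to Python versions where `fromisoformat` does not yet accept a trailing `Z`):

```python
from datetime import datetime

# Convert the ISO-8601 example value to the numeric epoch the schema expects.
iso_value = "2024-01-01T00:00:00Z"
dt = datetime.fromisoformat(iso_value.replace("Z", "+00:00"))
epoch = dt.timestamp()
print(epoch)  # 1704067200.0
```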
🧹 Nitpick comments (4)
src/app/endpoints/streaming_query.py (1)
930-937: Harden RateLimitError handling and align message
- Use getattr for e.model to avoid attribute errors across versions.
- Consider aligning the response text with schema/examples ("Model quota exceeded").
```diff
-        except RateLimitError as e:
-            raise HTTPException(
-                status_code=status.HTTP_429_TOO_MANY_REQUESTS,
-                detail={
-                    "response": "Model quota exceeded",
-                    "cause": f"The token quota for model {e.model} has been exceeded.",
-                },
-            ) from e
+        except RateLimitError as e:
+            model_name = getattr(e, "model", "unknown")
+            raise HTTPException(
+                status_code=status.HTTP_429_TOO_MANY_REQUESTS,
+                detail={
+                    "response": "Model quota exceeded",
+                    "cause": f"The token quota for model {model_name} has been exceeded.",
+                },
+            ) from e
```

docs/openapi.json (2)
474-483: v1 streaming_query: clarify 429 semantics

The 429 schema is correct for pre-stream rejections. If rate limiting occurs mid-stream, the server will emit an SSE "error" event with HTTP 200 (after adopting generator handling). Consider adding a short note to avoid client confusion.
3620-3676: Align QuotaExceededResponse examples with server wording

Examples use "The quota has been exceeded"/"The model quota has been exceeded" while code returns "Model quota exceeded". Unify to a single phrasing for clients.

```diff
- "response": "The quota has been exceeded"
+ "response": "Quota exceeded"
@@
- "response": "The model quota has been exceeded"
+ "response": "Model quota exceeded"
```

tests/unit/app/endpoints/test_query_v2.py (1)
22-28: Consider sharing MOCK_AUTH across test modules.

This constant is duplicated in `tests/unit/app/endpoints/test_query.py` (lines 51-56). Consider moving it to a shared test fixture or conftest to avoid duplication.
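One way to de-duplicate the constant is a shared conftest. This is a hypothetical sketch: the file path and the shape of `MOCK_AUTH` below are placeholders, since the real tuple's contents aren't shown in this thread.

```python
# Hypothetical tests/unit/app/endpoints/conftest.py
import pytest

# Placeholder auth tuple; mirror whatever shape the real tests use.
MOCK_AUTH = (
    "00000000-0000-0000-0000-000000000000",  # user_id (placeholder UUID)
    "mock_username",
    False,
    "mock_token",
)


@pytest.fixture
def mock_auth():
    """Shared auth tuple for query/streaming endpoint tests."""
    return MOCK_AUTH
```

Tests in any module under the same directory can then take `mock_auth` as a parameter instead of redefining the constant.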
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- docs/openapi.json (15 hunks)
- src/app/endpoints/query.py (4 hunks)
- src/app/endpoints/query_v2.py (2 hunks)
- src/app/endpoints/streaming_query.py (4 hunks)
- src/models/responses.py (1 hunks)
- tests/unit/app/endpoints/test_query.py (2 hunks)
- tests/unit/app/endpoints/test_query_v2.py (3 hunks)
- tests/unit/app/endpoints/test_streaming_query.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- tests/unit/app/endpoints/test_streaming_query.py
- src/app/endpoints/query.py
- src/app/endpoints/query_v2.py
🧰 Additional context used
🧬 Code graph analysis (3)
tests/unit/app/endpoints/test_query_v2.py (3)
- tests/unit/app/endpoints/test_query.py (2): test_query_endpoint_quota_exceeded (2268-2308), dummy_request (60-69)
- src/models/requests.py (1): QueryRequest (73-225)
- src/app/endpoints/query_v2.py (1): query_endpoint_handler_v2 (284-306)

src/app/endpoints/streaming_query.py (1)
- src/models/responses.py (3): ForbiddenResponse (1120-1142), UnauthorizedResponse (1094-1117), QuotaExceededResponse (1145-1207)

tests/unit/app/endpoints/test_query.py (3)
- tests/unit/app/endpoints/test_query_v2.py (2): test_query_endpoint_quota_exceeded (447-483), dummy_request (32-35)
- tests/unit/app/endpoints/test_streaming_query.py (1): test_query_endpoint_quota_exceeded (1800-1841)
- src/app/endpoints/query.py (1): query_endpoint_handler (442-464)
🔇 Additional comments (4)
src/app/endpoints/streaming_query.py (1)
108-111: Good: documented 429 with QuotaExceededResponse

429 mapping added to the OpenAPI responses for streaming. LGTM.
docs/openapi.json (2)
378-387: v1 query: 429 quota response added

The 429 path referencing `QuotaExceededResponse` is correct and matches the server behavior. LGTM.
1232-1317: Implementation verified and aligned with spec

All verification points confirmed:
- Handler `query_endpoint_handler_v2` exists and is routed via `@router.post("/query", responses=query_v2_response)` in `src/app/endpoints/query_v2.py`
- `QuotaExceededResponse` is imported and wired to the 429 response code in the same file
- Test coverage includes `test_query_endpoint_quota_exceeded`, which validates the 429 status and error response body

The implementation mirrors v1 and docs are consistent with the codebase.
tests/unit/app/endpoints/test_query.py (1)
2267-2308: LGTM! Quota-exceeded test for v1 endpoint is well-structured.

The test correctly simulates a `RateLimitError` from the agent and verifies the HTTP 429 response with appropriate error details.
d4ed9b2 to adc06b4 (Compare)
Actionable comments posted: 2
♻️ Duplicate comments (2)
src/models/responses.py (1)
1145-1211: Address wording inconsistency and unused constructor parameters.

The class has several design issues:
Inconsistent response wording: The constructor uses "The quota has been exceeded" (line 1157), but endpoint code uses "Model quota exceeded" (streaming_query.py line 939), and examples include both variants (lines 1168, 1204).
Unused constructor parameters: `model_name` and `limit` are marked as unused (lines 1151-1152), suggesting the constructor signature doesn't match the intended use cases.

TODO indicates incomplete implementation: Line 1161 mentions factories are needed, confirming this is a work-in-progress.
Given the TODO comment and the fact that endpoints are manually constructing details rather than using this class, consider either:
Option 1: Update the constructor to support model-level quota errors:

```diff
 def __init__(
     self,
     user_id: str,
-    model_name: str,  # pylint: disable=unused-argument
-    limit: int,  # pylint: disable=unused-argument
+    model_name: str | None = None,
+    limit: int | None = None,
 ):
     """Initialize a QuotaExceededResponse."""
+    if model_name:
+        response_msg = "Model quota exceeded"
+        cause_msg = f"The token quota for model {model_name} has been exceeded."
+    else:
+        response_msg = "Quota exceeded"
+        cause_msg = f"User {user_id} has no available tokens."
     super().__init__(
         detail=DetailModel(
-            response="The quota has been exceeded",
-            cause=(f"User {user_id} has no available tokens."),
+            response=response_msg,
+            cause=cause_msg,
         )
     )
```
Option 2: Add a factory method as suggested in past reviews:

```python
@classmethod
def for_model(cls, model_name: str) -> "QuotaExceededResponse":
    """Create a quota response for model-level quota exceedance."""
    obj = cls.__new__(cls)
    obj.detail = DetailModel(
        response="Model quota exceeded",
        cause=f"The token quota for model {model_name} has been exceeded.",
    )
    return obj
```
Then endpoints can use:

```python
raise HTTPException(
    status_code=429,
    detail=QuotaExceededResponse.for_model(used_model).dump_detail(),
)
```

tests/unit/app/endpoints/test_query_v2.py (1)
446-479: Fix incomplete mock setup to match successful test pattern.

The test has several missing mocks compared to the successful test at lines 363-413, which will likely cause it to fail or not test the intended code path:
- Missing configuration mock (compare lines 368-371): the handler expects `configuration` to be loaded
- Missing quota function mocks (compare lines 397-399): quota checks are performed before the RateLimitError is raised
- Missing request state (compare line 402): the handler accesses `request.state.authorized_actions`
- Missing database mock (compare line 394): the handler accesses the database session
- Missing other required mocks: model hints, conversation ownership, etc.
Apply this diff to fix the mock setup:
```diff
 @pytest.mark.asyncio
 async def test_query_endpoint_quota_exceeded(
     mocker: MockerFixture, dummy_request: Request
 ) -> None:
     """Test that query endpoint raises HTTP 429 when model quota is exceeded."""
+    # Mock configuration
+    mock_config = mocker.Mock()
+    mock_config.llama_stack_configuration = mocker.Mock()
+    mock_config.quota_limiters = []
+    mocker.patch("app.endpoints.query_v2.configuration", mock_config)
+
     query_request = QueryRequest(
         query="What is OpenStack?",
         provider="openai",
         model="gpt-4-turbo",
         attachments=[],
     )  # type: ignore

     mock_client = mocker.AsyncMock()
     mock_client.responses.create.side_effect = RateLimitError(
         model="gpt-4-turbo", llm_provider="openai", message=""
     )
+    mock_client.models.list = mocker.AsyncMock(return_value=[mocker.Mock()])
+
+    mocker.patch("app.endpoints.query.evaluate_model_hints", return_value=(None, None))

     mocker.patch(
         "app.endpoints.query.select_model_and_provider_id",
         return_value=("openai/gpt-4-turbo", "gpt-4-turbo", "openai"),
     )
     mocker.patch("app.endpoints.query.validate_model_provider_override")
     mocker.patch(
         "client.AsyncLlamaStackClientHolder.get_client",
         return_value=mock_client,
     )
+
+    # Mock quota functions
+    mocker.patch("utils.quota.check_tokens_available")
+    mocker.patch("utils.quota.consume_tokens")
+    mocker.patch("utils.quota.get_available_quotas", return_value={})
+
+    # Mock database and other functions
+    mocker.patch("app.endpoints.query.get_session")
+    mocker.patch("app.endpoints.query.is_transcripts_enabled", return_value=False)
+    mocker.patch("app.endpoints.query.persist_user_conversation_details")
+    mocker.patch("utils.endpoints.store_conversation_into_cache")
+
+    # Mock request state
+    dummy_request.state.authorized_actions = []

     with pytest.raises(HTTPException) as exc_info:
         await query_endpoint_handler_v2(
             dummy_request, query_request=query_request, auth=MOCK_AUTH
         )

     assert exc_info.value.status_code == status.HTTP_429_TOO_MANY_REQUESTS
     detail = exc_info.value.detail
     assert isinstance(detail, dict)
     assert detail["response"] == "Model quota exceeded"  # type: ignore
     assert "gpt-4-turbo" in detail["cause"]  # type: ignore
```
🧹 Nitpick comments (1)
src/app/endpoints/streaming_query.py (1)
934-942: Consider using QuotaExceededResponse for consistency.

The exception handler manually constructs the detail dictionary, but the codebase has introduced `QuotaExceededResponse` specifically for this purpose. While the current implementation works, using the response model would ensure consistency with the OpenAPI schema and make the response message match the documented examples.

Consider this alternative approach:
```diff
 except RateLimitError as e:
     used_model = getattr(e, "model", "unknown")
+    # Note: QuotaExceededResponse expects user_id but we're handling model quotas
+    # The manual detail construction is acceptable here for model-level quotas
     raise HTTPException(
         status_code=status.HTTP_429_TOO_MANY_REQUESTS,
         detail={
             "response": "Model quota exceeded",
             "cause": f"The token quota for model {used_model} has been exceeded.",
         },
     ) from e
```

Alternatively, if you want to use the response model consistently, you could extend `QuotaExceededResponse` with a factory method as mentioned in past review comments (see src/models/responses.py).
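A runnable sketch of the factory approach discussed above, using stand-in dataclasses since the real pydantic `DetailModel`/`QuotaExceededResponse` definitions aren't shown in this thread (`dump_detail` is likewise an assumed helper name):

```python
from dataclasses import dataclass


@dataclass
class DetailModel:
    """Stand-in for the real pydantic detail model."""

    response: str
    cause: str


@dataclass
class QuotaExceededResponse:
    """Simplified stand-in for the real response model."""

    detail: DetailModel

    @classmethod
    def for_model(cls, model_name: str) -> "QuotaExceededResponse":
        """Build a model-level quota error with one canonical phrase."""
        return cls(
            detail=DetailModel(
                response="Model quota exceeded",
                cause=f"The token quota for model {model_name} has been exceeded.",
            )
        )

    def dump_detail(self) -> dict:
        """Shape the detail for HTTPException(detail=...)."""
        return {"response": self.detail.response, "cause": self.detail.cause}
```

Centralizing the phrasing in one factory means endpoints, tests, and OpenAPI examples all pull from the same string, which is the consistency the review keeps asking for.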
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (9)
- docs/openapi.json (15 hunks)
- pyproject.toml (1 hunks)
- src/app/endpoints/query.py (4 hunks)
- src/app/endpoints/query_v2.py (2 hunks)
- src/app/endpoints/streaming_query.py (4 hunks)
- src/models/responses.py (1 hunks)
- tests/unit/app/endpoints/test_query.py (2 hunks)
- tests/unit/app/endpoints/test_query_v2.py (3 hunks)
- tests/unit/app/endpoints/test_streaming_query.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- tests/unit/app/endpoints/test_query.py
- src/app/endpoints/query_v2.py
- src/app/endpoints/query.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-18T10:58:14.951Z
Learnt from: matysek
Repo: lightspeed-core/lightspeed-stack PR: 292
File: pyproject.toml:47-47
Timestamp: 2025-08-18T10:58:14.951Z
Learning: psycopg2-binary is required by some llama-stack providers in the lightspeed-stack project, so it cannot be replaced with psycopg v3 or moved to optional dependencies without breaking llama-stack functionality.
Applied to files:
pyproject.toml
🧬 Code graph analysis (3)
src/app/endpoints/streaming_query.py (1)
- src/models/responses.py (1): QuotaExceededResponse (1145-1210)

tests/unit/app/endpoints/test_streaming_query.py (3)
- tests/unit/app/endpoints/test_query.py (1): test_query_endpoint_quota_exceeded (2268-2307)
- tests/unit/app/endpoints/test_query_v2.py (1): test_query_endpoint_quota_exceeded (447-479)
- src/app/endpoints/streaming_query.py (1): streaming_query_endpoint_handler (709-957)

tests/unit/app/endpoints/test_query_v2.py (3)
- tests/unit/app/endpoints/test_query.py (2): test_query_endpoint_quota_exceeded (2268-2307), dummy_request (60-69)
- src/models/requests.py (1): QueryRequest (73-225)
- src/app/endpoints/query_v2.py (1): query_endpoint_handler_v2 (284-306)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-pr
- GitHub Check: e2e_tests (azure)
- GitHub Check: e2e_tests (ci)
🔇 Additional comments (11)
tests/unit/app/endpoints/test_streaming_query.py (2)
9-9: LGTM!

The import of `RateLimitError` from `litellm.exceptions` is correctly placed and necessary for the quota-exceeded test scenarios.
1799-1841: Test correctly matches implementation — no issues found.

The verification confirms the test structure and mocks accurately reflect the actual code path:
- RateLimitError is properly caught at line 934 of `streaming_query.py`
- The model is extracted via `getattr(e, "model", "unknown")` and will correctly retrieve "gpt-4-turbo" from the mock
- The HTTPException detail structure matches the test assertions exactly
- All assertions (`status_code`, `detail["response"]`, and `detail["cause"]` containing "gpt-4-turbo") will pass correctly

src/app/endpoints/streaming_query.py (3)
11-11: LGTM!

The import of `RateLimitError` is correctly placed and necessary for quota exception handling.

52-56: LGTM!

The imports for response models are correctly structured and include the new `QuotaExceededResponse` for quota handling.

112-115: LGTM!

The 429 response mapping is correctly added with appropriate description and model reference.
tests/unit/app/endpoints/test_query_v2.py (2)
5-5: LGTM!

The import of `RateLimitError` from `litellm.exceptions` is correctly placed and necessary for the quota-exceeded test.

22-28: LGTM!

The `MOCK_AUTH` constant is well-defined and properly uses a valid UUID format for the user ID.
1232-1318: New v2/query endpoint documentation is complete and properly structured.

The /v2/query endpoint is well-documented with operation ID, proper request/response schemas, comprehensive error responses (400/403/429/500/422), and example payloads. The endpoint mirrors v1/query with the 429 quota-exceeded response properly integrated.
693-722: Added authentication error responses (400/401/403) to PUT /v1/feedback/status.

The PUT endpoint now documents 400 (missing/invalid credentials), 401 (invalid bearer token), and 403 (permission denied) error responses. Verify these error cases are actually caught and handled in the implementation, or clarify why feedback status updates require these authentication checks when the GET endpoint (line 651) does not.
1576-1643: Consolidated authentication/authorization configuration schemas verified as correctly implemented.

Python model definitions in `src/models/config.py` use the consolidated schema names (`AuthenticationConfiguration`, `AuthorizationConfiguration`, `JwtConfiguration`, `JwkConfiguration`) without any lingering `-Input` or `-Output` variants. All references throughout the codebase consistently use these consolidated names. The OpenAPI schema and Python models are aligned.
378-387: Quota-exceeded error handling is properly implemented across all three query endpoints. Verification confirms:

- v1/query (query.py): directly catches `RateLimitError` and raises HTTP 429 ✓
- v1/streaming_query (streaming_query.py): directly catches `RateLimitError` and raises HTTP 429 ✓
- v2/query (query_v2.py): delegates to `query_endpoint_handler_base` (query.py), which catches `RateLimitError` and raises HTTP 429 ✓

All three endpoints consistently map quota errors to HTTP 429 responses with the proper `QuotaExceededResponse` schema, matching the OpenAPI documentation. Unit tests exist for all three endpoints validating the 429 response.
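The error-mapping pattern the three endpoints share can be sketched framework-agnostically. This is a minimal stand-in, not the project's code: the `RateLimitError` class below only approximates litellm's exception (the real one takes more arguments), and the handler returns a `(status, payload)` tuple instead of raising FastAPI's `HTTPException`:

```python
# Stand-in for litellm.exceptions.RateLimitError; the real class takes more
# arguments, so this shape is only an assumption for illustration.
class RateLimitError(Exception):
    def __init__(self, model: str, llm_provider: str, message: str = "") -> None:
        super().__init__(message)
        self.model = model
        self.llm_provider = llm_provider


HTTP_429_TOO_MANY_REQUESTS = 429


def handle_quota_error(exc: RateLimitError) -> tuple[int, dict]:
    """Map a provider rate-limit error to the 429 payload the endpoints return."""
    # Fall back to "unknown" if the exception carries no model attribute
    used_model = getattr(exc, "model", "unknown")
    return (
        HTTP_429_TOO_MANY_REQUESTS,
        {
            "response": "Model quota exceeded",
            "cause": f"The token quota for model {used_model} has been exceeded.",
        },
    )


status_code, detail = handle_quota_error(
    RateLimitError(model="gpt-4-turbo", llm_provider="openai")
)
```

The same assertions the unit tests make (status code, `response` text, model name in `cause`) hold against this sketch.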
Force-pushed: 258c023 → ba36edc
Force-pushed: ba36edc → 3ed6095
Actionable comments posted: 1
♻️ Duplicate comments (1)
tests/unit/app/endpoints/test_query_v2.py (1)
446-479: Fix v2 quota-exceeded test: missing patches and arg cause false failures. The test omits required setup used by the passing v2 test (lines 368-405): configuration, quotas, `models.list`, `evaluate_model_hints`, `request.state.authorized_actions`, and `get_session`. It also doesn't pass `mcp_headers`, so `Depends` won't resolve in direct calls.
Apply this diff:
```diff
 @pytest.mark.asyncio
 async def test_query_endpoint_quota_exceeded(
     mocker: MockerFixture, dummy_request: Request
 ) -> None:
     """Test that query endpoint raises HTTP 429 when model quota is exceeded."""
+    # Mock configuration (avoid "configuration not loaded")
+    mock_config = mocker.Mock()
+    mock_config.llama_stack_configuration = mocker.Mock()
+    mock_config.quota_limiters = []
+    mocker.patch("app.endpoints.query_v2.configuration", mock_config)
+
     query_request = QueryRequest(
         query="What is OpenStack?",
         provider="openai",
         model="gpt-4-turbo",
-        attachments=[],
     )  # type: ignore

     mock_client = mocker.AsyncMock()
+    # models.list is awaited by the handler before select_model_and_provider_id
+    mock_client.models.list = mocker.AsyncMock(return_value=[mocker.Mock()])
     mock_client.responses.create.side_effect = RateLimitError(
         model="gpt-4-turbo", llm_provider="openai", message=""
     )

+    # Keep model/provider selection deterministic
+    mocker.patch("app.endpoints.query.evaluate_model_hints", return_value=(None, None))
     mocker.patch(
         "app.endpoints.query.select_model_and_provider_id",
         return_value=("openai/gpt-4-turbo", "gpt-4-turbo", "openai"),
     )
     mocker.patch("app.endpoints.query.validate_model_provider_override")
     mocker.patch(
         "client.AsyncLlamaStackClientHolder.get_client",
         return_value=mock_client,
     )
+    # Quota and DB dependencies used by the base handler
+    mocker.patch("utils.quota.check_tokens_available")
+    mocker.patch("utils.quota.consume_tokens")
+    mocker.patch("utils.quota.get_available_quotas", return_value={})
+    mocker.patch("app.endpoints.query.get_session")
+
+    # Request state required by validate_model_provider_override
+    dummy_request.state.authorized_actions = []

     with pytest.raises(HTTPException) as exc_info:
         await query_endpoint_handler_v2(
-            dummy_request, query_request=query_request, auth=MOCK_AUTH
+            dummy_request,
+            query_request=query_request,
+            auth=MOCK_AUTH,
+            mcp_headers={},  # bypass Depends in direct call
         )

     assert exc_info.value.status_code == status.HTTP_429_TOO_MANY_REQUESTS
     detail = exc_info.value.detail
     assert isinstance(detail, dict)
     assert detail["response"] == "Model quota exceeded"  # type: ignore
     assert "gpt-4-turbo" in detail["cause"]  # type: ignore
```
🧹 Nitpick comments (4)
tests/unit/app/endpoints/test_query_v2.py (1)
22-28: Local `MOCK_AUTH` is fine; consider a shared fixture to avoid duplication. If multiple test modules define `MOCK_AUTH`, move it to a conftest.py fixture/constants module.
src/app/endpoints/streaming_query.py (2)
112-115: 429 response wired into OpenAPI. Good addition. Consider aligning the "response" message with runtime text ("Model quota exceeded" vs "The quota has been exceeded") for consistency across endpoints and docs.
934-942: Improve 429 handling: add metric and optional Retry-After header. Increment the failure metric on rate limits and include Retry-After when available from the exception. Keep the payload unchanged to satisfy existing tests.
```diff
-    except RateLimitError as e:
-        used_model = getattr(e, "model", "unknown")
-        raise HTTPException(
-            status_code=status.HTTP_429_TOO_MANY_REQUESTS,
-            detail={
-                "response": "Model quota exceeded",
-                "cause": f"The token quota for model {used_model} has been exceeded.",
-            },
-        ) from e
+    except RateLimitError as e:
+        used_model = getattr(e, "model", "unknown")
+        # Record as failure for observability
+        metrics.llm_calls_failures_total.inc()
+        # Some providers expose retry interval; attach when present
+        retry_after = getattr(e, "retry_after", None)
+        headers = {"Retry-After": str(retry_after)} if retry_after else None
+        raise HTTPException(
+            status_code=status.HTTP_429_TOO_MANY_REQUESTS,
+            detail={
+                "response": "Model quota exceeded",
+                "cause": f"The token quota for model {used_model} has been exceeded.",
+            },
+            headers=headers,
+        ) from e
```

docs/openapi.json (1)
3610-3666: `QuotaExceededResponse` schema is clear; unify example phrasing. Examples mix "The quota has been exceeded" and "The model quota has been exceeded". Pick one canonical phrasing or document both intentionally, then align handlers/tests.
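For illustration, the shape of the response model can be sketched as a plain dataclass. The field names follow the PR walkthrough ("build QuotaExceededResponse(user_id, model_name, limit)"); the real schema in `src/models/responses.py` is a Pydantic model, so this is only a stand-in, using one canonical phrasing for the `cause` text:

```python
from dataclasses import dataclass


# Illustrative stand-in only: field names are assumed from the PR walkthrough;
# the real QuotaExceededResponse is a Pydantic model with its own validation.
@dataclass
class QuotaExceededResponse:
    user_id: str
    model_name: str
    limit: int

    def detail(self) -> dict:
        # One canonical phrasing, applied consistently across endpoints
        return {
            "response": "Model quota exceeded",
            "cause": f"The token quota for model {self.model_name} has been exceeded.",
        }


resp = QuotaExceededResponse(user_id="u-123", model_name="gpt-4-turbo", limit=1000)
```

Keeping the message template in one place, as the `detail()` method does here, makes it harder for docs, handlers, and tests to drift apart.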
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (9)
- docs/openapi.json (4 hunks)
- pyproject.toml (1 hunks)
- src/app/endpoints/query.py (4 hunks)
- src/app/endpoints/query_v2.py (2 hunks)
- src/app/endpoints/streaming_query.py (4 hunks)
- src/models/responses.py (1 hunks)
- tests/unit/app/endpoints/test_query.py (2 hunks)
- tests/unit/app/endpoints/test_query_v2.py (3 hunks)
- tests/unit/app/endpoints/test_streaming_query.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
- tests/unit/app/endpoints/test_query.py
- src/app/endpoints/query.py
- src/models/responses.py
- tests/unit/app/endpoints/test_streaming_query.py
- src/app/endpoints/query_v2.py
- pyproject.toml
🧰 Additional context used
🧬 Code graph analysis (2)
src/app/endpoints/streaming_query.py (1)
src/models/responses.py (1)
`QuotaExceededResponse` (1145-1210)
tests/unit/app/endpoints/test_query_v2.py (4)
tests/unit/app/endpoints/test_query.py (2)
- `test_query_endpoint_quota_exceeded` (2268-2307)
- `dummy_request` (60-69)

tests/unit/app/endpoints/test_streaming_query.py (1)
- `test_query_endpoint_quota_exceeded` (1800-1839)

src/models/requests.py (1)
- `QueryRequest` (73-225)

src/app/endpoints/query_v2.py (1)
- `query_endpoint_handler_v2` (284-306)
🔇 Additional comments (3)
tests/unit/app/endpoints/test_query_v2.py (1)
5-5: Import looks good. Using `litellm.exceptions.RateLimitError` for simulation is correct.
src/app/endpoints/streaming_query.py (2)
11-11: LGTM on importing `RateLimitError`. Matches the new 429 mapping below.
52-56: OpenAPI model import OK. Importing `QuotaExceededResponse` for the responses mapping is correct.
| "429": { | ||
| "description": "The quota has been exceeded", | ||
| "content": { | ||
| "application/json": { | ||
| "schema": { | ||
| "$ref": "#/components/schemas/QuotaExceededResponse" | ||
| } | ||
| } | ||
| } | ||
| }, |
🧩 Analysis chain
429 responses added across endpoints
Good. Ensure the documented “description”/examples match the actual response.detail.response used at runtime (“Model quota exceeded” in code/tests vs “The quota has been exceeded” here).
Run this to check for message mismatches:
Also applies to: 474-483, 1279-1288
🏁 Script executed:
```shell
#!/bin/bash
# Find runtime 429 messages vs docs wording
rg -n -C1 -S "Model quota exceeded|The quota has been exceeded" --type py --type json
```

Length of output: 6321
429 response messages in docs/openapi.json do not match actual runtime behavior
The OpenAPI documentation says "The quota has been exceeded", but the actual endpoints return "Model quota exceeded". Update the description and examples in docs/openapi.json at lines 378-387, 474-483, and 1279-1288 to use "Model quota exceeded" to match the actual response sent by src/app/endpoints/streaming_query.py:939 and src/app/endpoints/query.py:435, as confirmed by tests.
🤖 Prompt for AI Agents
In docs/openapi.json around lines 378-387, 474-483, and 1279-1288, the 429
response descriptions and examples currently read "The quota has been exceeded"
but must match runtime behavior which returns "Model quota exceeded"; update the
description fields and any example values at those line ranges to use the exact
text "Model quota exceeded" (and adjust any surrounding punctuation/formatting
to remain valid JSON), ensuring the schema references remain unchanged.
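A quick stdlib pass over the OpenAPI document can surface such mismatches mechanically. The tiny inline document below is hypothetical; it only mirrors the layout quoted in the diff above, and the expected message follows the runtime behavior confirmed by the tests:

```python
import json

# Hypothetical lint pass: flag 429 responses whose description differs from
# the runtime message. The inline document mirrors the OpenAPI layout above.
doc = json.loads("""
{
  "paths": {
    "/v1/query": {
      "post": {
        "responses": {
          "429": {"description": "The quota has been exceeded"}
        }
      }
    }
  }
}
""")

EXPECTED = "Model quota exceeded"

# Collect (path, status-code) pairs whose 429 description is out of sync
mismatches = [
    (path, code)
    for path, ops in doc["paths"].items()
    for op in ops.values()
    for code, resp in op["responses"].items()
    if code == "429" and resp.get("description") != EXPECTED
]
```

Running a check like this in CI would catch docs/runtime drift before it ships.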
tisnik left a comment
LGTM, sounds ok
/ok-to-test
Description
This PR adds exception handling for LLM quota limit exceedances and introduces a unified custom response model for all quota-related errors, including model-level and internal quota exceedances.
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit
New Features
Documentation
Tests