refactor: remove dead inference API code and clean up imports #4093
Conversation
Force-pushed from eb81355 to e919b53
These are relics of the older inference API... who still needs these? I remember the eval API still using SamplingParams -- maybe that's the only reference? If that's the case, could we migrate that eval API to a more OpenAI-compatible format? It is marked
@ashwinb, you're right, most of these are actually being used in "dead" types. Let me see how much dead code I can remove and if it makes this functionally a no-op. Eval might be a tricky case but let me see what I can do.
Force-pushed from 4b1d0bb to a5d912a
@ashwinb let me know if this makes sense! It kills two birds with one stone: no more
There were also a bunch of duplicate classes in
ashwinb
left a comment
yeah this seems correct to do
src/llama_stack/providers/inline/inference/meta_reference/model_parallel.py (resolved)
src/llama_stack/providers/utils/inference/litellm_openai_mixin.py (resolved)
Delete ~1,300 lines of dead code from the old bespoke inference API that was replaced by the OpenAI-only API. This includes removing unused type conversion functions, dead provider methods, and event_logger.py. Clean up imports across the codebase to remove references to deleted types. This eliminates unnecessary code and dependencies, helping isolate the API package as a self-contained module. This is the last interdependency between the .api package and "exterior" packages, meaning that now every other package in llama stack imports the API, not the other way around. The API is now self-contained and can be moved into its own package.

Signed-off-by: Charlie Doern <[email protected]>
- Add chat_completion() method to LlamaGenerator supporting OpenAI request format
- Implement openai_chat_completion() in MetaReferenceInferenceImpl
- Fix ModelRunner task dispatch to handle chat_completion tasks
- Add convert_openai_message_to_raw_message() utility for message conversion
- Add unit tests for message conversion and model-parallel dispatch
- Remove unused CompletionRequestWithRawContent references

Signed-off-by: Charlie Doern <[email protected]>
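The convert_openai_message_to_raw_message() utility mentioned in the commit above is not shown in this thread. As a rough illustration only, a conversion along these lines might map OpenAI-style chat messages onto an internal raw-message structure; the RawMessage dataclass and its fields below are simplified placeholders, not the actual llama-stack types.

```python
from dataclasses import dataclass, field

# Simplified stand-in for the provider-internal raw message type; the real
# llama-stack type and its fields may differ.
@dataclass
class RawMessage:
    role: str
    content: str
    stop_reason: str | None = None
    tool_calls: list = field(default_factory=list)


def convert_openai_message_to_raw_message(message: dict) -> RawMessage:
    """Map an OpenAI-style chat message dict onto the internal raw format.

    Assumes a message shaped like {"role": "user", "content": "..."} or one
    whose content is a list of parts such as {"type": "text", "text": "..."}.
    """
    content = message.get("content") or ""
    if isinstance(content, list):
        # Flatten structured content parts into a single text string.
        content = "".join(part.get("text", "") for part in content if part.get("type") == "text")
    return RawMessage(role=message["role"], content=content)


# Example: convert_openai_message_to_raw_message({"role": "user", "content": "hi"})
```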
CI is hanging; trying to see if it is a flake. Tests pass locally with record and replay.
# Get logits processor for response format
logits_processor = None
if request.response_format:
request.response_format is a concrete OpenAIResponseFormatParam, not a dict, so this branch never runs and JSON schema requests silently fall back to unconstrained sampling. Could we inspect the actual variant (e.g., OpenAIResponseFormatJSONSchema) and pass its schema through to JsonSchemaResponseFormat like before?
Yes, you're right, sorry about that. Fixing this up now.
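For illustration, the fix being discussed might extract the schema from the concrete response_format variant along these lines. This is a hypothetical sketch that mirrors the OpenAI response_format wire shape ({"type": "json_schema", "json_schema": {"name": ..., "schema": {...}}}), not the actual generators.py change; the extracted schema would then be wrapped in JsonSchemaResponseFormat as the comment above suggests.

```python
def extract_json_schema(response_format) -> dict | None:
    """Return the user-supplied JSON schema from an OpenAI-style response_format.

    Sketch only: assumes the JSON-schema variant exposes type == "json_schema"
    and a json_schema mapping containing a "schema" entry, mirroring the OpenAI
    API shape. Other variants (e.g. plain "text") return None, so the caller
    falls back to unconstrained sampling.
    """
    if response_format is None:
        return None
    if getattr(response_format, "type", None) == "json_schema":
        json_schema = getattr(response_format, "json_schema", None) or {}
        # json_schema may be a TypedDict/dict with the actual schema under "schema".
        if isinstance(json_schema, dict):
            return json_schema.get("schema")
    return None
```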
# Prepare sampling params
sampling_params = SamplingParams()
if request.temperature is not None or request.top_p is not None:
    sampling_params.strategy = TopPSamplingStrategy(
temperature=request.temperature or 1.0 (and the same for top_p) treats 0.0 as “unset”, so callers can’t ask for deterministic decoding—the code always bumps it back to 1. Likewise, when both fields are omitted we never inherit OpenAI’s defaults. Could we distinguish None vs real numbers so 0.0 and other values survive untouched?
Yes, this is a great point. Thanks!
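A minimal sketch of the None-aware handling being suggested, reusing the SamplingParams and TopPSamplingStrategy names from the snippet above and assuming a GreedySamplingStrategy counterpart exists; the 1.0 defaults mirror the OpenAI API, and the greedy branch for an explicit 0.0 temperature is one possible interpretation rather than the actual fix.

```python
# Sketch only: distinguish "not provided" (None) from explicit values such as
# 0.0, so requests for deterministic decoding survive untouched.
sampling_params = SamplingParams()
if request.temperature is not None or request.top_p is not None:
    temperature = request.temperature if request.temperature is not None else 1.0
    top_p = request.top_p if request.top_p is not None else 1.0
    if temperature == 0.0:
        # An explicit temperature of 0.0 means greedy/deterministic decoding.
        sampling_params.strategy = GreedySamplingStrategy()
    else:
        sampling_params.strategy = TopPSamplingStrategy(temperature=temperature, top_p=top_p)
# When both fields are None, the SamplingParams() defaults are kept instead of forcing 1.0.
```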
created=created,
model=params.model,
choices=[
    OpenAIChoice(
openai_chat_completion always runs exactly one generation, flattens it into a single string, and returns one choice with a hard-coded finish_reason. This means streaming responses and tool-call outputs silently stop working after this refactor.
I can fix this! However, I believe the meta_reference provider was not working before this PR: all methods in inference.py raised NotImplementedError. @leseb raised a similar topic above.
ashwinb
left a comment
See inline comments
This commit addresses comments regarding the OpenAI chat completion implementation in the meta_reference provider.
Tool Augmentation
- Add `augment_raw_messages_for_tools()` to properly inject tool definitions into prompts
- Support model-family-specific tool formats:
* Llama 3.1/3.2 multimodal: JsonCustomToolGenerator with JSON format
* Llama 3.2/3.3/4: PythonListCustomToolGenerator with Python list format
- Handle tool_choice hints (auto/required/specific tool)
- Preserve existing system messages while adding tool context
Streaming & Tool Call Detection
- Implement streaming support via `params.stream` with `_stream_chat_completion()`
- Add tool call detection by decoding assistant messages after generation
- Set proper `finish_reason` based on content ("stop" vs "tool_calls")
- Convert internal ToolCall format to OpenAI-compatible types
- Stream chunks incrementally with proper delta formatting
Type Corrections
- Fix response_format handling in generators.py to properly extract schema from
OpenAIJSONSchema TypedDict and use correct ResponseFormatType enum
- Use correct OpenAI types: OpenAIChatCompletionToolCall, OpenAIChunkChoice,
OpenAIChoiceDelta, OpenAIChatCompletionToolCallFunction
Signed-off-by: Charlie Doern <[email protected]>
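As a rough illustration of the finish_reason selection and streaming behaviour described above (a simplified sketch, not the provider's actual code; chunks are shown as plain dicts in the OpenAI wire format rather than the OpenAIChunkChoice/OpenAIChoiceDelta types the commit uses):

```python
def pick_finish_reason(tool_calls: list | None) -> str:
    """Map decoded generation output to an OpenAI-compatible finish_reason."""
    return "tool_calls" if tool_calls else "stop"


def stream_chunks(text_pieces: list[str], model: str, completion_id: str):
    """Yield OpenAI-style streaming chunks for incrementally generated text.

    Sketch only: each piece becomes a content delta, and a final chunk carries
    the finish_reason with an empty delta, matching the OpenAI wire format.
    """
    for piece in text_pieces:
        yield {
            "id": completion_id,
            "object": "chat.completion.chunk",
            "model": model,
            "choices": [{"index": 0, "delta": {"content": piece}, "finish_reason": None}],
        }
    yield {
        "id": completion_id,
        "object": "chat.completion.chunk",
        "model": model,
        "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}],
    }
```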
I believe the most recent commit adds a significant amount of functionality to the meta_reference inference provider. Previously this provider was not implemented from what I can tell: all methods raised NotImplementedError. So, the functionality in the last two commits is net-new, but it satisfies reviewer comments and improves the general usability of the provider as a whole. I would be wary of adding more functionality in this PR, though, since I believe it goes out of scope. Please let me know if that makes sense @ashwinb @leseb! Thanks
@cdoern that is completely fair! Would you mind filing an issue to say "ensure meta-reference works" or something to that effect?
Also, I don't know why the server GPT tests keep hanging, but this is something I have observed on other PRs I have open as well.
I think the conversations tests > sqlite store is the root cause. There are race conditions there that we need to resolve carefully.
@cdoern can you try running the failing test suite locally once?
Yup @ashwinb, here are the results (all green):
…3895)

# What does this PR do?
Extract API definitions and provider specifications into a standalone llama-stack-api package that can be published to PyPI independently of the main llama-stack server.

see: #2978 and #2978 (comment)

Motivation

External providers currently import from llama-stack, which overrides the installed version and causes dependency conflicts. This separation allows external providers to:
- Install only the type definitions they need without server dependencies
- Avoid version conflicts with the installed llama-stack package
- Be versioned and released independently

This enables us to re-enable external provider module tests that were previously blocked by these import conflicts.

Changes
- Created llama-stack-api package with minimal dependencies (pydantic, jsonschema)
- Moved APIs, providers datatypes, strong_typing, and schema_utils
- Updated all imports from llama_stack.* to llama_stack_api.*
- Configured local editable install for development workflow
- Updated linting and type-checking configuration for both packages

Next Steps
- Publish llama-stack-api to PyPI
- Update external provider dependencies
- Re-enable external provider module tests

Pre-cursor PRs to this one:
- #4093
- #3954
- #4064

These PRs moved key pieces _out_ of the Api pkg, limiting the scope of change here.

relates to #3237

## Test Plan
Package builds successfully and can be imported independently. All pre-commit hooks pass with expected exclusions maintained.

---------

Signed-off-by: Charlie Doern <[email protected]>
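For example, an external provider that previously pulled type definitions from the server package would switch to the standalone package. The submodule path below is illustrative only; the PR itself only establishes the top-level rename from llama_stack.* to llama_stack_api.*.

```python
# Before (illustrative path): types were imported from the llama-stack server package.
# from llama_stack.apis.inference import OpenAIChatCompletion

# After: the same definitions come from the standalone llama-stack-api package.
# from llama_stack_api.inference import OpenAIChatCompletion
```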
What does this PR do?
Delete ~2,000 lines of dead code from the old bespoke inference API that was replaced by the OpenAI-only API. This includes removing unused type conversion functions, dead provider methods, and event_logger.py.
Clean up imports across the codebase to remove references to deleted types. This eliminates unnecessary code and dependencies, helping isolate the API package as a self-contained module.
This is the last interdependency between the .api package and "exterior" packages, meaning that now every other package in llama stack imports the API, not the other way around.
Test Plan
This is a structural change; no tests needed.