feat: Ensemble async callback execution (rework) #438
base: main
Conversation
}
// Case 1: Sequence batching is enabled
// Case 2: Dynamic batching is disabled and there is only one instance group
I don't understand why the order needs to be preserved in this case.
In the gRPC streaming case, the client expects the response order to match the request order.
// Case 3: Dynamic batching is enabled and preserve_ordering is true
// Case 4: Model transaction policy is decoupled (breaks RequestTracker
// lifecycle)
// Note: Although decoupled models do not preserve the order of
Decoupled models "should preserve" the order of responses.
A decoupled model/backend may also send responses out-of-order relative to the order that the request batches are executed. See
https://github.com/triton-inference-server/server/blob/main/docs/user_guide/decoupled_models.md#decoupled-backends-and-models
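To make the four cases above concrete, here is a minimal sketch of how they might collapse into a single predicate. The function name, parameters, and flags are hypothetical and only restate the conditions listed in the diff; the actual PR may derive them from the model config differently.

#include <cstddef>

// Hypothetical predicate; all names are illustrative, not the PR's actual code.
bool
ShouldPreserveResponsesOrder(
    bool sequence_batching_enabled, bool dynamic_batching_enabled,
    std::size_t instance_group_count, bool preserve_ordering, bool is_decoupled)
{
  // Case 1: Sequence batching ties responses to a sequence, so order matters.
  if (sequence_batching_enabled) {
    return true;
  }
  // Case 2: Without dynamic batching and with a single instance group,
  // requests execute one at a time and clients may rely on that ordering
  // (e.g. gRPC streaming expects response order to match request order).
  if (!dynamic_batching_enabled && (instance_group_count == 1)) {
    return true;
  }
  // Case 3: The model explicitly requested ordered responses.
  if (dynamic_batching_enabled && preserve_ordering) {
    return true;
  }
  // Case 4: Decoupled transaction policy; responses may arrive out of order
  // and complete outside the RequestTracker lifecycle, so keep the ordered path.
  if (is_decoupled) {
    return true;
  }
  return false;
}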
// Attempt to enqueue the callback. If all workers are busy and the queue is
// at capacity, execute the callback immediately in the current thread.
if (pool->TaskQueueSize() < pool->Size()) {
Is this correct? Size() returns the number of workers and TaskQueueSize() returns the number of "pending" tasks. You can consider the workers busy as soon as TaskQueueSize() > 0, because pool->TaskQueueSize() == pool->Size() actually means the number of pending requests equals the number of workers, right?
Correct. But consider the case where N busy workers are almost finished: as long as TaskQueueSize <= N, the pending tasks can execute almost immediately, and the maximum of N is 8. In fact, I compared if (pool->TaskQueueSize() == 0) vs. if (pool->TaskQueueSize() < pool->Size()), and the latter yielded higher throughput, indicating that a small wait is better than synchronous execution on average.
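For illustration, the heuristic discussed above reads roughly as below. Size() and TaskQueueSize() are the names from the diff; the ThreadPool stand-in and its Enqueue() method are assumptions and may not match the actual pool API used by the PR.

#include <cstddef>
#include <functional>
#include <utility>

// Minimal stand-in for the worker pool; only the members referenced in the
// discussion are modeled, and Enqueue() is a guess at the submit method.
class ThreadPool {
 public:
  std::size_t Size() const;           // number of worker threads (at most 8 here)
  std::size_t TaskQueueSize() const;  // number of tasks waiting for a worker
  void Enqueue(std::function<void()> task);
};

void
DispatchCallback(ThreadPool* pool, std::function<void()> callback)
{
  if (pool->TaskQueueSize() < pool->Size()) {
    // Fewer pending tasks than workers: each pending task waits behind at
    // most one nearly-finished task, so the expected queueing delay is small.
    pool->Enqueue(std::move(callback));
  } else {
    // Queue is at capacity relative to the worker count; fall back to
    // executing the callback synchronously in the current thread.
    callback();
  }
}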
What does the PR do?
Reduce end-to-end latency of ensemble models by executing callbacks asynchronously at the end of each ensemble step, excluding models that require responses to be returned in the same order as their requests.
Improvement: maximum throughput of the sample ensemble model increased from 39k infer/sec to 50k infer/sec.
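Putting the two pieces together, the dispatch at the end of an ensemble step could look roughly like the sketch below (reusing the ThreadPool stand-in and DispatchCallback from the earlier sketch); the function name and flag are hypothetical, not the PR's actual code.

void
OnEnsembleStepComplete(
    ThreadPool* pool, bool preserve_responses_order,
    std::function<void()> step_callback)
{
  if (preserve_responses_order) {
    // Order-sensitive models keep the synchronous path so responses are
    // completed in request order.
    step_callback();
  } else {
    // Otherwise hand the callback off so the scheduling thread can move on
    // to the next step without waiting for the callback to finish.
    DispatchCallback(pool, std::move(step_callback));
  }
}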
Checklist
<commit_type>: <Title>
Commit Type: Check the conventional commit type box here and add the label to the GitHub PR.
Related PRs:
triton-inference-server/common#133
Previous PR: #429
Where should the reviewer start?
Reviewer should start from the second commit.
Pay attention to the preserve_responses_order logic.
Test plan:
L0_simple_ensemble
L0_sequence_batcher
L0_backend_python
28454142
Caveats:
Background
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)