feat: Ensemble async callback execution (rework) #438

yinggeh · 2025-05-14T21:52:50Z

What does the PR do?

Reduce e2e latency in ensemble model by executing callbacks asynchronously at the end of each ensemble step. Excluding models that require responses to have the same order of requests.

Improvement: maximum throughput of sample ensemble model increased from 39k infer/sec to 50k infer/sec.

Checklist

Commit Type:

Check the conventional commit type
box here and add the label to the github PR.

feat

Related PRs:

triton-inference-server/common#133
triton-inference-server/server#8231
Previous PR: #429

Where should the reviewer start?

Reviewer should start from the second commit.
Attention to the preserve_responses_order logic.

Test plan:

L0_trace
L0_simple_ensemble
L0_sequence_batcher
L0_backend_python

CI Pipeline ID:
29344284

Caveats:

Background

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

closes GitHub issue: #7650

…" (#436)" This reverts commit 109c69f.

…nses order.

GuanLuo · 2025-05-20T09:57:16Z

src/ensemble_scheduler/ensemble_scheduler.cc

+  }
+
+  // Case 1: Sequence batching is enabled
+  // Case 2: Dynamic batching is disabled and there is only one instance group


I don't understand why the order needs to be preserved in this case.

In gRPC streaming case, the client would expect the responses order match the requests order.

I don’t think the scheduler needs to care whether the request is received via gRPC streaming or not. That is outer layer requirement. Scheduler only care whether itself needs to preserve ordering at model instance levels (i.e. whether preserve ordering is set / sequence batching is used)

Yes. GRPC stream is just one example where response order is guaranteed to be the same as request.
In case of one model instance, we don't want to use asynchronous callbacks, which violates the response order.

Tests in L0_sequence_batcher expects ensemble models with first step model (non-batching and single instance) to keep reponses same order as requests. If out-of-order, the second step model (sequence_batching) will not receive START flag as the first request, resulting test failure and even segfault in TensorRT backend TRITONBACKEND_ModelInstanceFinalize (didn't dig into the backtrace). Although non-batching and single instance case does not guarentee responses orderness, I will add this restriction to comply with existing tests (discussed offline).

If interested, model "sequence_onnx_nobatch_sequence_int32" and test case "test_backlog_same_correlation_id" combination is a good example.

GuanLuo · 2025-05-20T09:58:09Z

src/ensemble_scheduler/ensemble_scheduler.cc

+  // Case 3: Dynamic batching is enabled and preserve_ordering is true
+  // Case 4: Model transaction policy is decoupled (breaks RequestTracker
+  // lifecycle)
+  // Note: Although decoupled models do not preserve the order of


decoupled models "should preserve" the order of response

A decoupled model/backend may also send responses out-of-order relative to the order that the request batches are executed.

I found this from
https://github.com/triton-inference-server/server/blob/main/docs/user_guide/decoupled_models.md#decoupled-backends-and-models

That is referring to responses among different requests within the same batch, the responses of the same request is still preserved. i.e. batch can contain req1, req2, and respond in req2res1, req1res1, req1res2, req2res2

Updated comment

GuanLuo · 2025-05-20T10:04:01Z

src/ensemble_scheduler/ensemble_scheduler.cc

+
+  // Attempt to enqueue the callback. If all workers are busy and queue is at
+  // capacity, execute the callback immediately in current thread.
+  if (pool->TaskQueueSize() < pool->Size()) {


Is this correct? Size() returns the number of worker and TaskQueueSize() returns the number of "pending" task. You can consider the workers are busy when TaskQueueSize() > 0, because pool->TaskQueueSize() == pool->Size() actually means the # of pending requests equals to the # of workers, right?

Correct. But consider a case where N busy workers are almost finishing. Then as long as TaskQueueSize <= N, the pending tasks can execute almost immediately. The maximum of N is 8.

In fact, I did compare if (pool->TaskQueueSize() == 0) vs if (pool->TaskQueueSize() < pool->Size()), and the latter one yielded higher throughput, indicating a small wait time is better than synchronus execution on average.

GuanLuo

Approved to unblock, need to fix comments

GuanLuo · 2025-05-26T09:52:03Z

src/ensemble_scheduler/ensemble_scheduler.cc

+  }
+
+  // Case 1: Sequence batching is enabled
+  // Case 2: Dynamic batching is disabled and there is only one instance group


I don’t think the scheduler needs to care whether the request is received via gRPC streaming or not. That is outer layer requirement. Scheduler only care whether itself needs to preserve ordering at model instance levels (i.e. whether preserve ordering is set / sequence batching is used)

GuanLuo · 2025-05-26T09:55:19Z

src/ensemble_scheduler/ensemble_scheduler.cc

+  // Case 3: Dynamic batching is enabled and preserve_ordering is true
+  // Case 4: Model transaction policy is decoupled (breaks RequestTracker
+  // lifecycle)
+  // Note: Although decoupled models do not preserve the order of


That is referring to responses among different requests within the same batch, the responses of the same request is still preserved. i.e. batch can contain req1, req2, and respond in req2res1, req1res1, req1res2, req2res2

…into yinggeh-DLIS-8328-ensemble-async-callbacks-rework

yinggeh added 2 commits May 14, 2025 05:37

Revert "Revert "feat: Ensemble asynchronous callback executions (#429)…

c01d4cc

…" (#436)" This reverts commit 109c69f.

Execute callbacks synchronously for models required to preserve respo…

439826a

…nses order.

yinggeh self-assigned this May 14, 2025

yinggeh added the PR: feat A new feature label May 14, 2025

yinggeh requested review from tanmayv25, GuanLuo and ziqif-nv May 14, 2025 21:54

GuanLuo reviewed May 20, 2025

View reviewed changes

yinggeh requested a review from GuanLuo May 20, 2025 10:20

GuanLuo previously approved these changes May 26, 2025

View reviewed changes

Update comment

d2d45cb

yinggeh dismissed GuanLuo’s stale review via d2d45cb May 29, 2025 19:43

yinggeh requested a review from GuanLuo May 29, 2025 19:44

GuanLuo approved these changes May 30, 2025

View reviewed changes

Merge branch 'main' of https://github.com/triton-inference-server/core …

385e06e

…into yinggeh-DLIS-8328-ensemble-async-callbacks-rework

yinggeh mentioned this pull request Jun 3, 2025

test: Ensemble async callback execution (rework) triton-inference-server/server#8231

Merged

11 tasks

yinggeh merged commit b354d4d into main Jun 3, 2025
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: Ensemble async callback execution (rework) #438

feat: Ensemble async callback execution (rework) #438

Uh oh!

yinggeh commented May 14, 2025 •

edited

Loading

Uh oh!

GuanLuo May 20, 2025

Uh oh!

yinggeh May 20, 2025

Uh oh!

GuanLuo May 26, 2025

Uh oh!

yinggeh May 29, 2025

Uh oh!

yinggeh May 30, 2025 •

edited

Loading

Uh oh!

GuanLuo May 20, 2025

Uh oh!

yinggeh May 20, 2025

Uh oh!

GuanLuo May 26, 2025

Uh oh!

yinggeh May 29, 2025

Uh oh!

GuanLuo May 20, 2025

Uh oh!

yinggeh May 20, 2025

Uh oh!

GuanLuo left a comment

Uh oh!

GuanLuo May 26, 2025

Uh oh!

GuanLuo May 26, 2025

Uh oh!

Uh oh!

Uh oh!

feat: Ensemble async callback execution (rework) #438

feat: Ensemble async callback execution (rework) #438

Uh oh!

Conversation

yinggeh commented May 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does the PR do?

Checklist

Commit Type:

Related PRs:

Where should the reviewer start?

Test plan:

Caveats:

Background

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yinggeh May 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

GuanLuo left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

yinggeh commented May 14, 2025 •

edited

Loading

yinggeh May 30, 2025 •

edited

Loading