
feat: Ensemble async callback execution (rework) #438

Open
wants to merge 2 commits into main

Conversation

@yinggeh yinggeh (Contributor) commented May 14, 2025

What does the PR do?

Reduce e2e latency in the ensemble model by executing callbacks asynchronously at the end of each ensemble step. Models that require responses to be returned in the same order as their requests are excluded.

Improvement: the maximum throughput of the sample ensemble model increased from 39k infer/sec to 50k infer/sec.
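In outline, the change amounts to the pattern sketched below: at the end of an ensemble step, the completion callback is either handed to a shared worker pool or, when the composing model must see responses in request order, run synchronously. The names CallbackPool and ScheduleStepCallback are hypothetical stand-ins for illustration, not the server's actual interfaces.

#include <functional>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for the server's shared worker pool; simplified to
// one thread per task purely for illustration.
struct CallbackPool {
  void Enqueue(std::function<void()> fn) { workers_.emplace_back(std::move(fn)); }
  ~CallbackPool()
  {
    for (auto& w : workers_) {
      w.join();
    }
  }
  std::vector<std::thread> workers_;
};

void
ScheduleStepCallback(
    CallbackPool* pool, bool preserve_responses_order,
    std::function<void()> callback)
{
  if (preserve_responses_order) {
    // Ordering matters: run inline so responses keep request order.
    callback();
  } else {
    // Ordering does not matter: offload the callback so the ensemble
    // scheduler can move on to the next step immediately.
    pool->Enqueue(std::move(callback));
  }
}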

Checklist

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated GitHub labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type box here and add the label to the GitHub PR.

  • feat

Related PRs:

triton-inference-server/common#133
Previous PR: #429

Where should the reviewer start?

The reviewer should start from the second commit.
Pay particular attention to the preserve_responses_order logic.

Test plan:

L0_simple_ensemble
L0_sequence_batcher
L0_backend_python

  • CI Pipeline ID:
    28454142

Caveats:

Background

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #7650

@yinggeh yinggeh self-assigned this May 14, 2025
@yinggeh yinggeh added the PR: feat A new feature label May 14, 2025
@yinggeh yinggeh requested review from tanmayv25, GuanLuo and ziqif-nv May 14, 2025 21:54
}

// Case 1: Sequence batching is enabled
// Case 2: Dynamic batching is disabled and there is only one instance group
Contributor

I don't understand why the order needs to be preserved in this case.

Contributor Author

In the gRPC streaming case, the client expects the response order to match the request order.

Contributor

I don't think the scheduler needs to care whether the request is received via gRPC streaming or not. That is an outer-layer requirement. The scheduler only cares whether it needs to preserve ordering at the model instance level (i.e. whether preserve ordering is set or sequence batching is used).

Contributor Author

Yes. gRPC streaming is just one example where the response order is guaranteed to match the request order.
In the single-model-instance case, we don't want to use asynchronous callbacks, which would violate the response order.

// Case 3: Dynamic batching is enabled and preserve_ordering is true
// Case 4: Model transaction policy is decoupled (breaks RequestTracker
// lifecycle)
// Note: Although decoupled models do not preserve the order of
Contributor

decoupled models "should preserve" the order of responses

Contributor Author

A decoupled model/backend may also send responses out-of-order relative to the order that the request batches are executed.

I found this from
https://github.com/triton-inference-server/server/blob/main/docs/user_guide/decoupled_models.md#decoupled-backends-and-models

Contributor

That is referring to responses across different requests within the same batch; the response order within a single request is still preserved. i.e. a batch can contain req1 and req2, and respond with req2res1, req1res1, req1res2, req2res2.
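To summarize this thread, the cases in which asynchronous callbacks are disabled could be expressed roughly as the predicate below. This is a hypothetical sketch with a flattened view of the relevant configuration fields, not the code in this PR.

#include <cstddef>

// Hypothetical, flattened view of the relevant model configuration fields;
// the real server reads these from the model config protobuf.
struct ModelOrderingInfo {
  bool sequence_batching_enabled;
  bool dynamic_batching_enabled;
  bool preserve_ordering;       // dynamic_batching { preserve_ordering }
  bool decoupled;               // model_transaction_policy { decoupled }
  size_t instance_group_count;
};

// Sketch of the decision only.
bool
RequiresOrderedResponses(const ModelOrderingInfo& m)
{
  // Case 1: sequence batching is enabled.
  if (m.sequence_batching_enabled) {
    return true;
  }
  // Case 2: dynamic batching is disabled and there is only one instance group.
  if (!m.dynamic_batching_enabled && (m.instance_group_count == 1)) {
    return true;
  }
  // Case 3: dynamic batching is enabled and preserve_ordering is true.
  if (m.dynamic_batching_enabled && m.preserve_ordering) {
    return true;
  }
  // Case 4: the model transaction policy is decoupled (breaks the
  // RequestTracker lifecycle).
  if (m.decoupled) {
    return true;
  }
  return false;
}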


// Attempt to enqueue the callback. If all workers are busy and the queue is at
// capacity, execute the callback immediately in the current thread.
if (pool->TaskQueueSize() < pool->Size()) {
Contributor

Is this correct? Size() returns the number of workers and TaskQueueSize() returns the number of "pending" tasks. You can consider the workers busy when TaskQueueSize() > 0, because pool->TaskQueueSize() == pool->Size() actually means the number of pending tasks equals the number of workers, right?

Contributor Author

Correct. But consider a case where the N busy workers are almost finished. Then, as long as TaskQueueSize <= N, the pending tasks can execute almost immediately. The maximum of N is 8.

In fact, I did compare if (pool->TaskQueueSize() == 0) vs. if (pool->TaskQueueSize() < pool->Size()), and the latter yielded higher throughput, indicating that on average a small wait is better than synchronous execution.
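For context, the check under discussion behaves roughly as sketched below. TaskQueueSize() and Size() follow the accessors shown in the diff; the Pool type and its Enqueue method are assumed names used only for this illustration.

#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical pool interface, declared only to make the sketch compile.
class Pool {
 public:
  size_t TaskQueueSize() const;              // number of pending tasks
  size_t Size() const;                       // number of worker threads
  void Enqueue(std::function<void()> task);  // hand a task to the workers
};

void
EnqueueOrRunInline(Pool* pool, std::function<void()> callback)
{
  if (pool->TaskQueueSize() < pool->Size()) {
    // Fewer pending tasks than workers: the task will be picked up almost
    // immediately, so offload it and let the scheduler continue.
    pool->Enqueue(std::move(callback));
  } else {
    // The backlog is already as deep as the worker count: run the callback
    // synchronously instead of queueing behind it.
    callback();
  }
}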

@yinggeh yinggeh requested a review from GuanLuo May 20, 2025 10:20
@GuanLuo GuanLuo (Contributor) left a comment

Approved to unblock; the review comments still need to be addressed.

Labels
PR: feat (A new feature)

2 participants