fix: Revert async execution of ensemble model ResponseComplete callback #435


Closed
yinggeh wants to merge 2 commits

Conversation

@yinggeh yinggeh commented Apr 29, 2025

What does the PR do?

Fixes L0_backend_python--base and L0_sequence_batcher--base.

The asynchronous ensemble callback feature was originally introduced in #429 to reduce the latency overhead added by the ensemble pipeline.

Asynchronous ResponseComplete did not work for decoupled models because the order of responses cannot be guaranteed. In rare cases, TRITONSERVER_RESPONSE_COMPLETE_FINAL is delivered before other responses, which causes a segfault.
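To see why the ordering matters, here is a minimal, self-contained sketch (illustrative only, not Triton code) of the hazard: once per-response callbacks are handed off to independent workers, nothing guarantees that the callback carrying the FINAL flag runs last, even if it was dispatched last.

```cpp
// Illustrative only: two "response callbacks" dispatched to separate
// threads. The FINAL callback can finish before response 0's callback,
// which is the reordering that tears down state still in use.
#include <atomic>
#include <cstdio>
#include <future>

int main() {
  for (int trial = 0; trial < 5; ++trial) {
    std::atomic<int> order{0};
    auto response0 = std::async(std::launch::async, [&] {
      volatile long work = 0;
      for (long i = 0; i < 100000; ++i) work += i;  // simulate callback work
      std::printf("trial %d: response 0 finished %d\n", trial, ++order);
    });
    auto final_response = std::async(std::launch::async, [&] {
      // In Triton this would carry TRITONSERVER_RESPONSE_COMPLETE_FINAL.
      std::printf("trial %d: FINAL finished %d\n", trial, ++order);
    });
    response0.wait();
    final_response.wait();
  }
  return 0;
}
```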

Even after disabling the async ResponseComplete call for decoupled models, one test case in L0_sequence_batcher still failed, so I ended up reverting the async ResponseComplete callback feature entirely. The async RequestComplete callback is working fine and is kept.

Checklist

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated github labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type box here and add the label to the GitHub PR.

  • fix

Related PRs:

Where should the reviewer start?

Test plan:

  • CI Pipeline ID: 27706423

Caveats:

Background

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

@yinggeh yinggeh added the bug Something isn't working label Apr 29, 2025
@yinggeh yinggeh self-assigned this Apr 29, 2025
```
@@ -62,7 +62,7 @@ constexpr char kPythonBackend[] = "python";

#ifdef TRITON_ENABLE_ENSEMBLE
constexpr char kEnsemblePlatform[] = "ensemble";
constexpr uint64_t ENSEMBLE_CB_POOL_SIZE = 8u;
```
yinggeh (Contributor Author) commented:
Decreased the thread pool size because there will be fewer callbacks running asynchronously.

A reviewer (Contributor) commented:

why?

A reviewer (Contributor) commented:

Do you even need this constant?

yinggeh (Contributor Author) commented Apr 29, 2025:

Yes. The RequestComplete callback still leverages the thread pool to run asynchronously, which benefits ensemble model throughput.

yinggeh (Contributor Author) commented:

```cpp
void
EnsembleContext::RequestComplete(
    TRITONSERVER_InferenceRequest* request, const uint32_t flags, void* userp)
{
  auto request_tracker = reinterpret_cast<RequestTracker*>(userp);
  auto pool = request_tracker->CallbackPool();
  auto fn = [request, flags, request_tracker]() {
    if ((flags & TRITONSERVER_REQUEST_RELEASE_ALL) != 0) {
      LOG_TRITONSERVER_ERROR(
          TRITONSERVER_InferenceRequestDelete(request),
          "deleting ensemble inference request");
      if (request_tracker->DecrementCounter()) {
        delete request_tracker;
      }
    }
  };
  // Attempt to enqueue the callback. If all workers are busy and queue is at
  // capacity, execute the callback immediately.
  if (pool->TaskQueueSize() < pool->Size()) {
    pool->Enqueue(fn);
  } else {
    fn();
  }
}
```
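For context, below is a minimal sketch of the pool interface the callback above relies on. Only Size(), TaskQueueSize(), and Enqueue() are taken from the snippet; the internals here are assumptions for illustration, not the actual Triton ThreadPool implementation.

```cpp
// Hypothetical sketch of a callback pool with the interface used above.
// Key property: RequestComplete only enqueues while the backlog is smaller
// than the worker count, so a saturated pool falls back to running the
// callback inline instead of growing an unbounded queue.
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class CallbackPool {
 public:
  explicit CallbackPool(std::size_t size) : size_(size) {
    for (std::size_t i = 0; i < size_; ++i) {
      workers_.emplace_back([this] {
        for (;;) {
          std::function<void()> task;
          {
            std::unique_lock<std::mutex> lk(mu_);
            cv_.wait(lk, [this] { return stop_ || !tasks_.empty(); });
            if (stop_ && tasks_.empty()) {
              return;
            }
            task = std::move(tasks_.front());
            tasks_.pop();
          }
          task();  // run the callback outside the lock
        }
      });
    }
  }

  ~CallbackPool() {
    {
      std::lock_guard<std::mutex> lk(mu_);
      stop_ = true;
    }
    cv_.notify_all();
    for (auto& w : workers_) {
      w.join();
    }
  }

  // Number of worker threads (ENSEMBLE_CB_POOL_SIZE in the PR).
  std::size_t Size() const { return size_; }

  // Number of tasks waiting to be picked up by a worker.
  std::size_t TaskQueueSize() {
    std::lock_guard<std::mutex> lk(mu_);
    return tasks_.size();
  }

  void Enqueue(std::function<void()> fn) {
    {
      std::lock_guard<std::mutex> lk(mu_);
      tasks_.push(std::move(fn));
    }
    cv_.notify_one();
  }

 private:
  const std::size_t size_;
  std::vector<std::thread> workers_;
  std::queue<std::function<void()>> tasks_;
  std::mutex mu_;
  std::condition_variable cv_;
  bool stop_ = false;
};
```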

A reviewer (Contributor) commented:

How was this number picked, and why this number?

yinggeh (Contributor Author) commented:

The previous pool size of 8 was explained in #429 (comment).
In a non-decoupled ensemble model, this PR removes half of the async callbacks, so I reduced the pool size by half, to 4.

A reviewer (Contributor) commented:

My concern is how general those experiments were.

yinggeh (Contributor Author) commented:

Could you elaborate? If the thread pool queue is full, a new task executes synchronously as before, to avoid delay. See:

```cpp
  // Attempt to enqueue the callback. If all workers are busy and queue is at
  // capacity, execute the callback immediately.
  if (pool->TaskQueueSize() < pool->Size()) {
    pool->Enqueue(fn);
  } else {
    fn();
  }
```
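In other words, the pending-task backlog is capped at the pool size; beyond that point the caller simply pays the (small) callback cost inline rather than risking an unbounded queuing delay.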


GuanLuo commented Apr 29, 2025

So I ended up reverting the async ResponseComplete callback feature entirely. The async RequestComplete callback is working fine and is kept.

Why was RequestComplete also made async in the first place?


yinggeh commented Apr 29, 2025

So I ended up reverting the async ResponseComplete callback feature entirely. The async RequestComplete callback is working fine and is kept.

Why was RequestComplete also made async in the first place?

Explanation with visualization

[Screenshot: profiling chart of the "preprocessing" composing model's backend thread, 2025-04-29]
From the chart, the backend thread of the composing model "preprocessing" takes 62 usec to complete, of which 24 usec is spent in EnsembleContext::ResponseComplete and 7 usec in EnsembleContext::RequestComplete.

Maximum throughput comparison for the sample ensemble_model:

  • Both callbacks synchronous (no optimization): 39794.5 infer/sec
  • Synchronous EnsembleContext::ResponseComplete with asynchronous EnsembleContext::RequestComplete (this PR): 47017.2 infer/sec
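That works out to roughly an 18% throughput improvement (47017.2 / 39794.5 ≈ 1.18) from moving only RequestComplete off the critical path.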

@mc-nv mc-nv requested a review from GuanLuo April 30, 2025 04:21
@yinggeh yinggeh closed this May 1, 2025