fix: Revert async execution of ensemble model ResponseComplete callback #435


Closed
yinggeh wants to merge 2 commits

Conversation

@yinggeh yinggeh commented Apr 29, 2025

What does the PR do?

Fixes L0_backend_python--base and L0_sequence_batcher--base.

The asynchronous ensemble callback feature was originally introduced in #429 to reduce the latency overhead added by the ensemble pipeline.

Asynchronous ResponseComplete did not work for decoupled models because the order of responses cannot be guaranteed. In rare cases, TRITONSERVER_RESPONSE_COMPLETE_FINAL is delivered before other responses, which causes a segfault.
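To see why the ordering matters, here is a minimal, self-contained sketch (illustrative only, not Triton code) of the hazard: once per-response callbacks are handed off to independent workers, nothing guarantees that the callback carrying the FINAL flag runs last, even if it was dispatched last.

```cpp
// Illustrative only: two "response callbacks" dispatched to separate
// threads. The FINAL callback can finish before response 0's callback,
// which is the reordering that tears down state still in use.
#include <atomic>
#include <cstdio>
#include <future>

int main() {
  for (int trial = 0; trial < 5; ++trial) {
    std::atomic<int> order{0};
    auto response0 = std::async(std::launch::async, [&] {
      volatile long work = 0;
      for (long i = 0; i < 100000; ++i) work += i;  // simulate callback work
      std::printf("trial %d: response 0 finished %d\n", trial, ++order);
    });
    auto final_response = std::async(std::launch::async, [&] {
      // In Triton this would carry TRITONSERVER_RESPONSE_COMPLETE_FINAL.
      std::printf("trial %d: FINAL finished %d\n", trial, ++order);
    });
    response0.wait();
    final_response.wait();
  }
  return 0;
}
```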

Even after disabling the async ResponseComplete call for decoupled models, one test case in L0_sequence_batcher still failed, so I ended up reverting the async ResponseComplete callback feature entirely. The async RequestComplete callback is working fine and is kept.

Checklist

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated github labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type box here and add the label to the GitHub PR.

  • fix

Related PRs:

Where should the reviewer start?

Test plan:

  • CI Pipeline ID: 27706423

Caveats:

Background

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

@yinggeh yinggeh added the bug Something isn't working label Apr 29, 2025
@yinggeh yinggeh self-assigned this Apr 29, 2025
```
@@ -62,7 +62,7 @@ constexpr char kPythonBackend[] = "python";

#ifdef TRITON_ENABLE_ENSEMBLE
constexpr char kEnsemblePlatform[] = "ensemble";
constexpr uint64_t ENSEMBLE_CB_POOL_SIZE = 8u;
```
yinggeh (Contributor Author) commented:
Decreased the thread pool size because there will be fewer callbacks running asynchronously.

A reviewer (Contributor) commented:

why?

A reviewer (Contributor) commented:

Do you even need this constant?

yinggeh (Contributor Author) commented Apr 29, 2025:

Yes. The RequestComplete callback still leverages the thread pool to run asynchronously, which benefits ensemble model throughput.

yinggeh (Contributor Author) commented:

```cpp
void
EnsembleContext::RequestComplete(
    TRITONSERVER_InferenceRequest* request, const uint32_t flags, void* userp)
{
  auto request_tracker = reinterpret_cast<RequestTracker*>(userp);
  auto pool = request_tracker->CallbackPool();
  auto fn = [request, flags, request_tracker]() {
    if ((flags & TRITONSERVER_REQUEST_RELEASE_ALL) != 0) {
      LOG_TRITONSERVER_ERROR(
          TRITONSERVER_InferenceRequestDelete(request),
          "deleting ensemble inference request");
      if (request_tracker->DecrementCounter()) {
        delete request_tracker;
      }
    }
  };
  // Attempt to enqueue the callback. If all workers are busy and queue is at
  // capacity, execute the callback immediately.
  if (pool->TaskQueueSize() < pool->Size()) {
    pool->Enqueue(fn);
  } else {
    fn();
  }
}
```
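For context, below is a minimal sketch of the pool interface the callback above relies on. Only Size(), TaskQueueSize(), and Enqueue() are taken from the snippet; the internals here are assumptions for illustration, not the actual Triton ThreadPool implementation.

```cpp
// Hypothetical sketch of a callback pool with the interface used above.
// Key property: RequestComplete only enqueues while the backlog is smaller
// than the worker count, so a saturated pool falls back to running the
// callback inline instead of growing an unbounded queue.
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class CallbackPool {
 public:
  explicit CallbackPool(std::size_t size) : size_(size) {
    for (std::size_t i = 0; i < size_; ++i) {
      workers_.emplace_back([this] {
        for (;;) {
          std::function<void()> task;
          {
            std::unique_lock<std::mutex> lk(mu_);
            cv_.wait(lk, [this] { return stop_ || !tasks_.empty(); });
            if (stop_ && tasks_.empty()) {
              return;
            }
            task = std::move(tasks_.front());
            tasks_.pop();
          }
          task();  // run the callback outside the lock
        }
      });
    }
  }

  ~CallbackPool() {
    {
      std::lock_guard<std::mutex> lk(mu_);
      stop_ = true;
    }
    cv_.notify_all();
    for (auto& w : workers_) {
      w.join();
    }
  }

  // Number of worker threads (ENSEMBLE_CB_POOL_SIZE in the PR).
  std::size_t Size() const { return size_; }

  // Number of tasks waiting to be picked up by a worker.
  std::size_t TaskQueueSize() {
    std::lock_guard<std::mutex> lk(mu_);
    return tasks_.size();
  }

  void Enqueue(std::function<void()> fn) {
    {
      std::lock_guard<std::mutex> lk(mu_);
      tasks_.push(std::move(fn));
    }
    cv_.notify_one();
  }

 private:
  const std::size_t size_;
  std::vector<std::thread> workers_;
  std::queue<std::function<void()>> tasks_;
  std::mutex mu_;
  std::condition_variable cv_;
  bool stop_ = false;
};
```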

A reviewer (Contributor) commented:

How was this number picked, and why this number?

yinggeh (Contributor Author) commented:

The previous pool size of 8 was explained in #429 (comment).
In a non-decoupled ensemble model, this PR removes half of the async callbacks, so I reduced the pool size by half, to 4.

A reviewer (Contributor) commented:

My concern is how general those experiments were.

yinggeh (Contributor Author) commented:

Could you elaborate? If the thread pool queue is full, a new task executes synchronously as before, to avoid delay. See:

```cpp
  // Attempt to enqueue the callback. If all workers are busy and queue is at
  // capacity, execute the callback immediately.
  if (pool->TaskQueueSize() < pool->Size()) {
    pool->Enqueue(fn);
  } else {
    fn();
  }
```
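In other words, the pending-task backlog is capped at the pool size; beyond that point the caller simply pays the (small) callback cost inline rather than risking an unbounded queuing delay.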


GuanLuo commented Apr 29, 2025

So I ended up reverting the async ResponseComplete callback feature entirely. The async RequestComplete callback is working fine and is kept.

Why was RequestComplete also made async in the first place?


yinggeh commented Apr 29, 2025

So I ended up reverting the async ResponseComplete callback feature entirely. The async RequestComplete callback is working fine and is kept.

Why was RequestComplete also made async in the first place?

Explanation with visualization

[Screenshot: profiling chart of the "preprocessing" composing model's backend thread, 2025-04-29]
From the chart, the backend thread of the composing model "preprocessing" takes 62 usec to complete, of which 24 usec is spent in EnsembleContext::ResponseComplete and 7 usec in EnsembleContext::RequestComplete.

Maximum throughput comparison for the sample ensemble_model:

  • Both callbacks synchronous (no optimization): 39794.5 infer/sec
  • Synchronous EnsembleContext::ResponseComplete with asynchronous EnsembleContext::RequestComplete (this PR): 47017.2 infer/sec
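That works out to roughly an 18% throughput improvement (47017.2 / 39794.5 ≈ 1.18) from moving only RequestComplete off the critical path.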

@mc-nv mc-nv requested a review from GuanLuo April 30, 2025 04:21
@yinggeh yinggeh closed this May 1, 2025