fix: Revert async execution of ensemble model ResponseComplete callback #435
Conversation
```diff
@@ -62,7 +62,7 @@ constexpr char kPythonBackend[] = "python";

 #ifdef TRITON_ENABLE_ENSEMBLE
 constexpr char kEnsemblePlatform[] = "ensemble";
-constexpr uint64_t ENSEMBLE_CB_POOL_SIZE = 8u;
+constexpr uint64_t ENSEMBLE_CB_POOL_SIZE = 4u;
```
Decrease the thread pool size because there will be fewer callbacks running asynchronously.
why?
Do you even need this constant?
Yes. The RequestComplete callback still leverages the thread pool to run asynchronously, which benefits ensemble model throughput.
core/src/ensemble_scheduler/ensemble_scheduler.cc, lines 608 to 632 in 729422e:

```cpp
void
EnsembleContext::RequestComplete(
    TRITONSERVER_InferenceRequest* request, const uint32_t flags, void* userp)
{
  auto request_tracker = reinterpret_cast<RequestTracker*>(userp);
  auto pool = request_tracker->CallbackPool();
  auto fn = [request, flags, request_tracker]() {
    if ((flags & TRITONSERVER_REQUEST_RELEASE_ALL) != 0) {
      LOG_TRITONSERVER_ERROR(
          TRITONSERVER_InferenceRequestDelete(request),
          "deleting ensemble inference request");
      if (request_tracker->DecrementCounter()) {
        delete request_tracker;
      }
    }
  };
  // Attempt to enqueue the callback. If all workers are busy and queue is at
  // capacity, execute the callback immediately.
  if (pool->TaskQueueSize() < pool->Size()) {
    pool->Enqueue(fn);
  } else {
    fn();
  }
}
```
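For context, the guard at the bottom assumes a pool that exposes Size() (worker count), TaskQueueSize() (currently queued tasks), and Enqueue(). The following is a minimal self-contained sketch of a pool with that interface, purely illustrative and not Triton's actual ThreadPool implementation:

```cpp
// Minimal bounded thread pool sketch (hypothetical; not Triton's ThreadPool).
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class SketchThreadPool {
 public:
  explicit SketchThreadPool(size_t size) : stop_(false) {
    for (size_t i = 0; i < size; ++i) {
      workers_.emplace_back([this]() {
        for (;;) {
          std::function<void()> task;
          {
            std::unique_lock<std::mutex> lk(mu_);
            cv_.wait(lk, [this]() { return stop_ || !tasks_.empty(); });
            if (stop_ && tasks_.empty()) {
              return;
            }
            task = std::move(tasks_.front());
            tasks_.pop();
          }
          task();  // run outside the lock
        }
      });
    }
  }

  ~SketchThreadPool() {
    {
      std::lock_guard<std::mutex> lk(mu_);
      stop_ = true;
    }
    cv_.notify_all();
    for (auto& w : workers_) {
      w.join();
    }
  }

  // Number of worker threads.
  size_t Size() const { return workers_.size(); }

  // Number of tasks waiting to be picked up by a worker.
  size_t TaskQueueSize() {
    std::lock_guard<std::mutex> lk(mu_);
    return tasks_.size();
  }

  // Add a task for asynchronous execution.
  void Enqueue(std::function<void()> fn) {
    {
      std::lock_guard<std::mutex> lk(mu_);
      tasks_.push(std::move(fn));
    }
    cv_.notify_one();
  }

 private:
  std::vector<std::thread> workers_;
  std::queue<std::function<void()>> tasks_;
  std::mutex mu_;
  std::condition_variable cv_;
  bool stop_;
};
```

With this shape, the pool->TaskQueueSize() < pool->Size() check caps the backlog at roughly one queued callback per worker; past that point RequestComplete runs the callback inline on the calling thread instead of letting the queue grow.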
How was this number picked, and why this number specifically?
The previous pool size of 8 was explained in #429 (comment). In a non-decoupled ensemble model, this PR removes roughly half of the async callbacks, so I reduced the pool size by half, to 4.
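Put concretely, here is a sketch of the resulting constant, assuming the hunk above lands as described (the comment in the sketch restates the rationale from this thread):

```cpp
#ifdef TRITON_ENABLE_ENSEMBLE
// After this PR only RequestComplete runs on the pool (ResponseComplete is
// synchronous again), so roughly half the async callbacks remain and the
// pool size is halved from 8 to 4.
constexpr uint64_t ENSEMBLE_CB_POOL_SIZE = 4u;
#endif  // TRITON_ENABLE_ENSEMBLE
```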
My concern is how general those experiments were.
Could you elaborate? If the thread pool queue is full, a new task executes synchronously, as before, to avoid delay. See core/src/ensemble_scheduler/ensemble_scheduler.cc, lines 625 to 632 in 729422e:

```cpp
  // Attempt to enqueue the callback. If all workers are busy and queue is at
  // capacity, execute the callback immediately.
  if (pool->TaskQueueSize() < pool->Size()) {
    pool->Enqueue(fn);
  } else {
    fn();
  }
}
```
What does the PR do?

Fixes L0_backend_python--base and L0_sequence_batcher--base.

The ensemble asynchronous callback feature was introduced in #429 to reduce the latency overhead added in the ensemble pipeline. The asynchronous ResponseComplete callback did not work for decoupled models because the order of responses cannot be guaranteed: in rare cases, TRITONSERVER_RESPONSE_COMPLETE_FINAL is sent before other responses and causes a segfault. Even with the async ResponseComplete call disabled for decoupled models, one test case in L0_sequence_batcher still failed, so I ended up reverting the async ResponseComplete callback feature. The async RequestComplete callback is working fine and is kept.
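To make the ordering hazard concrete, here is a minimal standalone C++ sketch. It is illustrative only: it uses no Triton APIs, and the timing is an assumption chosen to expose the race, not a reproduction of the failing tests. With more than one worker running callbacks, the callback dispatched last (the analogue of the FINAL response) can complete first:

```cpp
// Standalone sketch: completion order of asynchronously dispatched callbacks
// is not the dispatch order. The "FINAL" callback, dispatched last, finishes
// first here because the earlier ones happen to be slower.
#include <chrono>
#include <future>
#include <iostream>
#include <thread>
#include <vector>

int main() {
  std::vector<std::future<void>> callbacks;
  for (int i = 0; i < 3; ++i) {
    callbacks.push_back(std::async(std::launch::async, [i]() {
      // Earlier "responses" take longer than the final one.
      std::this_thread::sleep_for(std::chrono::milliseconds((2 - i) * 50));
      if (i == 2) {
        std::cout << "FINAL response completed\n";  // usually prints first
      } else {
        std::cout << "response " << i << " completed\n";
      }
    }));
  }
  for (auto& cb : callbacks) {
    cb.wait();
  }
  return 0;
}
```

The segfault is rare presumably because it requires exactly this overtaking: the FINAL callback must complete before an earlier response callback that still depends on shared request state.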
Checklist

<commit_type>: <Title>

Commit Type:
Check the conventional commit type box here and add the label to the GitHub PR.

Related PRs:

Where should the reviewer start?

Test plan:
27706423

Caveats:

Background:

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)