Dynamically Allocated Engines #3714
narendasan started this conversation in RFCs
TL;DR
Some engines require significantly more GPU memory than their original PyTorch modules. This means that workflows which work in PyTorch can fail in TensorRT even when the engines compile successfully.
If users could provide the memory needed to run an engine dynamically and then reclaim it for other modules afterwards, this issue would be addressed.
Goal(s)
Do not reserve GPU memory until execution: as part of the engine execution process, the runtime will allocate the memory required by the engine and release it once execution has completed.
Usecases
Running a diffusers pipeline that contains multiple sub-modules: if a TRT engine is larger than the original PyTorch module, these workflows will fail unless engine memory is released dynamically between module executions.
Proposed APIs / UX
Similar to other runtime settings, there will be two sets of APIs.
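A minimal sketch of what the two surfaces might look like, assuming a hypothetical compile-time setting `dynamically_allocate_resources` and a hypothetical runtime context manager (names are illustrative, not a finalized API):

```python
import torch
import torch_tensorrt

model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3), torch.nn.ReLU()).cuda().eval()
x = torch.randn(1, 3, 224, 224, device="cuda")

# 1. Compile-time setting (hypothetical name): the compiled module will not
#    hold its activation memory between calls.
trt_module = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=[x],
    dynamically_allocate_resources=True,  # assumed setting name
)

# 2. Runtime API (hypothetical): toggle the mode on an already compiled module,
#    scoped with a context manager, mirroring other runtime settings.
with torch_tensorrt.runtime.enable_dynamic_allocation(trt_module):  # assumed API
    out = trt_module(x)
```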
Example Workflow
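A hedged example of a multi-module workflow using the assumed compile setting above: each sub-module is compiled separately, and with dynamic allocation only the currently executing engine holds its activation memory, so the engines never occupy GPU memory at the same time.

```python
import torch
import torch_tensorrt

# Two sub-modules of a larger pipeline (stand-ins for e.g. a diffusers
# text encoder / UNet / VAE), compiled to TensorRT separately.
encoder = torch.nn.Sequential(torch.nn.Linear(1024, 4096), torch.nn.GELU()).cuda().eval()
decoder = torch.nn.Sequential(torch.nn.Linear(4096, 1024), torch.nn.GELU()).cuda().eval()
x = torch.randn(8, 1024, device="cuda")

encoder_trt = torch_tensorrt.compile(
    encoder, ir="dynamo", inputs=[x],
    dynamically_allocate_resources=True,  # assumed setting name
)
decoder_trt = torch_tensorrt.compile(
    decoder, ir="dynamo", inputs=[encoder(x)],
    dynamically_allocate_resources=True,  # assumed setting name
)

# The encoder engine's memory is allocated for its call and released before
# the decoder engine runs, keeping peak GPU usage close to one engine at a time.
h = encoder_trt(x)
y = decoder_trt(h)
```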
Limitations
This does not reduce the memory utilization of an individual engine; it only addresses multi-module pipelines by allowing a TRT engine to vacate GPU memory before the next engine or PyTorch module runs.
Internal Implementation
Design
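One way the runtime could realize this, sketched against existing TensorRT Python APIs (`create_execution_context_without_device_memory`, `device_memory_size`, `device_memory`): defer the activation-memory allocation to execution time and release it when the call returns. This is an illustrative sketch, not the actual runtime implementation.

```python
import tensorrt as trt
import torch

def execute_with_dynamic_memory(engine: trt.ICudaEngine, run_fn):
    # Create an execution context that does not reserve the engine's
    # activation memory up front.
    context = engine.create_execution_context_without_device_memory()

    # Allocate the activation/scratch memory only for the duration of this
    # call; drawing it from the PyTorch caching allocator lets subsequent
    # engines or PyTorch modules reuse it afterwards.
    scratch = torch.empty(engine.device_memory_size, dtype=torch.uint8, device="cuda")
    context.device_memory = scratch.data_ptr()

    try:
        # run_fn is expected to bind I/O tensors and launch the engine.
        return run_fn(context)
    finally:
        # Dropping the context and the scratch tensor returns the memory to
        # the allocator once outstanding work has completed.
        del context
        del scratch
```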
Extensions Required to Core API implementations
We need an additional runtime mode added to both the C++ and Python runtimes.
Data Structures
New enum to describe the runtime mode.
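A sketch of what such an enum might look like (the name `ResourceAllocationStrategy` and its values are assumptions); an equivalent enum would be mirrored in the C++ runtime.

```python
from enum import Enum, auto

class ResourceAllocationStrategy(Enum):  # hypothetical name
    STATIC = auto()   # current behavior: engine memory held for the lifetime of the module
    DYNAMIC = auto()  # memory allocated at execution time and released after each call
```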
Implementation Phases
Prototype - S
Support in C++ runtime
MVP (2.9.0) - S
All of the above supported in C++ and Python