We find that many types of computing resources (such as CUDA GPUs and FPGAs) suffer from a parallel-waiting problem, which hurts deep learning inference applications that are both computationally intensive and latency-sensitive. One way to address this is to intercept API calls at the hardware driver layer, as GPU virtualization does, but that greatly reduces generality and over-couples the system. We therefore start from the model instead: we build a generic allocator that masks the driver-layer scheduling, aiming to achieve better service latency for each request.
We tested our program with:
- GTX-2080Ti (10-core GPU)
- gcc/g++ v8.4.0
- grpc (GitHub commit 8f6ae3599f247c3e0de604b5321538b99f3d68a3)
- protobuf v3.22.2 (installed from the grpc source tree)
- onnxruntime-gpu v1.12.1
- Required C++ compiler flags:
  - `-std=c++17`
  - `-lstdc++fs`
  - `-lonnxruntime`
  - `-lprotobuf`
  - `-lpthread`
- Add `-DPARALLER_MODE` if you only want to disable our allocator mechanism.
- The nlohmann::json library must be installed.
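For reference, a manual invocation combining the flags above might look like the following (the source file name, output name, and library search path are placeholders; the project itself builds with CMake as shown below):

```shell
# hypothetical single-file build showing the required flags;
# adjust -I/-L paths to where onnxruntime and protobuf are installed
g++ -std=c++17 main.cpp -o allocator \
    -L/usr/local/lib \
    -lstdc++fs -lonnxruntime -lprotobuf -lpthread
```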
You can build multiple versions by setting a compiler flag: `DLIR_MODE` (the default), `BNST_MODE`, `FIFO_MODE`, `OYST_MODE`, and `PARALLER_MODE` are available.
- `DLIR_MODE`: DLIR mode; allows auto-split and sorting.
- `BNST_MODE`: similar to `DLIR_MODE`, but splitting is not allowed.
- `OYST_MODE`: similar to `DLIR_MODE`, but splitting is forced.
- `PARALLER_MODE`: runs all kinds of tasks in multiple processes.
- `FIFO_MODE`: runs tasks in FIFO order.
- Compile `DLIR_MODE` as an example:

```shell
git clone git@github.com:EdgeScheduler/DLIR-Allocator.git
cd DLIR-Allocator
mkdir -p build && cd build
cmake ../ -DCOMPILE_MODE="DLIR_MODE"
make
# the binary is produced at DLIR-Allocator/bin/release/DLIR-Allocator
```
- Alternatively, you can compile all versions by running the build script directly:

```shell
git clone git@github.com:EdgeScheduler/DLIR-Allocator.git
cd DLIR-Allocator
./scripts/build.sh
# all binaries are produced at DLIR-Allocator/bin/release/*-Allocator
```
- Operating System
  - Linux (tested on Ubuntu)
- Hardware Support
  - CUDA-GPU (tested on GTX 2080Ti and Tesla T4)
  - to-do:
    - Mali-GPU
    - FPGA
    - DSP
Relationship with OnnxSplitRunner

To eliminate the negative effects of Python's fake multi-threading (caused by the GIL), we eventually decided to refactor the code in C++. The original Python project can still be found at: https://github.com/EdgeScheduler/OnnxSplitRunner