We find that many types of computing resources (such as CUDA GPUs and FPGAs) suffer from a parallel waiting problem, which is harmful for deep learning inference applications that are both computationally intensive and latency-sensitive. One way to address this is to intercept API calls at the hardware driver layer, as GPU virtualization does, but that greatly reduces generality and over-couples the system. We therefore start from the model instead: a generic allocator that masks the driver-layer scheduling, alleviating the problem above and aiming for better service latency for each request.
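As a rough illustration of this idea (not the project's actual implementation), the sketch below serializes concurrent inference requests in user space with a simple per-device lock, so that requests queue in software rather than contending inside the driver. All names in it, such as `GpuAllocator` and `run_inference_on_gpu`, are placeholders.

```cpp
// Minimal sketch: a user-space "allocator" that grants one request at a time
// exclusive use of a GPU, instead of letting all requests contend in parallel.
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

class GpuAllocator {
public:
    // Acquire exclusive use of the device, run the job, then release it.
    template <typename Job>
    void submit(Job&& job) {
        std::lock_guard<std::mutex> lock(device_mutex_);  // one request at a time
        job();
    }

private:
    std::mutex device_mutex_;  // stands in for per-device bookkeeping
};

// Placeholder for the real GPU work (e.g. an onnxruntime Session::Run call).
void run_inference_on_gpu(int request_id) {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));  // fake kernel time
    std::cout << "request " << request_id << " finished\n";
}

int main() {
    GpuAllocator allocator;
    std::vector<std::thread> clients;
    for (int i = 0; i < 4; ++i) {
        clients.emplace_back([&allocator, i] {
            allocator.submit([i] { run_inference_on_gpu(i); });
        });
    }
    for (auto& t : clients) t.join();
}
```

With the lock in place, each request sees roughly its own execution time plus queueing delay, instead of all concurrent requests slowing each other down on the device.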
We test our program with:
- GTX-2080Ti with 10-core GPU
- gcc/g++ v8.4.0
- onnxruntime-gpu v1.12.1
- C++ compiler and linker flags (an example build command is given after this list):
  - `-std=c++17`
  - `-lstdc++fs`
  - `-lonnxruntime`
  - `-lprotobuf`
  - `-lpthread`
- Add `-DALLOW_GPU_PARALLEL` if you want to disable our allocator mechanism and allow parallel GPU execution.
- nlohmann::json library installed.
- Operating System
  - Linux (tested on Ubuntu)
- Hardware Support
  - CUDA-GPU (tested on GTX 2080Ti and Tesla-T4)
- To-do
  - Mali-GPU
  - FPGA
  - DSP
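As a hypothetical example (file and output names below are placeholders, not from this repository), a build command combining the flags listed above might look like `g++ -std=c++17 main.cpp -o infer_server -lstdc++fs -lonnxruntime -lprotobuf -lpthread`, with `-DALLOW_GPU_PARALLEL` appended if you want to bypass the allocator.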