When we use PyTorch tensors with slangpy, the CPU overhead seems quite high. Profiling narrowed it down to slangpy_native_call being the top contributor to the per-call cost.
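For reference, one way to confirm where the CPU time goes is to run the submit loop under cProfile. This is just a sketch, not the exact profiling setup I used; func, a, b and res refer to the benchmark below.

import cProfile
import pstats

# Profile 1000 submits and show the 10 most expensive entries by
# cumulative time; slangpy_native_call shows up at the top here.
with cProfile.Profile() as pr:
    for _ in range(1000):
        func(a, b, _result=res)
pstats.Stats(pr).sort_stats("cumulative").print_stats(10)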
With a local test run, I see an average submit time of 0.867732 ms. If the number of parameters is increased to, say, 5, the average becomes 1.277054 ms (a 5-parameter variant is sketched after the benchmark).
Some overhead is expected when slangpy wraps torch tensors, but the goal of this bug is to figure out how much of it can be reduced. Below is a simple benchmark that reproduces the issue.
import pytest
from time import time

import slangpy as spy
from slangpy.testing import helpers

ADD_FLOATS = """
float add_floats(float a, float b) {
    return a + b;
}
"""


def benchmark_torch(device_type: spy.DeviceType):
    device = helpers.get_torch_device(device_type)
    func = helpers.create_function_from_module(device, "add_floats", ADD_FLOATS)

    BUFFER_SIZE = 64 * 1024 * 1024

    import torch

    a = torch.randn((BUFFER_SIZE,), dtype=torch.float32, device="cuda")
    b = torch.randn((BUFFER_SIZE,), dtype=torch.float32, device="cuda")
    res = torch.randn((BUFFER_SIZE,), dtype=torch.float32, device="cuda")

    # Warmup + wait for cuda
    func(a, b, _result=res)
    device.wait_for_idle()

    input("Press Enter to start torch benchmark...")

    test_start = time()
    for _ in range(1000):
        func(a, b, _result=res)
    submit_end = time()  # all CPU-side submits done; the GPU may still be busy
    device.wait_for_idle()
    test_end = time()

    # (submit_end - test_start) is the total CPU-side submit time in seconds
    # for 1000 calls; x1000 converts to ms, /1000 averages per call.
    print(
        f"Torch benchmark completed in {test_end - test_start:.3f} seconds, "
        f"average submit time {1000 * (submit_end - test_start) / 1000:.6f}ms"
    )


if __name__ == "__main__":
    # pytest.main([__file__, "-v", "-s"])
    benchmark_torch(spy.DeviceType.cuda)
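For the 5-parameter number quoted above, a minimal variant of the same benchmark can be used. This is a sketch under the same assumptions; add_floats5 is a hypothetical kernel that simply extends the pattern, and the timing logic mirrors benchmark_torch.

ADD_FLOATS5 = """
float add_floats5(float a, float b, float c, float d, float e) {
    return a + b + c + d + e;
}
"""


def benchmark_torch_5_params(device_type: spy.DeviceType):
    # Same structure as benchmark_torch, but with five tensor arguments so
    # the per-parameter marshalling cost in slangpy_native_call adds up.
    import torch

    device = helpers.get_torch_device(device_type)
    func = helpers.create_function_from_module(device, "add_floats5", ADD_FLOATS5)
    BUFFER_SIZE = 64 * 1024 * 1024
    args = [
        torch.randn((BUFFER_SIZE,), dtype=torch.float32, device="cuda")
        for _ in range(5)
    ]
    res = torch.randn((BUFFER_SIZE,), dtype=torch.float32, device="cuda")

    func(*args, _result=res)  # warmup
    device.wait_for_idle()

    test_start = time()
    for _ in range(1000):
        func(*args, _result=res)
    submit_end = time()
    device.wait_for_idle()
    print(f"average submit time {1000 * (submit_end - test_start) / 1000:.6f}ms")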