This repository was archived by the owner on Apr 28, 2023. It is now read-only.
This commit reduces overhead by avoiding the 2-step allocOutputs-then-run sequence
that crosses the Python/C++ boundary. It also removes the synchronization issue
mentioned in the previous commit, although we still need to find its root cause.
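
As a rough illustration of the idea only (not the code in this commit), the sketch below contrasts a two-call allocate-then-run pattern with a single fused call. `FakeBackend`, `alloc_outputs`, `alloc_and_run`, and the `crossings` counter are hypothetical names that stand in for a compiled extension; the doubling "kernel" is a placeholder.

```python
class FakeBackend:
    """Hypothetical stand-in for a compiled extension; counts calls into it."""

    def __init__(self):
        self.crossings = 0  # number of simulated Python -> C++ calls

    def alloc_outputs(self, inputs):
        # Step 1 of the two-step pattern: allocate output buffers.
        self.crossings += 1
        return [[0.0] * len(x) for x in inputs]

    def run(self, inputs, outputs):
        # Step 2 of the two-step pattern: fill the preallocated buffers.
        self.crossings += 1
        for x, out in zip(inputs, outputs):
            for i, v in enumerate(x):
                out[i] = 2.0 * v  # placeholder computation
        return outputs

    def alloc_and_run(self, inputs):
        # Fused variant: allocate and compute in a single call.
        self.crossings += 1
        return [[2.0 * v for v in x] for x in inputs]


def two_step(backend, inputs):
    outputs = backend.alloc_outputs(inputs)
    return backend.run(inputs, outputs)


def fused(backend, inputs):
    return backend.alloc_and_run(inputs)
```

Both paths return the same values; the fused path simply enters the backend once per invocation instead of twice, which is where the per-call launch overhead would be saved.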
```
python ./test_python/pytorch_example.py
raw unchecked_run naive options Total CPU time to launch kernel: min 19us, p50 29us, p90 34us, max 2204us
raw unchecked_run naive options Total CPU launch + GPU kernel time: min 365us, p50 374us, p90 379us, max 2562us
Tune with cache @ /tmp/d17dd046-de20-40ce-b52d-79cd3286fcb4
Note that if you pass a fixed filename, you can reinforce an existing tuning state
Iteration 0 Jobs(Compiled, Evaluated)/total (25, 25)/25 (best/median/worst)us: 346/18111/196621
Iteration 1 Jobs(Compiled, Evaluated)/total (25, 25)/25 (best/median/worst)us: 346/905/1621
Iteration 2 Jobs(Compiled, Evaluated)/total (25, 25)/25 (best/median/worst)us: 335/739/1616
raw unchecked_run tuned options Total CPU time to launch kernel: min 17us, p50 21us, p90 23us, max 1030us
raw unchecked_run tuned options Total CPU launch + GPU kernel time: min 350us, p50 354us, p90 360us, max 1365us
TcBuilder unchecked_run Total CPU time to launch kernel: min 19us, p50 22us, p90 35us, max 2439us
TcBuilder unchecked_run Total CPU launch + GPU kernel time: min 303us, p50 307us, p90 320us, max 2712us
TcFunction forward unchecked_run Total CPU time to launch kernel: min 41us, p50 62us, p90 70us, max 857164us
TcFunction forward unchecked_run Total CPU launch + GPU kernel time: min 317us, p50 338us, p90 351us, max 857281us
TcFunction backward unchecked_run Total CPU time to launch kernel: min 344us, p50 388us, p90 412us, max 883us
TcFunction backward unchecked_run Total CPU launch + GPU kernel time: min 1321us, p50 1351us, p90 1371us, max 1849us
MultiTcBuilder unchecked_run Total CPU time to launch kernel: min 14us, p50 22us, p90 25us, max 1863us
MultiTcBuilder unchecked_run Total CPU launch + GPU kernel time: min 298us, p50 305us, p90 310us, max 2136us
MultiTcFunction forward unchecked_run Total CPU time to launch kernel: min 35us, p50 58us, p90 67us, max 506382us
MultiTcFunction forward unchecked_run Total CPU launch + GPU kernel time: min 197us, p50 334us, p90 342us, max 506619us
MultiTcFunction backward unchecked_run Total CPU time to launch kernel: min 275us, p50 364us, p90 383us, max 438us
MultiTcFunction backward unchecked_run Total CPU launch + GPU kernel time: min 1265us, p50 1333us, p90 1350us, max 1379us
```
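
For readers who want to produce a similar min/p50/p90/max summary for their own callables, here is a minimal sketch. It is not the benchmark code from `pytorch_example.py`; the helper names (`percentile`, `summarize_us`, `time_calls`, `format_summary`) and the nearest-rank percentile definition are assumptions. It measures wall-clock time on the CPU only, so timing GPU work this way would additionally require synchronizing the device before reading the clock.

```python
import time


def percentile(sorted_vals, p):
    """Nearest-rank percentile of an already sorted, non-empty list (p in 1..100)."""
    n = len(sorted_vals)
    rank = max(1, (p * n + 99) // 100)  # integer ceil(p * n / 100)
    return sorted_vals[rank - 1]


def summarize_us(samples_us):
    """Return min, p50, p90 and max of a list of timings in microseconds."""
    s = sorted(samples_us)
    return {
        "min": s[0],
        "p50": percentile(s, 50),
        "p90": percentile(s, 90),
        "max": s[-1],
    }


def time_calls(fn, iterations=1000):
    """Call fn repeatedly and summarize the per-call wall-clock time."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e6)  # seconds -> us
    return summarize_us(samples)


def format_summary(label, stats):
    """Render a summary in a layout similar to the log above."""
    return (
        f"{label}: min {stats['min']:.0f}us, p50 {stats['p50']:.0f}us, "
        f"p90 {stats['p90']:.0f}us, max {stats['max']:.0f}us"
    )
```

Reporting percentiles alongside the maximum is useful here because the maxima in the log above are dominated by a few outlier iterations, while p50 and p90 reflect the steady-state cost.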