https://gist.github.com/samaid/bb680421ee29926cc7b8e536ee9a931c
The test was run on Intel DevCloud on a TGL node in two setups:
- STOCK: Clean environment with numpy installed from -c conda-forge
- INTEL: Clean environment with numpy installed from -c intel
(intel) u184071@s019-n016:~/repos/dpnp-umath$ python test.py
NP: [0.18639128 0.10316299 0.25168699 ... 0.11474663 0.59490342 0.68693815]
Buffer size: 8192
0.3318898677825928
NP: [0.18639128 0.10316299 0.25168699 ... 0.11474663 0.59490342 0.68693815]
Buffer size: 1600000
0.3113992214202881
UM: [0.18639128 0.10316299 0.25168699 ... 0.11474663 0.59490342 0.68693815]
0.30924224853515625
(condaforge) u184071@s019-n016:~/repos/dpnp-umath$ python test.py
NP: [0.71962608 0.53769131 0.39456384 ... 0.20209085 0.19296594 0.17458681]
Buffer size: 8192
0.3226659297943115
NP: [0.71962608 0.53769131 0.39456384 ... 0.20209085 0.19296594 0.17458681]
Buffer size: 1600000
0.32870054244995117
No mkl_umath found. Skipping test...
No NumPy performance difference between the stock and intel builds is observed at the default buffer size, and the intel build is only marginally faster when the buffer size is raised to 16*10^5 via numpy.setbufsize().
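The linked gist is not reproduced here; as a minimal sketch of the same kind of measurement (the function name `bench_sin` and the array size are my assumptions, not taken from the gist), timing `np.sin` under the default and enlarged ufunc buffers could look like this:

```python
import time
import numpy as np

def bench_sin(bufsize, n=10_000_000, repeats=3):
    """Time np.sin over an n-element array with the given ufunc buffer size."""
    # Buffer size is in elements (not bytes) and must be a multiple of 16.
    np.setbufsize(bufsize)
    x = np.random.random(n)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.time()
        y = np.sin(x)
        best = min(best, time.time() - t0)
    return best, y

for size in (8192, 1_600_000):  # default buffer vs 16*10^5
    elapsed, _ = bench_sin(size)
    print(f"Buffer size: {size}  time: {elapsed:.4f} s")
```

With an MKL-backed build, a larger buffer gives the vectorized/threaded math kernels longer contiguous chunks to work on per ufunc inner-loop call, which is where any speedup would come from.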
This behavior is not observed on an SPR node in Intel DevCloud:
(intel) u184071@s018-n003:~/repos/dpnp-umath$ python test.py
NP: [0.71095155 0.23050819 0.1467021 ... 0.26945045 0.18541328 0.83865669]
Buffer size: 8192
0.4312753677368164
NP: [0.71095155 0.23050819 0.1467021 ... 0.26945045 0.18541328 0.83865669]
Buffer size: 1600000
0.04172515869140625
UM: [0.71095155 0.23050819 0.1467021 ... 0.26945045 0.18541328 0.83865669]
0.03204202651977539
(condaforge) u184071@s018-n003:~/repos/dpnp-umath$ python test.py
NP: [0.74352341 0.67897181 0.80952154 ... 0.02458932 0.78159 0.10357044]
Buffer size: 8192
0.34731459617614746
NP: [0.74352341 0.67897181 0.80952154 ... 0.02458932 0.78159 0.10357044]
Buffer size: 1600000
0.3502378463745117
No mkl_umath found. Skipping test...
Two things seem to be at play. First, it looks like no multithreading is exercised on the TGL system. Second, the default buffer size is too small to get any benefit from multithreading. According to this chart, multithreading becomes beneficial at buffer sizes greater than 10K, and performance differs materially at sizes of 100K-1M:
https://www.intel.com/content/www/us/en/develop/documentation/onemkl-vmperfdata/top/real-functions/trigonometric/sin.html
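A hedged sketch of how one might sweep buffer sizes on a given node to check where that 10K-1M range starts to matter (the helper name `sweep` and the chosen sizes are my assumptions; absolute timings depend entirely on the numpy build and CPU):

```python
import time
import numpy as np

def sweep(sizes=(8192, 10_000, 100_000, 1_000_000), n=10_000_000):
    """Return {bufsize: elapsed seconds} for np.sin over an n-element array."""
    x = np.random.random(n)
    timings = {}
    for size in sizes:
        np.setbufsize(size)  # in elements; must be a multiple of 16
        t0 = time.time()
        np.sin(x)
        timings[size] = time.time() - t0
    return timings

for size, t in sweep().items():
    print(f"bufsize {size:>9}: {t:.4f} s")
```

On a build that dispatches to threaded MKL kernels (as on the SPR node above), the timings should drop noticeably once the buffer crosses the ~10K threshold; on the TGL node they would stay roughly flat.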