mkl_umath does not bring performance benefits relative to vanilla numpy #1

@samaid

Description

https://gist.github.com/samaid/bb680421ee29926cc7b8e536ee9a931c

The test was run on Intel DevCloud on a TGL (Tiger Lake) node in two setups:

  1. STOCK: clean environment with numpy installed from the conda-forge channel (`-c conda-forge`)
  2. INTEL: clean environment with numpy installed from the Intel channel (`-c intel`)
```
(intel) u184071@s019-n016:~/repos/dpnp-umath$ python test.py
NP: [0.18639128 0.10316299 0.25168699 ... 0.11474663 0.59490342 0.68693815]
Buffer size: 8192
0.3318898677825928
NP: [0.18639128 0.10316299 0.25168699 ... 0.11474663 0.59490342 0.68693815]
Buffer size: 1600000
0.3113992214202881
UM: [0.18639128 0.10316299 0.25168699 ... 0.11474663 0.59490342 0.68693815]
0.30924224853515625
```

```
(condaforge) u184071@s019-n016:~/repos/dpnp-umath$ python test.py
NP: [0.71962608 0.53769131 0.39456384 ... 0.20209085 0.19296594 0.17458681]
Buffer size: 8192
0.3226659297943115
NP: [0.71962608 0.53769131 0.39456384 ... 0.20209085 0.19296594 0.17458681]
Buffer size: 1600000
0.32870054244995117
No mkl_umath found. Skipping test...
```

No NumPy performance difference between the stock and Intel builds is observed at the default buffer size, and the Intel build is only marginally faster when the buffer size is raised to 16*10^5 via `numpy.setbufsize()`.
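The buffer-size knob can be exercised directly with `numpy.setbufsize()`/`numpy.getbufsize()`. The contents of `test.py` are only available via the gist, so the following is just a minimal sketch of the kind of timing described above; `np.sin` is assumed as the ufunc based on the linked VM performance chart:

```python
import time

import numpy as np

x = np.random.rand(10_000_000)

# Values passed to setbufsize must be multiples of 16, and stock NumPy
# caps them at 10**6 elements, so 800_000 stands in here for the
# 16*10^5 used in the report.
for bufsize in (8192, 800_000):
    np.setbufsize(bufsize)  # size of the ufunc work buffer, in elements
    np.sin(x)               # warm-up pass
    t0 = time.perf_counter()
    np.sin(x)
    elapsed = time.perf_counter() - t0
    print(f"Buffer size: {np.getbufsize()}  sin time: {elapsed:.4f} s")
```

Whether the larger buffer helps depends on the backend: buffering only matters when the ufunc processes the array in chunks, which is where a threaded MKL/VML build could parallelize across a bigger chunk.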

This behavior is not observed on a SPR (Sapphire Rapids) node in Intel DevCloud:

```
(intel) u184071@s018-n003:~/repos/dpnp-umath$ python test.py
NP: [0.71095155 0.23050819 0.1467021  ... 0.26945045 0.18541328 0.83865669]
Buffer size: 8192
0.4312753677368164
NP: [0.71095155 0.23050819 0.1467021  ... 0.26945045 0.18541328 0.83865669]
Buffer size: 1600000
0.04172515869140625
UM: [0.71095155 0.23050819 0.1467021  ... 0.26945045 0.18541328 0.83865669]
0.03204202651977539
```

```
(condaforge) u184071@s018-n003:~/repos/dpnp-umath$ python test.py
NP: [0.74352341 0.67897181 0.80952154 ... 0.02458932 0.78159    0.10357044]
Buffer size: 8192
0.34731459617614746
NP: [0.74352341 0.67897181 0.80952154 ... 0.02458932 0.78159    0.10357044]
Buffer size: 1600000
0.3502378463745117
No mkl_umath found. Skipping test...
```

It looks like no multithreading is exercised on the TGL system. In addition, the default buffer size is too small to get any benefit from multithreading. According to Intel's VM performance data for sin(), multithreading becomes beneficial for buffer sizes greater than 10K elements, and performance differs materially at sizes of 100K-1M:
https://www.intel.com/content/www/us/en/develop/documentation/onemkl-vmperfdata/top/real-functions/trigonometric/sin.html
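That size threshold can be probed empirically by timing `np.sin` across array sizes and comparing the per-element cost: on a build that dispatches to a threaded VML-style backend, the cost per element should drop noticeably somewhere in the 10K-1M range, while stock NumPy should stay roughly flat. This is only a sketch under that assumption, not the `test.py` from the gist:

```python
import timeit

import numpy as np

# Per-element cost of np.sin at several array sizes. A marked drop
# between ~10K and ~1M elements would indicate a threaded backend
# kicking in once the working set is large enough.
for n in (1_000, 10_000, 100_000, 1_000_000):
    x = np.random.rand(n)
    reps = max(1, 10_000_000 // n)  # keep total work roughly constant
    t = timeit.timeit(lambda: np.sin(x), number=reps)
    print(f"n={n:>9,}  ns/element={t / (reps * n) * 1e9:.2f}")
```

On the TGL runs above the per-element cost would be expected to stay flat across sizes, consistent with no multithreading being exercised there.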
