How does ISPC load from input arrays and store into output arrays? #2466
Replies: 2 comments
-
Could you share your code via Compiler Explorer? Looking at the assembly, it should be pretty obvious whether the instructions are the same or not.
-
Actually I was thinking about this the wrong way:

float *x_array; // millions of elements
__m256 *x_array_simd = new __m256[x_array_size];
//...
// This part was taking 40% of the time
for (unsigned int i = 0; i < VCOUNT; ++i) {
    const unsigned int si = i * VECTOR_SIZE;
    x_array_simd[i] = _mm256_load_ps(x_array + si);
    //...
}
for (unsigned int i = 0; i < VCOUNT; ++i) {
    __m256 x = x_array_simd[i];
    // Do work
    // Store into output array
}

I was loading the whole array from float to __m256. Now I do:

for (unsigned int i = 0; i < VCOUNT; ++i) {
    const unsigned int si = i * VECTOR_SIZE;
    __m256 x = _mm256_load_ps(x_array + si);
    // Do work
    // Store into output array
}

Now the timings are a lot more similar.
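For anyone landing here later, a minimal sketch of the fused version, with some assumptions of mine: the name process and the x * x + 1 "work" are hypothetical stand-ins (not the Perlin code from this thread), I use the unaligned _mm256_loadu_ps / _mm256_storeu_ps because a plain float * from a caller is not guaranteed to be 32-byte aligned, and a scalar loop handles a count that is not a multiple of 8. The AVX path is only compiled when the compiler defines __AVX__ (for example with -mavx); otherwise the scalar loop does everything.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical stand-in for the real per-element work (not Perlin noise).
static inline float work_scalar(float x) { return x * x + 1.0f; }

// Fused loop: load, compute, and store in a single pass over the data,
// so no temporary __m256 array is written and read back.
void process(const float *x_array, float *output, std::size_t count) {
    assert(x_array != nullptr && output != nullptr);
    std::size_t i = 0;
#if defined(__AVX__)
    const __m256 one = _mm256_set1_ps(1.0f);
    for (; i + 8 <= count; i += 8) {
        // Unaligned load: valid for any float pointer, aligned or not.
        __m256 x = _mm256_loadu_ps(x_array + i);
        __m256 r = _mm256_add_ps(_mm256_mul_ps(x, x), one);
        _mm256_storeu_ps(output + i, r);
    }
#endif
    // Scalar tail, and the whole range when AVX is not enabled at compile time.
    for (; i < count; ++i) {
        output[i] = work_scalar(x_array[i]);
    }
}
```

Calling process(in, out, n) on two float buffers of length n fills out in one pass; the staged version does the same arithmetic but writes every vector to memory once and reads it back once more.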
-
I'm new to SIMD programming, and one of my first tests was to port a Perlin noise implementation to AVX2 manually, writing it the "SPMD way".

The API I wanted was perlin(float *x_array, float *y_array, float *z_array, float *output, int count), but I often read that it's preferable to use the appropriate SIMD instructions to load and store data to SIMD registers, instead of reinterpret_casting those arrays or using union tricks.

So one of the steps I did was to load them into temporary arrays of __m256 first, using code like this:

And then I passed those temporary arrays to the SIMD implementation.

I did some profiling and found that the loading part took 566 us and the Perlin noise calculation took 1433 us (so it took 1999 us in total).

Then I ported the same Perlin noise to ISPC and found that it took 1300 us, which is slightly better than my manual code (which arguably still had some stuff I knew wasn't optimal). But I noticed that while my "loading part" took a while, it's as if ISPC didn't do this at all. How does ISPC deal with it? Maybe I didn't need to do it this way, or I used the wrong instruction?
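On the "wrong instruction" part, one thing that may be worth checking is alignment: _mm256_load_ps requires a 32-byte aligned address, while _mm256_loadu_ps accepts any address. A small sketch of how that could be checked for an arbitrary float * follows; is_aligned_32 is a hypothetical helper name of mine, not something from ISPC or the intrinsics headers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true if p is 32-byte aligned, which is the
// requirement for aligned 256-bit loads and stores such as _mm256_load_ps.
inline bool is_aligned_32(const void *p) {
    return reinterpret_cast<std::uintptr_t>(p) % 32 == 0;
}
```

A buffer declared with alignas(32) passes this check at its start and at every 8th float; a pointer offset by a single float does not, and in that case the unaligned load variant is the safe choice.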