
Use size_t instead of unsigned int #555

Closed

Conversation

@camel-cdr (Contributor) commented Mar 13, 2025

Currently both size_t and unsigned int are used as counters and to index arrays in different parts of the code.

Using size_t as the default unsigned integer type is usually preferable, as it matches the native GPR width and thus produces better codegen, especially when used to index into arrays.
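
To illustrate the codegen point, here is a hypothetical example (not code from this PR): on LP64 targets a 32-bit unsigned index may need to be zero-extended to pointer width before each address computation, while a size_t index is already GPR-wide.

#include <cstddef>

// Hypothetical sketch: the unsigned int index is narrower than a pointer,
// so the compiler may emit a zero-extend per iteration before addressing.
float sum_uint(const float* data, unsigned int count)
{
    float sum = 0.0f;
    for (unsigned int i = 0; i < count; i++)
    {
        sum += data[i]; // index may be widened to 64 bits on each use
    }
    return sum;
}

// With size_t the index already matches the pointer width, so the
// address computation needs no extension.
float sum_size_t(const float* data, size_t count)
{
    float sum = 0.0f;
    for (size_t i = 0; i < count; i++)
    {
        sum += data[i];
    }
    return sum;
}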

I measured the difference on the hardware I have and got a small improvement across all tested cores:

                 speedup
             before vs after
Zen1:       +2.181x vs +2.189x
Cortex-X1:  +1.917x vs +1.936x
Cortex-A78: +1.714x vs +1.739x
Cortex-A55: +1.726x vs +1.767x


Build cmd: cmake -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc)
Benchmark: python3 ./Test/astc_test_image.py --encoder native --test-set Small --test-quality all --repeats 5 -j 1

The ARM measurements are from my phone. I made sure to taskset to the correct cores and ran the old and new builds simultaneously on separate cores. There may have been some Android throttling due to standby mode, so runs on different cores may not be directly comparable.

The code changes are just a direct replacement of unsigned int with size_t where applicable, with the following exceptions:

  1. I needed to introduce a _z user-defined literal for size_t constants, because the astc:: math functions, which now operate on size_t, sometimes get literals passed to them (see the sketch after the examples below).

  2. The printf format specifiers needed to be adjusted. While doing so, I noticed that a few had mismatched signedness:

printf("WARNING: Only intersection of images will be compared:\n"
" Image 1: %dx%dx%d\n"
" Image 2: %dx%dx%d\n",
img1->dim_x, img1->dim_y, img1->dim_z,
img2->dim_x, img2->dim_y, img2->dim_z);

printf(" RGB alpha scale weight: %d\n", (config.flags & ASTCENC_FLG_USE_ALPHA_WEIGHT));

printf(" Components: %d\n\n", image_uncomp_in_component_count);

@camel-cdr (Contributor, Author) commented Mar 13, 2025

I'm not sure why the tests are failing.

./Test/astc_test_image.py completes without issue, and after reviewing the code changes again I didn't see anything that would cause this error.

The failing test ./Test/astc_test_functional.py also fails on the current master branch, and even on the 5.0.0 branch, when I tried it on my machine.

@solidpixel (Contributor) commented Mar 13, 2025

> Currently both size_t and unsigned int are used as counters and to index arrays in different parts of the code.

Fair enough, but you've also changed a lot of things that are not used as either counters or array indices, and that are, most of the time, not used as sizes in "maths" at all (e.g. flags). Memory size matters, so changing the sizes of fields in structs is generally unnecessary, as most architectures will widen on load.

As much as I like the idea, honestly a single thousand-line PR changing pretty much all the files in the repo is a painful way to do this.

> The failing test ./Test/astc_test_functional.py also fails on the current master branch, and even on the 5.0.0 branch, when I tried it on my machine.

What's your build and platform config (compiler, compiler version, host OS, etc.)?

The main branch runs fine here, and the tests pass on the main branch in CI before this commit. The error looks like a SIGABRT, so it should be possible to narrow down.

@camel-cdr (Contributor, Author) commented Mar 13, 2025

OK, I tried setting up a fresh Ubuntu 24.04 Docker instance; there everything ran fine with the main branch, and I got the same error from this PR.
Looking at the error messages I got on my system again, it looks like it fails inside ImageMagick, so I probably just have a broken ImageMagick install.
I'm running Debian unstable, which has ImageMagick 7.1.1-43 (Ubuntu 24.04 is on 6.9.12).
The errors are reproducible when I change the Docker image to debian:sid:

FROM debian:sid

# Build tools plus the Python test harness dependencies
RUN apt -y update && apt -y upgrade && apt -y install build-essential git cmake vim python3 clang g++ python3-numpy python3-pillow imagemagick
RUN git clone --recursive https://github.com/ARM-software/astc-encoder
WORKDIR /astc-encoder
RUN cmake -B build -DCMAKE_BUILD_TYPE=Release -DASTCENC_ISA_NONE=on -DCMAKE_INSTALL_PREFIX=/astc-encoder
RUN cmake --build build -j$(nproc)
RUN cmake --install build

# Print a reminder of the repro command on shell startup
RUN echo 'echo "test with: python3 ./Test/astc_test_functional.py --encoder none"' > ~/.bashrc

Since sid is unstable, it may be a problem with the imagemagick build, but it might also be due to the newer package version.

> Memory size matters, so changing the sizes of fields in structs is generally unnecessary, as most architectures will widen on load.

> As much as I like the idea, honestly a single thousand-line PR changing pretty much all the files in the repo is a painful way to do this.

Fair enough, I could go through the code again and make sure to only convert "counts" and values used for indexing.
Is this something I should continue with, and if so, do you have suggestions for how to split it up?


On that topic, I'm currently experimenting with porting the SIMD abstraction to support variable-length SIMD ISAs.
So instead of SVE at a fixed vector length, there would be one backend that works for any SVE vector length.

I'm doing it as an experiment to see what it would take to port a smaller project to a style of SIMD abstraction that is compatible with the "sizeless" types of variable-length SIMD ISAs, and how the performance compares.
This entails changing the SIMD abstraction to use the built-in SIMD types directly, not wrapped in a struct, which makes it impossible to use operator overloading, so almost every SIMD operation needs to be touched (see the sketch below).
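
A minimal sketch of the constraint, assuming the Arm C Language Extensions for SVE (the names here are illustrative, not the project's actual abstraction):

#include <arm_sve.h>

// svfloat32_t is a "sizeless" type: it cannot be a class data member, so
// the existing wrapper pattern is unavailable for length-agnostic SVE:
//
//     struct vfloat { svfloat32_t m; };  // does not compile
//
// The abstraction has to alias the builtin type directly...
using vfloat = svfloat32_t;

// ...and since operators cannot be overloaded when all operands have
// builtin types, every arithmetic op becomes a named function instead:
static inline vfloat vadd(vfloat a, vfloat b)
{
    return svadd_f32_x(svptrue_b32(), a, b);
}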

If those changes were to be upstreamed, they would run into a similar huge-changes problem, and in contrast to the size_t change it is almost impossible to subdivide.

@solidpixel (Contributor) commented Mar 13, 2025

> Fair enough, I could go through the code again and make sure to only convert "counts" and values used for indexing.
> Is this something I should continue with, and if so, do you have suggestions for how to split it up?

Let me have a play and see what's worth cleaning up. If there is a benefit I don't mind doing it.

I only have an Intel x86 test machine to hand at the moment, and at least for the AVX2 builds with Clang 14 your PR ends up with no statistically significant performance difference and the code size ends up ~1KB larger. I'd like to take a look at Arm builds in the office, and try a few other compiler versions, just to rule out this configuration being strange ...

> I'm doing it as an experiment to see what it would take to port a smaller project to a style of SIMD abstraction that is compatible with the "sizeless" types of variable-length SIMD ISAs, and how the performance compares.

The problem you'll hit with length-agnostic SVE is that there is no fast way of doing an invariant quad-wise floating-point reduction. We have a requirement for invariant output across all ISAs and compilers.

SVE can do an invariant scalar reduction using FADDA, but there is no "reduce in groups of 4" equivalent, which is what you really need to be able to super-vectorize existing SIMD floating-point code with no reassociation. You can work around it for the fixed-length versions - e.g. AVX2 and SVE256 both just do two 128-bit adds - but for VLA code you'd need a loop. And you really don't want to add an accumulator loop into the middle of every single DSP hot path in a compressor.
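
For illustration, the fixed-length workaround mentioned above might look like this for an 8-wide AVX2 vector (a sketch; the helper name is hypothetical, not the project's actual code):

#include <immintrin.h>

// Accumulate an 8-wide vector into a 4-wide accumulator using the same
// association as two sequential vec4 adds, so the result stays
// bit-identical to the 4-wide SSE/NEON code path.
static inline __m128 haccumulate_avx2(__m128 accum, __m256 v)
{
    __m128 lo = _mm256_castps256_ps128(v);   // lanes 0..3
    __m128 hi = _mm256_extractf128_ps(v, 1); // lanes 4..7
    return _mm_add_ps(_mm_add_ps(accum, lo), hi);
}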

@solidpixel (Contributor)

My conclusion from looking at a fast invariant length-agnostic SVE implementation is that it would only be possible if we stopped using floating point and rewrote the compressor core in integer maths. This probably isn't a stupid idea - we could do a lot in 16-bit ops and do twice as much per cycle - but it's a lot of work ...

@solidpixel (Contributor)

(P.S. Thanks for triaging the ImageMagick issue - I'll see if I can repro that here and work out what's tripping it up)

@solidpixel (Contributor) commented Mar 13, 2025

NEON numbers with AppleClang look more promising - this PR gives a ~2KB reduction in code size, so the increase I saw earlier might well just be x86-64 instruction encoding being weird.

solidpixel marked this pull request as draft March 13, 2025 21:58
solidpixel self-assigned this Mar 13, 2025
@solidpixel (Contributor)

Marking this one as a draft - I'll merge the fixes in chunks in some smaller PRs.

@camel-cdr (Contributor, Author) commented Mar 13, 2025

> And you really don't want to add an accumulator loop into the middle of every single DSP hot path in a compressor.

As long as you can use the same or similar vector instructions, a slightly unrolled loop should be pretty good, considering that all branches will be predicted.

I suppose you could also fill four SVE registers with contiguous data, deinterleave twice, and do four ordered reductions?
If the input data were in raw memory, it would be LD4D + 4x FADDA.
But FADDA is too slow for that; it would only be viable with the latency/throughput of FADDV and a larger vector length: https://godbolt.org/z/K8q14znMx

@solidpixel (Contributor) commented Mar 13, 2025

For this you need to be invariant with the existing 4-wide code path, to keep compatibility with SSE/NEON, not with scalar code. You don't need to use FADDA (which would indeed be horribly slow, as it's a linear dependency chain).

Pseudo-code for what you want is something like:

// Init accumulator
vfloat4 accumulator = vfloat4::zero();

foreach vfloat vector in texels:
    // Do something - e.g. encoding error calculation

    // Accumulate something lane-wise in vec4 chunks
    foreach vfloat4 chunk in vfloat vector:
        accumulator += chunk;

// Scalar reduction happens outside of the loop, and only on the vfloat4 part
float sum = hadd_s(accumulator);
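
A compilable scalar model of that pattern (a sketch to show the ordering, not the project's actual code): lane i of the accumulator always sums elements i, i+4, i+8, ..., and the only cross-lane reduction happens once at the end, in a fixed order.

#include <cstddef>

// count is assumed to be a multiple of 4 to keep the sketch short
float invariant_sum(const float* texels, size_t count)
{
    float accum[4] = { 0.0f, 0.0f, 0.0f, 0.0f };

    // "Accumulate something lane-wise in vec4 chunks"
    for (size_t i = 0; i < count; i += 4)
    {
        for (size_t lane = 0; lane < 4; lane++)
        {
            accum[lane] += texels[i + lane];
        }
    }

    // The hadd_s equivalent: a single reduction outside the loop, with a
    // fixed association so every ISA produces the same bit pattern
    return ((accum[0] + accum[1]) + accum[2]) + accum[3];
}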

@solidpixel (Contributor) commented Mar 13, 2025

I rambled a bit about how we ended up with the current design here, if it's of interest (although the blog predates SVE):

@camel-cdr (Contributor, Author)

Thanks for the great blog post; the one on SVE was also quite insightful.
I'll keep playing around with the code base on my end.

@solidpixel (Contributor)

Closing this one - will merge piece-wise in new PRs.

solidpixel closed this Mar 28, 2025