
Use size_t instead of unsigned int #555

Closed

Conversation

@camel-cdr (Contributor) commented Mar 13, 2025

Currently both size_t and unsigned int are used as counters and to index arrays in different parts of the code.

Using size_t as the default unsigned integer type is usually preferable, as it matches the native GPR width and thus produces better codegen, especially when used to index into arrays.
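
To illustrate the codegen point, here is a hypothetical example (not code from this PR): on LP64 targets a 32-bit unsigned index may need to be zero-extended to pointer width before each address computation, while a size_t index is already GPR-wide.

#include <cstddef>

// Hypothetical sketch: the unsigned int index is narrower than a pointer,
// so the compiler may emit a zero-extend per iteration before addressing.
float sum_uint(const float* data, unsigned int count)
{
    float sum = 0.0f;
    for (unsigned int i = 0; i < count; i++)
    {
        sum += data[i]; // index may be widened to 64 bits on each use
    }
    return sum;
}

// With size_t the index already matches the pointer width, so the
// address computation needs no extension.
float sum_size_t(const float* data, size_t count)
{
    float sum = 0.0f;
    for (size_t i = 0; i < count; i++)
    {
        sum += data[i];
    }
    return sum;
}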

I measured the difference on the hardware I have and got a small improvement across all tested cores:

                 speedup
             before vs after
Zen1:       +2.181x vs +2.189x
Cortex-X1:  +1.917x vs +1.936x
Cortex-A78: +1.714x vs +1.739x
Cortex-A55: +1.726x vs +1.767x


Build cmd: cmake -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc)
Benchmark: python3 ./Test/astc_test_image.py --encoder native --test-set Small --test-quality all --repeats 5 -j 1

The ARM measurements are from my phone. I made sure to taskset to the correct cores and ran the old and new builds simultaneously on separate cores. There may have been some Android throttling due to standby mode, so runs on different cores may not be directly comparable.

The code changes are just a direct replacement of unsigned int with size_t where applicable, with the following exceptions:

  1. I needed to introduce a _z user-defined literal for size_t constants, because the astc:: math functions, which now operate on size_t, sometimes get literals passed to them (see the sketch after the examples below).

  2. The printf format specifiers needed to be adjusted. While doing so, I noticed that a few had mismatched signedness:

printf("WARNING: Only intersection of images will be compared:\n"
" Image 1: %dx%dx%d\n"
" Image 2: %dx%dx%d\n",
img1->dim_x, img1->dim_y, img1->dim_z,
img2->dim_x, img2->dim_y, img2->dim_z);

printf(" RGB alpha scale weight: %d\n", (config.flags & ASTCENC_FLG_USE_ALPHA_WEIGHT));

printf(" Components: %d\n\n", image_uncomp_in_component_count);

@camel-cdr (Contributor, Author) commented Mar 13, 2025

I'm not sure why the tests are failing.

./Test/astc_test_image.py completes without issue, and after reviewing the code changes again I didn't see anything that would cause this error.

The failing test ./Test/astc_test_functional.py also fails on the current master branch, and even on the 5.0.0 branch, when I tried it on my machine.

@solidpixel (Contributor) commented Mar 13, 2025

> Currently both size_t and unsigned int are used as counters and to index arrays in different parts of the code.

Fair enough, but you've also changed a lot of things that are not used as either counters or array indices, and that are, most of the time, not used as sizes in "maths" at all (e.g. flags). Memory size matters, so changing the sizes of fields in structs is generally unnecessary, as most architectures will widen on load.

As much as I like the idea, honestly a single thousand-line PR changing pretty much all the files in the repo is a painful way to do this.

> The failing test ./Test/astc_test_functional.py also fails on the current master branch, and even on the 5.0.0 branch, when I tried it on my machine.

What's your build and platform config (compiler, compiler version, host OS, etc.)?

The main branch runs fine here, and the tests pass on the main branch in CI before this commit. The error looks like a SIGABRT, so it should be possible to narrow down.

@camel-cdr (Contributor, Author) commented Mar 13, 2025

OK, I tried setting up a fresh Ubuntu 24.04 Docker instance; there everything ran fine with the main branch, and I got the same error from this PR.
Looking at the error messages I got on my system again, it looks like it fails inside ImageMagick, so I probably just have a broken ImageMagick install.
I'm running Debian unstable, which has ImageMagick 7.1.1-43 (Ubuntu 24.04 is on 6.9.12).
The errors are reproducible when I change the Docker image to debian:sid:

FROM debian:sid

# Build tools plus the Python test harness dependencies
RUN apt -y update && apt -y upgrade && apt -y install build-essential git cmake vim python3 clang g++ python3-numpy python3-pillow imagemagick
RUN git clone --recursive https://github.com/ARM-software/astc-encoder
WORKDIR /astc-encoder
RUN cmake -B build -DCMAKE_BUILD_TYPE=Release -DASTCENC_ISA_NONE=on -DCMAKE_INSTALL_PREFIX=/astc-encoder
RUN cmake --build build -j$(nproc)
RUN cmake --install build

# Print a reminder of the repro command on shell startup
RUN echo 'echo "test with: python3 ./Test/astc_test_functional.py --encoder none"' > ~/.bashrc

Since sid is unstable, it may be a problem with the imagemagick build, but it might also be due to the newer package version.

> Memory size matters, so changing the sizes of fields in structs is generally unnecessary, as most architectures will widen on load.

> As much as I like the idea, honestly a single thousand-line PR changing pretty much all the files in the repo is a painful way to do this.

Fair enough, I could go through the code again and make sure to only convert "counts" and values used for indexing.
Is this something I should continue with, and if so, do you have suggestions for how to split it up?


On that topic, I'm currently experimenting with porting the SIMD abstraction to support variable-length SIMD ISAs.
So instead of SVE at a fixed vector length, there would be one backend that works for any SVE vector length.

I'm doing it as an experiment to see what it would take to port a smaller project to a style of SIMD abstraction that is compatible with the "sizeless" types of variable-length SIMD ISAs, and how the performance compares.
This entails changing the SIMD abstraction to use the built-in SIMD types directly, not wrapped in a struct, which makes it impossible to use operator overloading, so almost every SIMD operation needs to be touched (see the sketch below).
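
A minimal sketch of the constraint, assuming the Arm C Language Extensions for SVE (the names here are illustrative, not the project's actual abstraction):

#include <arm_sve.h>

// svfloat32_t is a "sizeless" type: it cannot be a class data member, so
// the existing wrapper pattern is unavailable for length-agnostic SVE:
//
//     struct vfloat { svfloat32_t m; };  // does not compile
//
// The abstraction has to alias the builtin type directly...
using vfloat = svfloat32_t;

// ...and since operators cannot be overloaded when all operands have
// builtin types, every arithmetic op becomes a named function instead:
static inline vfloat vadd(vfloat a, vfloat b)
{
    return svadd_f32_x(svptrue_b32(), a, b);
}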

If those changes were to be upstreamed, they would run into a similar huge-changes problem, and in contrast to the size_t change it is almost impossible to subdivide.

@solidpixel (Contributor) commented Mar 13, 2025

> Fair enough, I could go through the code again and make sure to only convert "counts" and values used for indexing.
> Is this something I should continue with, and if so, do you have suggestions for how to split it up?

Let me have a play and see what's worth cleaning up. If there is a benefit I don't mind doing it.

I only have an Intel x86 test machine to hand at the moment, and at least for the AVX2 builds with Clang 14 your PR ends up with no statistically significant performance difference and the code size ends up ~1KB larger. I'd like to take a look at Arm builds in the office, and try a few other compiler versions, just to rule out this configuration being strange ...

> I'm doing it as an experiment to see what it would take to port a smaller project to a style of SIMD abstraction that is compatible with the "sizeless" types of variable-length SIMD ISAs, and how the performance compares.

The problem you'll hit with length-agnostic SVE is that there is no fast way of doing an invariant quad-wise floating-point reduction. We have a requirement for invariant output across all ISAs and compilers.

SVE can do an invariant scalar reduction using FADDA, but there is no "reduce in groups of 4" equivalent, which is what you really need to be able to super-vectorize existing SIMD floating-point code with no reassociation. You can work around it for the fixed-length versions - e.g. AVX2 and SVE256 both just do two 128-bit adds - but for VLA code you'd need a loop. And you really don't want to add an accumulator loop into the middle of every single DSP hot path in a compressor.
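
For illustration, the fixed-length workaround mentioned above might look like this for an 8-wide AVX2 vector (a sketch; the helper name is hypothetical, not the project's actual code):

#include <immintrin.h>

// Accumulate an 8-wide vector into a 4-wide accumulator using the same
// association as two sequential vec4 adds, so the result stays
// bit-identical to the 4-wide SSE/NEON code path.
static inline __m128 haccumulate_avx2(__m128 accum, __m256 v)
{
    __m128 lo = _mm256_castps256_ps128(v);   // lanes 0..3
    __m128 hi = _mm256_extractf128_ps(v, 1); // lanes 4..7
    return _mm_add_ps(_mm_add_ps(accum, lo), hi);
}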

@solidpixel (Contributor)

My conclusion from looking at a fast invariant length-agnostic SVE implementation is that it would only be possible if we stopped using floating point and rewrote the compressor core in integer maths. This probably isn't a stupid idea - we could do a lot in 16-bit ops and do twice as much per cycle - but it's a lot of work ...

@solidpixel (Contributor)

(P.S. Thanks for triaging the ImageMagick issue - I'll see if I can repro that here and work out what's tripping it up)

@solidpixel (Contributor) commented Mar 13, 2025

NEON numbers with AppleClang look more promising - this PR gives a ~2KB reduction in code size, so the increase I saw earlier might well just be x86-64 instruction encoding being weird.

solidpixel marked this pull request as draft March 13, 2025 21:58
solidpixel self-assigned this Mar 13, 2025
@solidpixel (Contributor)

Marking this one as a draft - I'll merge the fixes in chunks in some smaller PRs.

@camel-cdr (Contributor, Author) commented Mar 13, 2025

> And you really don't want to add an accumulator loop into the middle of every single DSP hot path in a compressor.

As long as you can use the same or similar vector instructions, a slightly unrolled loop should be pretty good, considering that all branches will be predicted.

I suppose you could also fill four SVE registers with contiguous data, deinterleave twice, and do four ordered reductions?
If the input data were in raw memory, it would be LD4D + 4x FADDA.
But FADDA is too slow for that; it would only be viable with the latency/throughput of FADDV and a larger vector length: https://godbolt.org/z/K8q14znMx

@solidpixel (Contributor) commented Mar 13, 2025

For this you need to be invariant with the existing 4-wide code path, to keep compatibility with SSE/NEON, not with scalar code. You don't need to use FADDA (which would indeed be horribly slow, as it's a linear dependency chain).

Pseudo-code for what you want is something like:

// Init accumulator
vfloat4 accumulator = vfloat4::zero();

foreach vfloat vector in texels:
    // Do something - e.g. encoding error calculation

    // Accumulate something lane-wise in vec4 chunks
    foreach vfloat4 chunk in vfloat vector:
        accumulator += chunk;

// Scalar reduction happens outside of the loop, and only on the vfloat4 part
float sum = hadd_s(accumulator);
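
A compilable scalar model of that pattern (a sketch to show the ordering, not the project's actual code): lane i of the accumulator always sums elements i, i+4, i+8, ..., and the only cross-lane reduction happens once at the end, in a fixed order.

#include <cstddef>

// count is assumed to be a multiple of 4 to keep the sketch short
float invariant_sum(const float* texels, size_t count)
{
    float accum[4] = { 0.0f, 0.0f, 0.0f, 0.0f };

    // "Accumulate something lane-wise in vec4 chunks"
    for (size_t i = 0; i < count; i += 4)
    {
        for (size_t lane = 0; lane < 4; lane++)
        {
            accum[lane] += texels[i + lane];
        }
    }

    // The hadd_s equivalent: a single reduction outside the loop, with a
    // fixed association so every ISA produces the same bit pattern
    return ((accum[0] + accum[1]) + accum[2]) + accum[3];
}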

@solidpixel (Contributor) commented Mar 13, 2025

I rambled a bit about how we ended up with the current design here, if it's of interest (although the blog predates SVE):

@camel-cdr (Contributor, Author)

Thanks for the great blog post; the one on SVE was also quite insightful.
I'll keep playing around with the code base on my end.

@solidpixel (Contributor)

Closing this one - will merge piece-wise in new PRs.

solidpixel closed this Mar 28, 2025