-
Notifications
You must be signed in to change notification settings - Fork 258
Use size_t instead of unsigned int #555
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I'm not sure why the tests are failing.
The failing test |
Fair enough, but you've also changed a lot of things that are not uses as either counters or array indices, and that are not used as sizes in "maths" at all most of the time (e.g. flags). Memory size matters, so changing the sizes of things in structs is generally unnecessary as most architectures will widen on load. As much as I like the idea, honestly a single thousand line PR changing pretty much all the files in the repo is a painful way to do this.
What's your build and platform config (compiler, compiler version, host OS, etc)? The main branch runs fine here, and the tests pass on the main branch in CI before this commit. The error looks like a SIGABRT, so it should be possible to narrow down. |
Ok, I tried setting up a new ubuntu 24.04 docker instance and there everything ran fine with the main branch, and got the same error from this PR. FROM debian:sid
RUN apt -y update && apt -y upgrade && apt -y install build-essential git cmake vim python3 clang g++ python3-numpy python3-pillow imagemagick
RUN git clone --recursive https://github.com/ARM-software/astc-encoder
WORKDIR /astc-encoder
RUN cmake -B build -DCMAKE_BUILD_TYPE=Release -DASTCENC_ISA_NONE=on -DCMAKE_INSTALL_PREFIX=/astc-encoder
RUN cmake --build build -j$(nproc)
RUN cmake --install build
RUN echo 'echo "test with: python3 ./Test/astc_test_functional.py --encoder none"' > ~/.bashrc Since sid is unstable, it may be a problem with the imagemagick build, but it might also be due to the newer package version.
Fair enough, I could go through the code again and make sure to only convert "counts" and stuff used for indexing. On that topic, I'm currently experimenting with porting the SIMD abstraction to support variable length SIMD ISAs. I'm doing it as an experiment to see what it would take to port a smaller project to a style of SIMD abstraction that is compatible with the "sizeless" types of variable length SIMD ISAs and how the performance compares. If those changes were to be upstreamed, it would would run into a a similar huge-changes problem and in contrast to the |
Let me have a play and see what's worth cleaning up. If there is a benefit I don't mind doing it. I only have an Intel x86 test machine to hand at the moment, and at least for the AVX2 builds with Clang 14 your PR ends up with no statistically significant performance difference and the code size ends up ~1KB larger. I'd like to take a look at Arm builds in the office, and try a few other compiler versions, just to rule out this configuration being strange ...
The problem you'll hit with length agnostic SVE is that there is no fast way of doing a invariant quad-wise floating point reduction. We have a requirement to have invariant output across all ISAs/compilers. SVE can do invariant scalar reduction using |
My conclusion for writing a fast invariant length-agnostic SVE implementation is that it would only be possible if we stopped using floating point, and rewrote the compressor core in integer maths. This probably isn't a stupid idea - we could do a lot in 16-bit ops and do twice as much per cycle - but it's a lot of work ... |
(P.S. Thanks for triaging the ImageMagick issue - I'll see if I can repro that here and work out what's tripping it up) |
NEON numbers with AppleClang look more promising - this PR gives ~2K reduction in code size, so the increase I saw earlier might well just be x86-64 instruction encoding being weird. |
Marking this one as a draft - I'll merge the fixes in chunks in some smaller PRs. |
As long as you can use the same or similar vector instructions, a slightly unrolled loop should be pretty good, considering that all branches will be predicted. I suppose you could also fill four SVE registers with continuous data, deinterleave twice, and do four ordered reductions? |
For this you need to be invariant with the existing 4-wide code path, allowing compatibility with SSE/NEON, not scalar code. You don't need to use Pseudo-code for what you want is something like:
|
I rambled a bit about how we ended up with the current design here, if it's of interest (although the blog predates SVE): |
Thanks, for the great blog post, the one on SVE was also quite insightful. |
Closing this one - will merge piece-wise in new PRs. |
Currently both
size_t
andunsigned int
are used as counters and to index array in separate parts of the code.Using
size_t
as a default unsigned integer type is usually preferable, as it usually matches the native GPR width and thus produces better codegen, especially when used to index into arrays.I measured the difference on the hardware I have and got a small improvement across all tested cores:
The ARM measurements are from my phone. I made sure to taskset to the correct cores and ran the old vs new simultaneously, on separate cores. There may have been some android throttling due to standby mode, so the runs between different cores may not be directly comparable.
The code changes are just a direct replacement of
unsigned int
withsize_t
when applicable, with the following exceptions:I needed to introduce a
_z
user defined literal forsize_t
constants, because theastc::
math functions, that now operate onsize_t
, sometimes get literals past to them.The printf format specifiers needed to be adjusted. Whiles doing so, I noticed that a few hard mismatching signedness:
astc-encoder/Source/astcenccli_error_metrics.cpp
Lines 137 to 141 in b5b87ef
astc-encoder/Source/astcenccli_toplevel.cpp
Line 1254 in b5b87ef
astc-encoder/Source/astcenccli_toplevel.cpp
Line 2137 in b5b87ef