# RFC for vector length agnostic SVE Vectorized class #73

**File:** RFC-0044-sve-vectorized-class.md (new file, 115 additions)

# Vector length agnostic SVE class

**Authors:**
* @Ryo-not-rio


## **Summary**
A vector length agnostic implementation of the `Vectorized` class for SVE.

## **Motivation**
PyTorch contains a `Vectorized` class that wraps the SIMD intrinsics of different architectures: a vector is stored as a class attribute, and the class methods act as wrappers around the intrinsics. This works well for non-scalable vectors but poses an issue for SVE, whose sizeless vector types cannot be stored as class attributes. The current workaround is the compiler flag `-msve-vector-bits=\<bits\>`, but this is not ideal because it 1. requires separate `Vectorized` classes for different vector lengths and 2. does not allow runtime detection of the actual vector length. We currently only have an implementation of the `Vectorized` class for 256-bit SVE machines, but as we think about adding support for different vector lengths, we need to consider how to avoid code duplication, as raised by @malfet [here](https://github.com/pytorch/pytorch/pull/138388#issuecomment-2635612409). This RFC aims to solve the issue by creating a `Vectorized` class that detects the vector length at runtime, as SVE is intended to be used, allowing us to support different vector lengths without writing any duplicate code.
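
For reference, a minimal sketch of the current fixed-width workaround (illustrative names, not PyTorch's actual classes): the ACLE `arm_sve_vector_bits` attribute, enabled by `-msve-vector-bits=256`, gives an SVE type a fixed size so it can be a class member, at the cost of hard-wiring the class to one vector length.

```
#include <arm_sve.h>

// Only compiles when -msve-vector-bits=<bits> is set, which defines
// __ARM_FEATURE_SVE_BITS and lets the attribute fix the vector size.
typedef svfloat32_t vls_float32_t
    __attribute__((arm_sve_vector_bits(__ARM_FEATURE_SVE_BITS)));

struct Vectorized256Float {  // illustrative name
  vls_float32_t values;      // legal as a member: the width is now fixed
  static constexpr int size() {
    return __ARM_FEATURE_SVE_BITS / 32;  // 8 lanes at 256 bits
  }
};
```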

## **Proposed Implementation**
The basic premise of our proposal is to store not the SVE vector itself but an array in our `Vectorized` class, which we load from and store to with each operation. A minimal version is shown at the end.

Now this introduces an obvious overhead: an additional load and store operation with each op. However, the compiler is able to optimize these out under the following conditions:

1. The `-O3` flag is set

   > **Review comment:** -O3 can introduce other issues, including code bloat. How do you handle that?
   >
   > **Author (@Ryo-not-rio):** If I'm not mistaken, PyTorch is built with -O3 in release mode, so there shouldn't be any extra code bloat.
   >
   > **Reviewer:** That is for the PyPI release, but deployments of PyTorch may be built with different flags. Note also that, as this is a header-only change, the class can be reused in other implementations, including custom ops that don't live inside PyTorch, and a custom op library may not be built with -O3 for size reasons (imagine, for example, deploying on mobile). So I do think we should treat this aspect carefully and at least call it out clearly.
   >
   > **Author:** Thanks for your insight; that is something I hadn't considered, so I'll add it to the RFC. I'm not sure there's any mitigation we can do at the moment. I'll try less aggressive optimization flags to see how much the compiler can still optimize this aspect.

2. The `svptrue_b32()` predicate is used
3. You are storing to and then loading from the same pointer

By ensuring these conditions are met and by inlining the functions, we can rely on the compiler to optimize out the duplicate loads and stores, ensuring we do not introduce any regressions.
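
As an illustration (a hypothetical kernel using the minimal class shown below, not code from the RFC), chaining two operations creates store/load round-trips through the intermediate arrays that the compiler is expected to elide when the above conditions hold:

```
// Compiled with -O3, with every op predicated by svptrue_b32(), and with
// each store/load pair hitting the same values pointer, the round-trips
// below should fold away, leaving roughly one ld1w, one fabs and one st1w.
inline void abs_kernel(const float* src, float* dst) {
  Vectorized<float> v = Vectorized<float>::loadu(src);  // ld1w (+ st1w, elided)
  v = v.abs();                                          // ld1w/st1w pair elided
  v.store(dst);                                         // final st1w
}
```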

> **Review comment (@cfRod, May 21, 2025):** Is there any risk of different behaviour across different compilers/versions?
>
> **Author:** Yes, I guess we'll have to make sure this optimization happens on all of our supported compilers.


### The size problem
We face a challenge with this implementation due to the constraint that the `size()` function must be constexpr. The `size()` function, which returns the number of elements in the `Vectorized` class, cannot be constexpr in our implementation because SVE vector lengths are unknown at compile time. We propose changing it to be const instead of constexpr.

```
#include <arm_sve.h>

template <>
class Vectorized<float> {
  // Sized for the largest possible SVE vector: 2048 bits = 64 floats
  float values[64];

 public:
  using size_type = int;

  // Cannot be constexpr: the vector length is only known at runtime
  static inline size_type size() {
    return svcntw();  // number of 32-bit lanes in an SVE vector
  }

  static inline Vectorized<float> loadu(const float* vs) {
    Vectorized<float> v;
    svfloat32_t vec = svld1_f32(svptrue_b32(), vs);
    svst1_f32(svptrue_b32(), v.values, vec);
    return v;
  }

  inline void store(void* ptr) const {
    svfloat32_t vec = svld1_f32(svptrue_b32(), values);
    svst1_f32(svptrue_b32(), static_cast<float*>(ptr), vec);
  }

  inline Vectorized<float> abs() const {
    Vectorized<float> result;
    svfloat32_t vec = svld1_f32(svptrue_b32(), values);
    vec = svabs_f32_x(svptrue_b32(), vec);
    svst1_f32(svptrue_b32(), result.values, vec);  // don't mutate *this
    return result;
  }
};
```
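
As a usage sketch (hypothetical, not taken from the proof-of-concept code), a kernel built on this class strides by the runtime `size()`, so loop bounds are no longer compile-time constants:

```
#include <cmath>
#include <cstdint>

// In-place absolute value over a buffer, strided by the runtime VL.
void abs_inplace(float* data, int64_t n) {
  const int64_t step = Vectorized<float>::size();  // svcntw() at runtime
  int64_t i = 0;
  for (; i + step <= n; i += step) {
    Vectorized<float>::loadu(data + i).abs().store(data + i);
  }
  for (; i < n; ++i) {  // scalar tail
    data[i] = std::fabs(data[i]);
  }
}
```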

## **Metrics**
- Reduction of code duplication
- Speedup of PyTorch on SVE machines with non-256-bit vectors
  - Softmax sped up by 2.73x on Neoverse V2
  - X * sigmoid * softmax sped up by 1.65x on Neoverse V2
- No speed or accuracy regression on 256-bit vectors

> **Review comment:** Are these numbers similar for eager mode or compile?
>
> **Author (@Ryo-not-rio):** For eager mode.

> **Review comment:** One question I have is this: the true power of SVE, I presume, is that the same binary compiled for one SVE implementation can work on another and still realize a performance gain. Does this mean that a PyTorch distribution built for Arm, when pip installed, can take advantage of the vector length of that machine? Is that how we plan to realize the performance gain?
>
> **Author:** Yes. Currently we only have SVE support for 256-bit vector machines; with this change we can do as you suggested: build one version of PyTorch for aarch64 and use it on all aarch64 machines without worrying about vector length.


## **Drawbacks**
### Implementation cost
This is a large change that requires an overhaul of all of the current SVE `Vectorized` classes, as well as of any code that expects the `size()` function to be constexpr. The first cost can be mitigated by updating the `Vectorized` classes one by one, but the `size()` change will need to be made all at once.

### Side effects from non-constexpr size()
A number of functions use `size()` to initialize an array. These will have to be changed to an alternative such as a `std::vector` (see the sketch below). Since a vector is backed by an array under the hood, we hope this will not cause any regressions, but thorough benchmarking of these functions is needed to confirm this.
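
For illustration (assumed shapes, not specific PyTorch call sites), the pattern that breaks and one possible replacement:

```
#include <vector>

// Before: a stack array, legal only while size() is constexpr.
// float buffer[Vectorized<float>::size()];

// After: size() is runtime-only, so the buffer must be sized dynamically.
std::vector<float> buffer(Vectorized<float>::size());
```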

> **Review comment:** I would be concerned about this. Arrays go on the stack; vectors go on the heap.


### Memory footprint increase
By storing an array sized for the maximum SVE vector length (2048 bits is the architectural maximum; 512 bits is the longest available in hardware), the memory footprint increases by `2048 bits x number of simultaneously live Vectorized instances`. Since `Vectorized` objects are created and destroyed inside loops with only a few instances existing simultaneously, we expect this effect to be minimal, but we should benchmark it using actual models. We could also limit the effect by sizing the array for the maximum vector length currently available in hardware, with scope to change this if necessary.
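
Back-of-envelope, for the float specialization (the sizes follow directly from the array lengths):

```
// Per-instance payload of the values array at different cap choices:
static_assert(sizeof(float[64]) == 256, "2048-bit architectural maximum");
static_assert(sizeof(float[16]) == 64, "512-bit hardware maximum");
static_assert(sizeof(float[8]) == 32, "current 256-bit class");
```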

## **Alternatives**
To keep the `size()` function constexpr, we considered setting the size of the `Vectorized` class to the maximum possible SVE vector length and loading multiple vectors as necessary (a sketch follows this list). However, this poses the following problems:

1. It unnecessarily increases the tail-loop size for machines with smaller vector lengths.
2. The compiler can no longer optimize out the duplicate loads and stores, as multiple vectors need to be handled consecutively.
3. Detecting the number of vectors needed introduces extra overhead.

Due to these issues combined, especially 2., this alternative introduces a ~30% overhead compared to the proposed implementation.
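
For reference, a minimal sketch of this rejected alternative's `loadu` (hypothetical code, mirroring the shape of the proposal above):

```
// Fixed 2048-bit class: every op loops over however many hardware
// vectors are needed to cover all 64 floats.
static inline Vectorized<float> loadu(const float* vs) {
  Vectorized<float> v;
  const uint64_t vl = svcntw();  // detect lanes per vector (problem 3)
  for (uint64_t i = 0; i < 64; i += vl) {
    // consecutive loads and stores the compiler cannot fold (problem 2)
    svfloat32_t vec = svld1_f32(svptrue_b32(), vs + i);
    svst1_f32(svptrue_b32(), v.values + i, vec);
  }
  return v;
}
```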

## **Unresolved questions**
* How much of the existing code has to change due to changing `size()` from constexpr to const?
* What performance regression, if any, will we see due to changing `size()` from constexpr to const?
* Will we potentially have to write code specific to SVE due to the `size()` change?

> **Review comment:** Are there any plans to add specific tests?
>
> **Author (@Ryo-not-rio):** For unit tests, the existing tests should cover us; for perf tests, we should run the whole set of integration tests.


## Resolution
TBD

### Level of Support
Choose one of the following:
* 1: Overwhelming positive feedback.
* 2: Positive feedback.
* 3: Majority Acceptance, with conflicting Feedback.
* 4: Acceptance, with Little Feedback.
* 5: Unclear Resolution.
* 6: RFC Rejected.
* 7: RFC Rejected, with Conflicting Feedback.


#### Additional Context
[Working proof of concept code](https://github.com/Ryo-not-rio/pytorch/commit/b2e5c66017fb48230d1ea2493b8548ad76d88fcf)


### Next Steps
TBD


#### Tracking issue
https://github.com/pytorch/pytorch/issues/153471


#### Exceptions
TBD