From 8383d778a8b039461299723bcec121bd796afbdd Mon Sep 17 00:00:00 2001
From: Ryo Suzuki
Date: Mon, 12 May 2025 11:30:59 +0000
Subject: [PATCH 1/5] RFC for vector length agnostic SVE class

---
 RFC-0044-sve-vectorized-class.md | 112 +++++++++++++++++++++++++++++++
 1 file changed, 112 insertions(+)
 create mode 100644 RFC-0044-sve-vectorized-class.md

diff --git a/RFC-0044-sve-vectorized-class.md b/RFC-0044-sve-vectorized-class.md
new file mode 100644
index 00000000..44a7732a
--- /dev/null
+++ b/RFC-0044-sve-vectorized-class.md
@@ -0,0 +1,112 @@
+# [Vector length agnostic SVE class]
+
+**Authors:**
+* @Ryo-not-rio
+
+
+## **Summary**
+A vector length agnostic implementation of the `Vectorized` class for SVE.
+
+## **Motivation**
+PyTorch contains a "Vectorized" class that wraps different SIMD intrinsics for different architectures by having a vector as a class attribute, with the class methods acting as wrappers around the intrinsics. This works well for non-scalable vectors but poses an issue for SVE due to the inability to store scalable vectors as class attributes. The current workaround for this is to use the compiler flag -msve-vector-bits=\<bits\>; however, this is not ideal as it 1. requires separate vectorized classes for different vector lengths and 2. does not allow for runtime detection of the actual vector length. We currently only have an implementation of the Vectorized class for 256-bit SVE machines, but as we think about adding support for different vector lengths, we need to consider how to avoid code duplication, as raised by @malfet [here](https://github.com/pytorch/pytorch/pull/138388#issuecomment-2635612409). This RFC aims to solve the issue by creating a Vectorized class that detects the vector length at runtime, as SVE is intended to be used, allowing us to support different vector lengths without writing any duplicate code.
+
+## **Proposed Implementation**
+The basic premise of our proposal is to store not the vector but an array in our vectorized class, which we will load from and store to with each operation. A minimal version is shown at the end.
+
+Now this introduces quite an obvious overhead of an additional load and store operation with each op. However, the compiler is able to optimize these out under the following conditions:
+
+1. The -O3 flag is set
+2. The `svptrue_b32()` predicate is used
+3. You are storing to and then loading from the same pointer
+
+By ensuring these conditions are met and inlining the functions, we can rely on the compiler to optimize out the duplicate loads and stores, ensuring we do not introduce any regressions.
+
+### The size problem
+We face a challenge with this implementation due to the constraint of the size() function being constexpr. The size() function, which returns the number of elements in the Vectorized class, cannot be constexpr in our implementation due to SVE vector lengths being unknown at compile time. We propose we change this to be const instead of constexpr.
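+
+To make the caller-side impact concrete, here is a hedged sketch of the kind of change this implies. This is hypothetical caller code, not taken from the PyTorch sources, and it assumes the minimal `Vectorized` class shown below:
+
+```
+// Before: a constexpr size() can appear in template arguments, e.g.
+//   std::array<float, Vectorized::size()> buf;  // requires constexpr size()
+//
+// After: size() is a runtime value, so a caller falls back to a C array
+// sized for the 2048-bit architectural maximum and only touches the first
+// size() lanes.
+void fill_zero() {
+  float buf[64];                     // 64 floats = 2048 bits, the SVE maximum
+  const int n = Vectorized::size();  // e.g. 8 lanes on a 256-bit machine
+  for (int i = 0; i < n; ++i) {
+    buf[i] = 0.0f;
+  }
+}
+```
+
+The minimal vector length agnostic version of the class itself looks as follows: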
+
+```
+class Vectorized {
+ public:
+  using size_type = int;
+  float values[64]; // Maximum number of elements supported by any SVE machine (2048 bits)
+
+  static inline const size_type size() {
+    return svcntw(); // number of 32-bit lanes, read from the hardware at runtime
+  }
+
+  static inline Vectorized loadu(const float* vs) {
+    Vectorized v;
+    svfloat32_t vec = svld1_f32(svptrue_b32(), static_cast<const float32_t*>(vs));
+    svst1_f32(svptrue_b32(), v.values, vec);
+    return v;
+  }
+
+  inline void store(void* ptr) const {
+    svfloat32_t vec = svld1_f32(svptrue_b32(), values);
+    svst1_f32(svptrue_b32(), static_cast<float32_t*>(ptr), vec);
+  }
+
+  inline Vectorized abs() const {
+    svfloat32_t v = svld1_f32(svptrue_b32(), values);
+    v = svabs_f32_x(svptrue_b32(), v);
+    Vectorized out;
+    svst1_f32(svptrue_b32(), out.values, v);
+    return out;
+  }
+};
+```
+
+## **Metrics**
+- Reduction of code duplication
+- Speedup of PyTorch on SVE machines with non-256-bit vectors
+  - Softmax sped up by 2.73x on Neoverse V2
+  - X * sigmoid * softmax sped up by 1.65x on Neoverse V2
+- No speed or accuracy regression on 256-bit vectors
+
+
+## **Drawbacks**
+### Implementation cost
+This is a large change which requires an overhaul of all of the current SVE Vectorized classes as well as any code that expects the size() function to be constexpr. The first cost can be mitigated by updating the Vectorized classes one by one, but the size() change will need to be done all at once.
+
+### Sideffects from non-constexpr size()
+There are a number of functions that use the size() function to initialize an array. These will have to be changed to an alternative such as a vector. Since a vector is implemented as an array under the hood, we hope this will not cause any regressions, but a thorough benchmarking of these functions needs to be done to ensure that this is the case.
+
+## **Alternatives**
+To keep the size() function constexpr, we considered setting the size of the Vectorized class to be the maximum possible SVE vector length (currently 512 bits) and loading multiple vectors as necessary. However, this poses the following problems:
+
+1. Increases the tail loop size unnecessarily for machines with smaller vector lengths.
+2. Compiler can no longer optimize out duplicate loads and stores as multiple vectors need to be handled consecutively.
+3. Detection of the number of vectors needed introduces extra overhead.
+
+Due to these issues combined, especially 2., this alternative introduces a ~30% overhead compared to the current implementation.
+
+## **Unresolved questions**
+* How much of the existing code has to change due to changing the size() from constexpr to const?
+* What, if any, performance regression will we see due to changing size() from constexpr to const?
+* Will we potentially have to write code specific to SVE due to the size() change?
+
+
+## Resolution
+TBD
+
+### Level of Support
+Choose one of the following:
+* 1: Overwhelming positive feedback.
+* 2: Positive feedback.
+* 3: Majority Acceptance, with conflicting Feedback.
+* 4: Acceptance, with Little Feedback.
+* 5: Unclear Resolution.
+* 6: RFC Rejected.
+* 7: RFC Rejected, with Conflicting Feedback.
+
+
+#### Additional Context
+TBD
+
+
+### Next Steps
+TBD
+
+
+#### Tracking issue
+
+
+
+#### Exceptions
+TBD
\ No newline at end of file

From 517ae4a9f3192e9fc4ad1d275fce419cf04d0fac Mon Sep 17 00:00:00 2001
From: Ryo Suzuki
Date: Tue, 13 May 2025 16:01:00 +0000
Subject: [PATCH 2/5] add github tracking issue and poc link

---
 RFC-0044-sve-vectorized-class.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/RFC-0044-sve-vectorized-class.md b/RFC-0044-sve-vectorized-class.md
index 44a7732a..295c2203 100644
--- a/RFC-0044-sve-vectorized-class.md
+++ b/RFC-0044-sve-vectorized-class.md
@@ -97,7 +97,7 @@ Choose one of the following:
 
 
 #### Additional Context
-TBD
+[Working proof of concept code](https://github.com/Ryo-not-rio/pytorch/commit/b2e5c66017fb48230d1ea2493b8548ad76d88fcf)
 
 
 ### Next Steps
@@ -105,7 +105,7 @@ TBD
 
 
 #### Tracking issue
-
+https://github.com/pytorch/pytorch/issues/153471
 
 
 #### Exceptions

From 2e31ce0106c3f585a86eceda58d6601716c4d62d Mon Sep 17 00:00:00 2001
From: Ryo Suzuki
Date: Fri, 16 May 2025 09:33:51 +0000
Subject: [PATCH 3/5] address minor comments

---
 RFC-0044-sve-vectorized-class.md | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/RFC-0044-sve-vectorized-class.md b/RFC-0044-sve-vectorized-class.md
index 295c2203..aff94779 100644
--- a/RFC-0044-sve-vectorized-class.md
+++ b/RFC-0044-sve-vectorized-class.md
@@ -8,10 +8,10 @@
 A vector length agnostic implementation of the `Vectorized` class for SVE.
 
 ## **Motivation**
-PyTorch contains a "Vectorized" class that wraps different SIMD intrinsics for different architectures by having a vector as a class attribute, with the class methods acting as wrappers around the intrinsics. This works well for non-scalable vectors but poses an issue for SVE due to the inability to store scalable vectors as class attributes. The current workaround for this is to use the compiler flag -msve-vector-bits=\<bits\>; however, this is not ideal as it 1. requires separate vectorized classes for different vector lengths and 2. does not allow for runtime detection of the actual vector length. We currently only have an implementation of the Vectorized class for 256-bit SVE machines, but as we think about adding support for different vector lengths, we need to consider how to avoid code duplication, as raised by @malfet [here](https://github.com/pytorch/pytorch/pull/138388#issuecomment-2635612409). This RFC aims to solve the issue by creating a Vectorized class that detects the vector length at runtime, as SVE is intended to be used, allowing us to support different vector lengths without writing any duplicate code.
+PyTorch contains a `Vectorized` class that wraps different SIMD intrinsics for different architectures by having a vector as a class attribute, with the class methods acting as wrappers around the intrinsics. This works well for non-scalable vectors but poses an issue for SVE due to the inability to store scalable vectors as class attributes. The current workaround for this is to use the compiler flag `-msve-vector-bits=<bits>`; however, this is not ideal as it 1. requires separate `Vectorized` classes for different vector lengths and 2. does not allow for runtime detection of the actual vector length. We currently only have an implementation of the `Vectorized` class for 256-bit SVE machines, but as we think about adding support for different vector lengths, we need to consider how to avoid code duplication, as raised by @malfet [here](https://github.com/pytorch/pytorch/pull/138388#issuecomment-2635612409). This RFC aims to solve the issue by creating a `Vectorized` class that detects the vector length at runtime, as SVE is intended to be used, allowing us to support different vector lengths without writing any duplicate code.
 
 ## **Proposed Implementation**
-The basic premise of our proposal is to store not the vector but an array in our vectorized class, which we will load from and store to with each operation. A minimal version is shown at the end.
+The basic premise of our proposal is to store not the SVE vector but an array in our `Vectorized` class, which we will load from and store to with each operation. A minimal version is shown at the end.
 
 Now this introduces quite an obvious overhead of an additional load and store operation with each op. However, the compiler is able to optimize these out under the following conditions:
@@ -22,7 +22,7 @@ Now this introduces quite an obvious overhead of an additional load and store op
 By ensuring these conditions are met and inlining the functions, we can rely on the compiler to optimize out the duplicate loads and stores, ensuring we do not introduce any regressions.
 
 ### The size problem
-We face a challenge with this implementation due to the constraint of the size() function being constexpr. The size() function, which returns the number of elements in the Vectorized class, cannot be constexpr in our implementation due to SVE vector lengths being unknown at compile time. We propose we change this to be const instead of constexpr.
+We face a challenge with this implementation due to the constraint of the size() function being constexpr. The size() function, which returns the number of elements in the `Vectorized` class, cannot be constexpr in our implementation due to SVE vector lengths being unknown at compile time. We propose we change this to be const instead of constexpr.
 
 ```
 class Vectorized {
@@ -62,13 +62,16 @@ class Vectorized {
 
 ## **Drawbacks**
 ### Implementation cost
-This is a large change which requires an overhaul of all of the current SVE Vectorized classes as well as any code that expects the size() function to be constexpr. The first cost can be mitigated by updating the Vectorized classes one by one, but the size() change will need to be done all at once.
+This is a large change which requires an overhaul of all of the current SVE `Vectorized` classes as well as any code that expects the size() function to be constexpr. The first cost can be mitigated by updating the `Vectorized` classes one by one, but the size() change will need to be done all at once.
 
 ### Sideffects from non-constexpr size()
 There are a number of functions that use the size() function to initialize an array. These will have to be changed to an alternative such as a vector. Since a vector is implemented as an array under the hood, we hope this will not cause any regressions, but a thorough benchmarking of these functions needs to be done to ensure that this is the case.
 
+### Memory footprint increase
+By storing an array sized for the maximum SVE vector length (2048 bits being the architectural maximum and 512 bits being the longest available in hardware), the memory footprint is increased by `2048 bits x number of existing Vectorized instances`. Since `Vectorized` instances are created and destroyed in loops with only a few existing simultaneously, we expect this effect to be minimal, but we should benchmark this using actual models. We could also limit this effect by using the maximum vector size currently available on hardware, with scope to change this if necessary.
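+
+To make the arithmetic concrete, here is a hedged sketch with illustrative constants; `VectorizedFloatSketch` is a hypothetical stand-in for the float specialization:
+
+```
+#include <cstddef>
+
+constexpr std::size_t kMaxSveBits = 2048;  // architectural maximum for SVE
+constexpr std::size_t kMaxFloats = kMaxSveBits / (8 * sizeof(float));  // 64 lanes
+
+struct VectorizedFloatSketch {
+  float values[kMaxFloats];  // 256 bytes per instance, even on 128-bit hardware
+};
+
+static_assert(sizeof(VectorizedFloatSketch) == 256, "2048 bits = 256 bytes");
+```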
+
 ## **Alternatives**
-To keep the size() function constexpr, we considered setting the size of the Vectorized class to be the maximum possible SVE vector length (currently 512 bits) and loading multiple vectors as necessary. However, this poses the following problems:
+To keep the size() function constexpr, we considered setting the size of the `Vectorized` class to be the maximum possible SVE vector length and loading multiple vectors as necessary. However, this poses the following problems:
 
 1. Increases the tail loop size unnecessarily for machines with smaller vector lengths.
 2. Compiler can no longer optimize out duplicate loads and stores as multiple vectors need to be handled consecutively.

From 65d4cab77e3e0c76586e5718f7373c1f34e61004 Mon Sep 17 00:00:00 2001
From: Ryo Suzuki
Date: Wed, 21 May 2025 14:47:58 +0000
Subject: [PATCH 4/5] Add details of the size() change

---
 RFC-0044-sve-vectorized-class.md | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/RFC-0044-sve-vectorized-class.md b/RFC-0044-sve-vectorized-class.md
index aff94779..118b4760 100644
--- a/RFC-0044-sve-vectorized-class.md
+++ b/RFC-0044-sve-vectorized-class.md
@@ -22,7 +22,7 @@ Now this introduces quite an obvious overhead of an additional load and store op
 By ensuring these conditions are met and inlining the functions, we can rely on the compiler to optimize out the duplicate loads and stores, ensuring we do not introduce any regressions.
 
 ### The size problem
-We face a challenge with this implementation due to the constraint of the size() function being constexpr. The size() function, which returns the number of elements in the `Vectorized` class, cannot be constexpr in our implementation due to SVE vector lengths being unknown at compile time. We propose we change this to be const instead of constexpr.
+We face a challenge with this implementation due to the constraint of the size() function being constexpr. The size() function, which returns the number of elements in the `Vectorized` class, cannot be constexpr in our implementation due to SVE vector lengths being unknown at compile time. We propose we change this to be const instead of constexpr. Currently, size() is used to initialize std::arrays and to instantiate templated functions. These will need to be replaced with C arrays and the template parameters made into function arguments. The full list of changes that need to occur can be seen [here](https://github.com/pytorch/pytorch/commit/fa05c1de3340215da5dc0a32612e75e2816fc143).
 
 ```
 class Vectorized {
@@ -62,14 +62,22 @@ class Vectorized {
 
 ## **Drawbacks**
 ### Implementation cost
-This is a large change which requires an overhaul of all of the current SVE `Vectorized` classes as well as any code that expects the size() function to be constexpr. The first cost can be mitigated by updating the `Vectorized` classes one by one, but the size() change will need to be done all at once.
+This is a large change which requires an overhaul of all of the current SVE `Vectorized` classes as well as any code that expects the size() function to be constexpr. The first cost can be mitigated by updating the `Vectorized` classes one by one.
 
 ### Sideffects from non-constexpr size()
-There are a number of functions that use the size() function to initialize an array. These will have to be changed to an alternative such as a vector. Since a vector is implemented as an array under the hood, we hope this will not cause any regressions, but a thorough benchmarking of these functions needs to be done to ensure that this is the case.
+By changing the size() to non-constexpr, we will be changing a large part of the codebase, which may cause regressions. These will need to be benchmarked thoroughly and, if we choose to accept any regressions, they will need to be limited to aarch64 architectures.
 
 ### Memory footprint increase
 By storing an array sized for the maximum SVE vector length (2048 bits being the architectural maximum and 512 bits being the longest available in hardware), the memory footprint is increased by `2048 bits x number of existing Vectorized instances`. Since `Vectorized` instances are created and destroyed in loops with only a few existing simultaneously, we expect this effect to be minimal, but we should benchmark this using actual models. We could also limit this effect by using the maximum vector size currently available on hardware, with scope to change this if necessary.
 
+## **Benchmarking plan**
+To mitigate the risk from changing the size() from constexpr to const, we propose the following order of patches to PyTorch:
+
+1. Make individual pull requests for each function affected by this change
+2. Benchmark each patch thoroughly on both aarch64 and x86 for regressions
+3. Once all affected functions are merged, switch the Vectorized class to the VLA implementation
+4. Benchmark the VLA Vectorized class
+
 ## **Alternatives**
 To keep the size() function constexpr, we considered setting the size of the `Vectorized` class to be the maximum possible SVE vector length and loading multiple vectors as necessary. However, this poses the following problems:

From 8f69d79205a33f5d21306480bdfaa6910856e2a7 Mon Sep 17 00:00:00 2001
From: Ryo Suzuki
Date: Wed, 4 Jun 2025 11:26:37 +0000
Subject: [PATCH 5/5] Add section about -O3 flag

---
 RFC-0044-sve-vectorized-class.md | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/RFC-0044-sve-vectorized-class.md b/RFC-0044-sve-vectorized-class.md
index 118b4760..cada223a 100644
--- a/RFC-0044-sve-vectorized-class.md
+++ b/RFC-0044-sve-vectorized-class.md
@@ -64,12 +64,15 @@ class Vectorized {
 ### Implementation cost
 This is a large change which requires an overhaul of all of the current SVE `Vectorized` classes as well as any code that expects the size() function to be constexpr. The first cost can be mitigated by updating the `Vectorized` classes one by one.
 
-### Sideffects from non-constexpr size()
+### Side-effects from non-constexpr size()
 By changing the size() to non-constexpr, we will be changing a large part of the codebase, which may cause regressions. These will need to be benchmarked thoroughly and, if we choose to accept any regressions, they will need to be limited to aarch64 architectures.
 
 ### Memory footprint increase
 By storing an array sized for the maximum SVE vector length (2048 bits being the architectural maximum and 512 bits being the longest available in hardware), the memory footprint is increased by `2048 bits x number of existing Vectorized instances`. Since `Vectorized` instances are created and destroyed in loops with only a few existing simultaneously, we expect this effect to be minimal, but we should benchmark this using actual models. We could also limit this effect by using the maximum vector size currently available on hardware, with scope to change this if necessary.
+
+### Side-effects from using -O3
+Using `-O3` increases the package size of PyTorch, which is not an issue for use cases using the PyPI package, where we already use `-O3`, or for devices where package size is not a large concern, such as infrastructure. However, it becomes an issue for clients such as mobile, where package size may matter. In instances where PyTorch is built using lower optimization flags, there will be a regression compared to the current implementation.
+
 ## **Benchmarking plan**
 To mitigate the risk from changing the size() from constexpr to const, we propose the following order of patches to PyTorch: