v2.3.0 - Short-Vectors and Pointer-Shifting
Short-Vectors:
The kernels can now operate on float2* or float4* pointers, so reads and writes to memory are denser and fewer registers are needed to store addresses. When a kernel must both read/write vectors and transpose the matrix, Tensile can handle either reading vectors and writing components, or reading components and writing vectors.
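The "read vectors, write components" case can be illustrated with a small host-side model. This is only a sketch in Python, not Tensile's generated kernel code: `VECTOR_WIDTH`, `load_vector`, and `transpose_read_vectors_write_components` are hypothetical names, and a list slice stands in for a single float4 load.

```python
VECTOR_WIDTH = 4  # assumption: models a float4; use 2 to model a float2


def load_vector(memory, offset, width=VECTOR_WIDTH):
    """Model one vector load: a single address yields `width` contiguous elements."""
    return memory[offset:offset + width]


def transpose_read_vectors_write_components(src, rows, cols, width=VECTOR_WIDTH):
    """Transpose a row-major rows x cols matrix stored in a flat list.

    Reads are vectorized along each source row. Writes are done one
    component at a time, because elements that are adjacent in a source
    row end up `rows` elements apart in the transposed layout.
    Returns the transposed flat list and the number of vector loads issued.
    """
    assert cols % width == 0, "sketch assumes cols is a multiple of the vector width"
    dst = [0.0] * (rows * cols)
    vector_loads = 0
    for r in range(rows):
        for c in range(0, cols, width):
            vec = load_vector(src, r * cols + c, width)  # one address, `width` values
            vector_loads += 1
            for i, value in enumerate(vec):
                dst[(c + i) * rows + r] = value  # component-wise write
    return dst, vector_loads
```

For a 2 x 4 matrix this issues 2 vector loads instead of 8 scalar loads, which is the sense in which the reads are denser and need fewer addresses. The opposite case (reading components and writing vectors) would swap which side of the copy is vectorized.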
Pointer-Shifting:
Rather than using branches to guard against out-of-bounds reads, Tensile can now shift the read pointers so that they read in bounds before the main summation loop, then reorganize the accumulation registers after the main loop before writing the results. This protects against out-of-bounds reads when tensor sizes are not exact multiples of kernel tile sizes, without placing branch code in the main summation loop.
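The idea can be sketched on the host with a row-sum over a matrix whose row count is not a multiple of the tile size. This is a simplified Python model under stated assumptions, not Tensile's implementation: `row_sums_with_pointer_shift`, `overrun`, and `read_start` are hypothetical names, and the sketch assumes the matrix has at least `tile_rows` rows so that a shifted tile always fits.

```python
def row_sums_with_pointer_shift(matrix, tile_rows):
    """Sum each row of `matrix`, processing rows in tiles of `tile_rows`.

    The last tile may extend past the end of the matrix. Instead of
    guarding each read inside the summation loop, the read start is
    shifted back so the whole tile is in bounds, and the accumulators
    are reorganized after the loop.
    """
    num_rows = len(matrix)
    depth = len(matrix[0])
    assert num_rows >= tile_rows, "sketch assumes at least one full tile of rows"
    out = [0.0] * num_rows

    for tile_start in range(0, num_rows, tile_rows):
        # How many rows this tile would read past the end (0 for full tiles).
        overrun = max(0, tile_start + tile_rows - num_rows)
        # Shift the read pointer back by the overrun, before the main loop.
        read_start = tile_start - overrun

        # Main summation loop: no bounds checks needed.
        acc = [0.0] * tile_rows
        for k in range(depth):
            for i in range(tile_rows):
                acc[i] += matrix[read_start + i][k]

        # After the main loop: the first `overrun` accumulators belong to
        # rows of the previous tile, so drop them and realign the rest.
        valid = tile_rows - overrun
        acc = acc[overrun:]
        for i in range(valid):
            out[tile_start + i] = acc[i]

    return out
```

With 5 rows and a tile of 4, the second tile reads rows 1 through 4 instead of rows 4 through 7, and only the accumulator for row 4 is written back. The shifted tile recomputes some rows already handled by the previous tile; that redundant work is the price of keeping the main loop free of branches.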
Others:
- Kernels have more flexibility as to which threads are assigned to load which elements from global memory.
- Benchmarking protocol can handle benchmarking a single kernel configuration.
- Library logic analysis can generate a library backend from a single data point, i.e., the library will consist of the single fastest kernel at that data point.
- Fixed a library logic analysis bug that occurred when a single solution is fastest for all data points.