Releases: ROCm/Tensile
v3.0.4 - Fixed NaN propagation
When Beta==0, kernels write to the C tensor without reading from it, so any NaNs already present in C are not propagated.
v3.0.0 - GlobalSplitU and Improved Benchmarking / Library Logic
GlobalSplitU: On top of LocalSplitU, Tensile now supports splitting the summation across work-groups. This option requires a beta-only kernel followed by a GEMM kernel which uses atomic compare-and-swap to accumulate results in global memory. This feature increases the number of work-groups while maintaining tile size, at the cost of slower accumulation in global memory.
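The idea can be shown with a minimal Python sketch (not Tensile's kernel code): the summation dimension K is split among several work-groups, each computes a partial product, and partials are accumulated into C; real kernels do the final accumulation in global memory with atomic compare-and-swap.

```python
# Illustrative sketch of GlobalSplitU (not Tensile source): K is split
# among `gsu` work-groups; each computes a partial C that is accumulated
# into the shared result.

def gemm_partial(A, B, k_start, k_end):
    """Partial C = A[:, k_start:k_end] @ B[k_start:k_end, :]."""
    M, N = len(A), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for k in range(k_start, k_end):
            a = A[i][k]
            for j in range(N):
                C[i][j] += a * B[k][j]
    return C

def gemm_global_split_u(A, B, gsu):
    M, N, K = len(A), len(B[0]), len(B)
    # Beta-only step: initialize C (beta == 0 here, so just zeros).
    C = [[0.0] * N for _ in range(M)]
    chunk = (K + gsu - 1) // gsu
    for wg in range(gsu):                 # each chunk is one work-group's share
        part = gemm_partial(A, B, wg * chunk, min((wg + 1) * chunk, K))
        for i in range(M):                # real kernels use atomic CAS here
            for j in range(N):
                C[i][j] += part[i][j]
    return C
```

Since the split partials sum to the full product, the result matches an unsplit GEMM while exposing `gsu` times as many work-groups.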
Improved Benchmarking / Library Logic:
- Users can perform multiple benchmark runs for a single problem type; this allows for tuning multiple problem size groups.
- Users can specify multiple problem size ranges as well as exact sizes to do training and logic generation for.
- Users can label a benchmark with a schedule name and a list of devices which the schedule supports; Tensile will choose a solution schedule based on the device.
Semantic Versioning: Users can specify a minimum Tensile version in YAML files to guarantee support and compatibility.
Expanded Work-Group and Thread-Tile Sizes: Users can explicitly specify work-group sizes and thread-tile sizes which are not powers of 2; they need not even be even.
Maximum Occupancy: For problem sizes or strides which are known to thrash the GPU caches, users can manually lower the occupancy of the work-groups to try to improve performance.
v2.4.5 - Prefetching and Half-Precision
Prefetch Global -> Local
Issues loads from global memory into LDS one full iteration in advance. This uses double the LDS but hides global memory latency better.
Prefetch Local -> Registers
Issues loads from LDS into registers one unrolled iteration in advance. This uses several extra registers but hides LDS latency better.
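Both options are instances of double buffering. A minimal Python sketch (not Tensile source) of the pattern: two buffers alternate, so the load for tile i+1 is issued before the compute on tile i, letting load latency overlap with compute.

```python
# Sketch of the double-buffering idea behind prefetching (not Tensile
# source): the prologue loads tile 0, then each loop iteration prefetches
# the next tile into the other buffer before computing on the current one.

def load_tile(data, i, tile):
    return data[i * tile:(i + 1) * tile]

def pipelined_sum(data, tile):
    n_tiles = len(data) // tile
    buffers = [None, None]
    buffers[0] = load_tile(data, 0, tile)       # prologue: prefetch tile 0
    total = 0
    for i in range(n_tiles):
        if i + 1 < n_tiles:                     # issue next load early...
            buffers[(i + 1) % 2] = load_tile(data, i + 1, tile)
        total += sum(buffers[i % 2])            # ...while computing on tile i
    return total
```

On hardware the early load and the compute actually overlap; the cost is the second buffer (extra LDS or extra registers, depending on which stage is prefetched).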
Half-Precision
"half"/__fp16 is now a supported data type.
TensileBenchmarkLibraryClient.py
This Python script takes a library-client executable and a CSV file of problem sizes as inputs, and runs the executable on those sizes.
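A hypothetical sketch of such a driver is below; the CSV layout and the command-line format are assumptions for illustration, not the script's actual interface.

```python
import csv
import io

# Hypothetical sketch of a driver like TensileBenchmarkLibraryClient.py.
# The CSV row layout (one problem size per row, e.g. M,N,K) and the
# argument format are assumptions, not the script's real interface.

def read_sizes(csv_text):
    """Parse one problem size per CSV row into a list of int lists."""
    return [[int(x) for x in row]
            for row in csv.reader(io.StringIO(csv_text)) if row]

def build_commands(client_path, sizes):
    # In a real driver each command would be passed to subprocess.run().
    return [[client_path] + [str(d) for d in size] for size in sizes]
```
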
v2.3.0 - Short-Vectors and Pointer-Shifting
Short-Vectors:
The kernels can now operate on float2* or float4* pointers, which makes reads and writes to memory denser and requires fewer registers to store addresses. When reading/writing vectors while transposing the matrix, Tensile can handle reading vectors and writing components, or reading components and writing vectors.
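A small Python sketch (not Tensile source) of the read-vectors/write-components case: each "float4" load uses a single address for four elements, and in a transpose those four components are then scattered into four different columns.

```python
# Sketch of short vectors (not Tensile source): one address yields four
# elements, so a row-major matrix can be transposed by reading vectors
# and writing out individual components.

def read_vec4(flat, addr):
    """One 'float4' load: a single address yields four elements."""
    return flat[addr:addr + 4]

def transpose_read_vectors_write_components(flat, rows, cols):
    assert cols % 4 == 0          # row length must be a whole number of vectors
    out = [[None] * rows for _ in range(cols)]
    for r in range(rows):
        for v in range(cols // 4):
            vec = read_vec4(flat, r * cols + 4 * v)   # vector read
            for c, x in enumerate(vec):               # component writes
                out[4 * v + c][r] = x
    return out
```

The vector side needs one address computation per four elements, which is where the register savings come from.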
Pointer-Shifting:
Rather than using branches to guard against reading out of bounds, Tensile can now shift the read pointers in-bounds before the main summation loop, then reorganize the accumulation registers after the main loop before writing the results. This protects against out-of-bounds reads when tensor sizes are not exact multiples of kernel tile sizes, without any branch code in the main summation loop.
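The trick can be illustrated with a one-dimensional Python sketch (not Tensile source): when the length is not a multiple of the tile size, the last tile's start is shifted back so every read is a full in-bounds read, and the overlap is resolved when results are combined.

```python
# Sketch of pointer shifting (not Tensile source): the last tile's start
# pointer is shifted back into bounds, so the inner loop never needs an
# out-of-bounds branch; overlapping elements are reconciled afterwards.

def tiled_reads_with_shift(data, tile):
    n = len(data)
    starts, tiles = [], []
    i = 0
    while i < n:
        start = min(i, n - tile)      # shift the final tile back in-bounds
        starts.append(start)
        tiles.append(data[start:start + tile])  # always a full in-bounds read
        i += tile
    return starts, tiles

def reassemble(starts, tiles, n):
    out = [None] * n
    for start, t in zip(starts, tiles):
        for k, x in enumerate(t):
            out[start + k] = x        # overlap harmlessly rewrites same values
    return out
```

For a length-10 array with tile size 4, the last tile starts at index 6 instead of 8, overlapping the previous tile by two elements; the branch-free inner loop is what the shift buys.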
Others:
- Kernels have more flexibility as to which threads are assigned to load which elements from global memory.
- Benchmarking protocol can handle benchmarking a single kernel configuration.
- Library logic analysis can handle generating a library backend from a single data point, i.e., the library will consist of the single fastest kernel at that data point.
- Library logic analysis bug fix for when only a single solution is fastest for all data points.
v2.2.3 - SplitU and WorkGroupMapping
SplitU
If you have a large summation but a small C tensor, you can create extra parallelism by splitting up the summation; this allows smaller C tensors to fill up larger GPUs.
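A quick Python sketch (illustrative only) of the parallelism arithmetic: without splitting, the work-group count is just the number of C tiles, which can be far below what a large GPU needs.

```python
# Sketch (not Tensile source) of how SplitU multiplies parallelism:
# work-groups = C tiles x split factor, each handling a slice of the
# summation dimension.

def work_group_count(M, N, tile_m, tile_n, split_u):
    tiles_m = -(-M // tile_m)   # ceiling division
    tiles_n = -(-N // tile_n)
    return tiles_m * tiles_n * split_u
```

For example, a 128x128 C with 64x64 tiles yields only 4 work-groups, far too few to occupy a GPU with dozens of compute units, while a split factor of 16 yields 64.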
WorkGroupMapping
Changes which work-groups operate on which tiles of tensor C. This can help performance by improving caching.
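A Python sketch of one such mapping (illustrative, not Tensile's exact scheme): instead of assigning consecutive work-groups across a row of tiles, walk tiles in vertical strips a few tiles tall, so work-groups launched close together touch overlapping rows of A and columns of B and hit warmer caches.

```python
# Sketch of work-group mapping (not Tensile's exact scheme): remap a
# linear work-group index to a C-tile coordinate, walking down strips
# `wgm` tiles tall instead of straight across each row.

def map_wg_to_tile(wg, tiles_m, tiles_n, wgm):
    strip = wg // (wgm * tiles_n)             # which band of strips
    rem = wg % (wgm * tiles_n)
    height = min(wgm, tiles_m - strip * wgm)  # last band may be shorter
    tile_n = rem // height
    tile_m = strip * wgm + rem % height
    return tile_m, tile_n
```

With a 4x4 tile grid and `wgm=2`, work-groups 0..3 cover tiles (0,0), (1,0), (0,1), (1,1): nearby work-groups stay in a narrow band of C rather than sweeping a whole row.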
v2.2.0 - Recursive Solution Selection Logic
Rather than choosing solutions based on size=M*N, the recursive solution selection logic (SSL) now chooses solutions based on M, N and K, by recursively partitioning the dimensions.
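A toy Python sketch (not Tensile's actual logic) of the idea: the selection logic is a tree that recursively partitions on M, N, or K until a leaf names a solution, rather than thresholding on a single M*N value.

```python
# Sketch of recursive solution selection (not Tensile source): each
# internal node splits one dimension at a threshold; leaves name kernels.

def select_solution(m, n, k, node):
    # `node` is either a solution name (leaf) or a split tuple:
    # (dim, threshold, below_node, above_node)
    if isinstance(node, str):
        return node
    dim, threshold, below, above = node
    value = {"M": m, "N": n, "K": k}[dim]
    return select_solution(m, n, k, below if value < threshold else above)

# A hypothetical tree: a large summation prefers a SplitU kernel;
# otherwise pick a tile size based on M.
tree = ("K", 4096,
        ("M", 256, "small_tile_kernel", "large_tile_kernel"),
        "split_u_kernel")
```

Because each dimension can be partitioned independently, a skinny problem (small M and N, huge K) can select a different kernel than a square problem with the same M*N.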
v2.0.0 - Benchmarking Overhaul: Faster, Simpler, Programmable
The benchmarking protocol has been completely redesigned to use config.yaml files rather than requiring applications to generate problem.xml files.
Tensile is now an installable python module.
Please read the wiki to understand all the new features.
v1.1.0 - Bug Fixes
Several bug fixes for rocBLAS.
v0.1 - Preview Release
Full support for tensor contractions for BLAS and DNN.