Releases: ROCm/Tensile
v3.0.4 - Fixed NaN propagation
When Beta==0, kernels write to the C tensor without reading from it, so any NaNs already present in C are not propagated.
v3.0.0 - GlobalSplitU and Improved Benchmarking / Library Logic
GlobalSplitU: On top of LocalSplitU, Tensile now supports splitting the summation across work-groups. This option requires a beta-only kernel followed by a GEMM kernel which uses atomic compare-and-swap to accumulate results in global memory. This feature increases the number of work-groups while maintaining tile size, at the cost of slower accumulation in global memory.
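The idea can be shown with a minimal Python sketch (not Tensile's kernel code): the summation dimension K is split among several work-groups, each computes a partial product, and partials are accumulated into C; real kernels do the final accumulation in global memory with atomic compare-and-swap.

```python
# Illustrative sketch of GlobalSplitU (not Tensile source): K is split
# among `gsu` work-groups; each computes a partial C that is accumulated
# into the shared result.

def gemm_partial(A, B, k_start, k_end):
    """Partial C = A[:, k_start:k_end] @ B[k_start:k_end, :]."""
    M, N = len(A), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for k in range(k_start, k_end):
            a = A[i][k]
            for j in range(N):
                C[i][j] += a * B[k][j]
    return C

def gemm_global_split_u(A, B, gsu):
    M, N, K = len(A), len(B[0]), len(B)
    # Beta-only step: initialize C (beta == 0 here, so just zeros).
    C = [[0.0] * N for _ in range(M)]
    chunk = (K + gsu - 1) // gsu
    for wg in range(gsu):                 # each chunk is one work-group's share
        part = gemm_partial(A, B, wg * chunk, min((wg + 1) * chunk, K))
        for i in range(M):                # real kernels use atomic CAS here
            for j in range(N):
                C[i][j] += part[i][j]
    return C
```

Since the split partials sum to the full product, the result matches an unsplit GEMM while exposing `gsu` times as many work-groups.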
Improved Benchmarking / Library Logic:
- Users can perform multiple benchmark runs for a single problem type; this allows for tuning multiple problem size groups.
- Users can specify multiple problem size ranges as well as exact sizes to do training and logic generation for.
- Users can label a benchmark with a schedule name and a list of devices which the schedule supports; Tensile will choose a solution schedule based on the device.
Semantic Versioning: Users can specify a minimum Tensile version in YAML files to guarantee support and compatibility.
Expanded Work-Group and Thread-Tile Sizes: Users can explicitly specify work-group sizes and thread-tile sizes which are not powers of 2; they need not even be even.
Maximum Occupancy: For problem sizes or strides which are known to thrash the GPU caches, users can manually lower the occupancy of the work-groups to try to improve performance.
v2.4.5 - Prefetching and Half-Precision
Prefetch Global -> Local
Issues loads from global memory into LDS one full iteration in advance. This uses double the LDS but hides global memory latency better.
Prefetch Local -> Registers
Issues loads from LDS into registers one unrolled iteration in advance. This uses several extra registers but hides LDS latency better.
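Both options are instances of double buffering. A minimal Python sketch (not Tensile source) of the pattern: two buffers alternate, so the load for tile i+1 is issued before the compute on tile i, letting load latency overlap with compute.

```python
# Sketch of the double-buffering idea behind prefetching (not Tensile
# source): the prologue loads tile 0, then each loop iteration prefetches
# the next tile into the other buffer before computing on the current one.

def load_tile(data, i, tile):
    return data[i * tile:(i + 1) * tile]

def pipelined_sum(data, tile):
    n_tiles = len(data) // tile
    buffers = [None, None]
    buffers[0] = load_tile(data, 0, tile)       # prologue: prefetch tile 0
    total = 0
    for i in range(n_tiles):
        if i + 1 < n_tiles:                     # issue next load early...
            buffers[(i + 1) % 2] = load_tile(data, i + 1, tile)
        total += sum(buffers[i % 2])            # ...while computing on tile i
    return total
```

On hardware the early load and the compute actually overlap; the cost is the second buffer (extra LDS or extra registers, depending on which stage is prefetched).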
Half-Precision
"half"/__fp16 is now a supported data type.
TensileBenchmarkLibraryClient.py
This Python script takes a library-client executable and a CSV file of problem sizes as inputs, and runs the executable on those sizes.
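A hypothetical sketch of such a driver is below; the CSV layout and the command-line format are assumptions for illustration, not the script's actual interface.

```python
import csv
import io

# Hypothetical sketch of a driver like TensileBenchmarkLibraryClient.py.
# The CSV row layout (one problem size per row, e.g. M,N,K) and the
# argument format are assumptions, not the script's real interface.

def read_sizes(csv_text):
    """Parse one problem size per CSV row into a list of int lists."""
    return [[int(x) for x in row]
            for row in csv.reader(io.StringIO(csv_text)) if row]

def build_commands(client_path, sizes):
    # In a real driver each command would be passed to subprocess.run().
    return [[client_path] + [str(d) for d in size] for size in sizes]
```
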
v2.3.0 - Short-Vectors and Pointer-Shifting
Short-Vectors:
The kernels can now operate on float2* or float4* pointers, which makes reads and writes to memory denser and requires fewer registers to store addresses. When reading/writing vectors while transposing the matrix, Tensile can handle reading vectors and writing components, or reading components and writing vectors.
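A small Python sketch (not Tensile source) of the read-vectors/write-components case: each "float4" load uses a single address for four elements, and in a transpose those four components are then scattered into four different columns.

```python
# Sketch of short vectors (not Tensile source): one address yields four
# elements, so a row-major matrix can be transposed by reading vectors
# and writing out individual components.

def read_vec4(flat, addr):
    """One 'float4' load: a single address yields four elements."""
    return flat[addr:addr + 4]

def transpose_read_vectors_write_components(flat, rows, cols):
    assert cols % 4 == 0          # row length must be a whole number of vectors
    out = [[None] * rows for _ in range(cols)]
    for r in range(rows):
        for v in range(cols // 4):
            vec = read_vec4(flat, r * cols + 4 * v)   # vector read
            for c, x in enumerate(vec):               # component writes
                out[4 * v + c][r] = x
    return out
```

The vector side needs one address computation per four elements, which is where the register savings come from.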
Pointer-Shifting:
Rather than using branches to guard against reading out of bounds, Tensile can now shift the read pointers in-bounds before the main summation loop, then reorganize the accumulation registers after the main loop before writing the results. This protects against out-of-bounds reads when tensor sizes are not exact multiples of kernel tile sizes, without any branch code in the main summation loop.
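The trick can be illustrated with a one-dimensional Python sketch (not Tensile source): when the length is not a multiple of the tile size, the last tile's start is shifted back so every read is a full in-bounds read, and the overlap is resolved when results are combined.

```python
# Sketch of pointer shifting (not Tensile source): the last tile's start
# pointer is shifted back into bounds, so the inner loop never needs an
# out-of-bounds branch; overlapping elements are reconciled afterwards.

def tiled_reads_with_shift(data, tile):
    n = len(data)
    starts, tiles = [], []
    i = 0
    while i < n:
        start = min(i, n - tile)      # shift the final tile back in-bounds
        starts.append(start)
        tiles.append(data[start:start + tile])  # always a full in-bounds read
        i += tile
    return starts, tiles

def reassemble(starts, tiles, n):
    out = [None] * n
    for start, t in zip(starts, tiles):
        for k, x in enumerate(t):
            out[start + k] = x        # overlap harmlessly rewrites same values
    return out
```

For a length-10 array with tile size 4, the last tile starts at index 6 instead of 8, overlapping the previous tile by two elements; the branch-free inner loop is what the shift buys.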
Others:
- Kernels have more flexibility as to which threads are assigned to load which elements from global memory.
- Benchmarking protocol can handle benchmarking a single kernel configuration.
- Library logic analysis can handle generating a library backend from a single data point, i.e., the library will consist of the single fastest kernel at that data point.
- Library logic analysis bug fix for when only a single solution is fastest for all data points.
v2.2.3 - SplitU and WorkGroupMapping
SplitU
If you have a large summation but a small C tensor, you can create extra parallelism by splitting up the summation; this allows smaller C tensors to fill up larger GPUs.
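A quick Python sketch (illustrative only) of the parallelism arithmetic: without splitting, the work-group count is just the number of C tiles, which can be far below what a large GPU needs.

```python
# Sketch (not Tensile source) of how SplitU multiplies parallelism:
# work-groups = C tiles x split factor, each handling a slice of the
# summation dimension.

def work_group_count(M, N, tile_m, tile_n, split_u):
    tiles_m = -(-M // tile_m)   # ceiling division
    tiles_n = -(-N // tile_n)
    return tiles_m * tiles_n * split_u
```

For example, a 128x128 C with 64x64 tiles yields only 4 work-groups, far too few to occupy a GPU with dozens of compute units, while a split factor of 16 yields 64.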
WorkGroupMapping
Changes which work-groups operate on which tiles of tensor C. This can help performance by improving caching.
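A Python sketch of one such mapping (illustrative, not Tensile's exact scheme): instead of assigning consecutive work-groups across a row of tiles, walk tiles in vertical strips a few tiles tall, so work-groups launched close together touch overlapping rows of A and columns of B and hit warmer caches.

```python
# Sketch of work-group mapping (not Tensile's exact scheme): remap a
# linear work-group index to a C-tile coordinate, walking down strips
# `wgm` tiles tall instead of straight across each row.

def map_wg_to_tile(wg, tiles_m, tiles_n, wgm):
    strip = wg // (wgm * tiles_n)             # which band of strips
    rem = wg % (wgm * tiles_n)
    height = min(wgm, tiles_m - strip * wgm)  # last band may be shorter
    tile_n = rem // height
    tile_m = strip * wgm + rem % height
    return tile_m, tile_n
```

With a 4x4 tile grid and `wgm=2`, work-groups 0..3 cover tiles (0,0), (1,0), (0,1), (1,1): nearby work-groups stay in a narrow band of C rather than sweeping a whole row.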
v2.2.0 - Recursive Solution Selection Logic
Rather than choosing solutions based on size=M*N, the recursive solution selection logic (SSL) now chooses solutions based on M, N and K, by recursively partitioning the dimensions.
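A toy Python sketch (not Tensile's actual logic) of the idea: the selection logic is a tree that recursively partitions on M, N, or K until a leaf names a solution, rather than thresholding on a single M*N value.

```python
# Sketch of recursive solution selection (not Tensile source): each
# internal node splits one dimension at a threshold; leaves name kernels.

def select_solution(m, n, k, node):
    # `node` is either a solution name (leaf) or a split tuple:
    # (dim, threshold, below_node, above_node)
    if isinstance(node, str):
        return node
    dim, threshold, below, above = node
    value = {"M": m, "N": n, "K": k}[dim]
    return select_solution(m, n, k, below if value < threshold else above)

# A hypothetical tree: a large summation prefers a SplitU kernel;
# otherwise pick a tile size based on M.
tree = ("K", 4096,
        ("M", 256, "small_tile_kernel", "large_tile_kernel"),
        "split_u_kernel")
```

Because each dimension can be partitioned independently, a skinny problem (small M and N, huge K) can select a different kernel than a square problem with the same M*N.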
v2.0.0 - Benchmarking Overhaul: Faster, Simpler, Programmable
The benchmarking protocol has been completely redesigned to use config.yaml files rather than requiring applications to generate problem.xml files.
Tensile is now an installable python module.
Please read the wiki to understand all the new features.
v1.1.0 - Bug Fixes
Several bug fixes for rocBLAS.
v0.1 - Preview Release
Full support for tensor contractions for BLAS and DNN.