Releases: ROCm/Tensile
Tensile 4.34.0 for ROCm 5.3.0
Added
- Lazy loading of solution libraries and code object files
- Support for dictionary style logic files
- Support for decision tree based logic files using dictionary format
- DecisionTreeLibrary for solution selection
- DirectToLds support for HGEMM
- DirectToVgpr support for SGEMM
- Grid based distance metric for solution selection
- Support for gfx11xx
- Support for DirectToVgprA/B + TLU=False
- ForkParameters Groups as a way of specifying solution parameters
- Support for a new Tensile yaml config format
- TensileClientConfig for generating Tensile client config files
- Options for TensileCreateLibrary to build client and create client config file
Optimized
- Solution generation is now cached and is not repeated if solution parameters are unchanged
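The caching idea in the item above can be sketched as follows. This is a minimal illustration, not Tensile's actual implementation: the names `params_fingerprint` and `SolutionCache`, and the parameter keys used with them, are hypothetical. It assumes the parameters are JSON-serializable, so that a stable hash of them can serve as the cache key.

```python
import hashlib
import json


def params_fingerprint(params):
    """Return a stable hash of a JSON-serializable parameter dict.

    Hypothetical helper: sorting the keys makes the hash independent
    of the order in which the parameters were written.
    """
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class SolutionCache:
    """Hypothetical in-memory cache keyed by the parameter fingerprint."""

    def __init__(self):
        self._store = {}
        self.generate_calls = 0  # counts how often generation actually ran

    def get(self, params, generate):
        """Return the cached result for params, generating it only on a miss."""
        key = params_fingerprint(params)
        if key not in self._store:
            self.generate_calls += 1
            self._store[key] = generate(params)
        return self._store[key]
```

Calling `get` twice with the same parameters runs the generator once; changing any parameter value produces a new fingerprint and triggers regeneration.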
Changed
- Default MACInstruction to FMA
Fixed
- Accept StaggerUStride=0 as valid
- Reject invalid data types for UnrollLoopEfficiencyEnable
- Fix invalid code generation issues related to DirectToVgpr
- Return hipErrorNotFound if no modules are loaded
- Fix performance drop for NN ZGEMM with 96x64 macro tile
- Fix memory violation for general batched kernels when alpha/beta/K = 0
Tensile 4.33.0 for ROCm 5.2.3
Tensile code for ROCm 5.2.3 did not change. The library was rebuilt for the updated ROCm 5.2.3 stack.
Tensile 4.33.0 for ROCm 5.2.1
Tensile code for ROCm 5.2.1 did not change. The library was rebuilt for the updated ROCm 5.2.1 stack.
Tensile 4.33.0 for ROCm 5.2.0
Added
- TensileUpdateLibrary for updating old library logic files
- Support for TensileRetuneLibrary to use sizes from separate file
- ZGEMM DirectToVgpr/DirectToLds/StoreCInUnroll/MIArchVgpr support
- Tests for denorm correctness
- Option to write different architectures to different TensileLibrary files
Optimized
- Optimize MessagePackLoadLibraryFile by switching to fread
- DGEMM tail loop optimization for PrefetchAcrossPersistentMode=1/DirectToVgpr
Changed
- Alpha/beta datatype remains as F32 for HPA HGEMM
- Force assembly kernels to not flush denorms
- Use hipDeviceAttributePhysicalMultiProcessorCount as multiProcessorCount
Fixed
- Fix segmentation fault when running the i8 datatype with TENSILE_DB=0x80
Tensile 4.32.0 for ROCm 5.1.3
Tensile code for ROCm 5.1.3 did not change. The library was rebuilt for the updated ROCm 5.1.3 stack.
Tensile 4.32.0 for ROCm 5.1.1
Tensile code for ROCm 5.1.1 did not change. The library was rebuilt for the updated ROCm 5.1.1 stack.
Tensile 4.32.0 for ROCm 5.1.0
Added
- Finer control of parallelism to limit memory usage
- Support for multiprocessing on Windows for TensileCreateLibrary
- New JSD metric and metric selection functionality
- Initial changes to support two-tier solution selection
Optimized
- Optimized runtime of TensileCreateLibraries by reducing max RAM usage
- StoreCInUnroll additional optimizations plus adaptive K support
- DGEMM NN optimizations with PrefetchGlobalRead(PGR)=2 support
Changed
- Update Googletest to 1.11.0
Removed
- Remove no longer supported benchmarking steps
Tensile 4.31.0 for ROCm 5.0.2
Tensile code for ROCm 5.0.2 is unchanged from Tensile for ROCm 5.0.1. The library was rebuilt for the updated ROCm 5.0.2 stack.
Tensile 4.31.0 for ROCm 5.0.1
Tensile code for ROCm 5.0.1 is unchanged from Tensile for ROCm 5.0.0. The library was rebuilt for the updated ROCm 5.0.1 stack.
Tensile 4.31.0 for ROCm 5.0.0
Added
- DirectToLds support (x2/x4)
- DirectToVgpr support for DGEMM
- Parameter to control the number of files into which kernels are merged, to better parallelize kernel compilation
- FP16 alternate implementation for HPA HGEMM on aldebaran
Optimized
- Add DGEMM NN custom kernel for HPL on aldebaran
Changed
- Update tensile_client executable to std=c++14
Removed
- Remove unused old Tensile client code
Fixed
- Fix hipErrorInvalidHandle during benchmarks
- Fix addrVgpr for atomic GSU
- Fix for Python 3.8: add case for Constant nodeType
- Fix architecture mapping for gfx1011 and gfx1012
- Fix PrintSolutionRejectionReason verbiage in KernelWriter.py
- Fix vgpr alignment problem when enabling flat buffer load