BLIS 0.9.0
This release contains a slew of improvements, new kernels and APIs, bugfixes, and more (including lots of code reduction). It also contains foundational support for an exciting new class of expert functionality: creating new operations without the need to duplicate the middleware that sits between the API and kernels.
Improvements present in 0.9.0:
Framework:
- Added various fields to `obj_t` that relate to storing function pointers to custom `packm` kernels, microkernels, etc., as well as accessor functions to set and query those fields. (Devin Matthews)
- Enabled user-customized `packm` microkernels and variants via the aforementioned new `obj_t` fields. (Devin Matthews)
- Moved edge-case handling out of the macrokernel and into the `gemm` and `gemmtrsm` microkernels. This also required updating the APIs and definitions of all existing microkernels in the `kernels` directory. Edge-case handling functionality is now facilitated via new preprocessor macros found in `bli_edge_case_macro_defs.h`. (Devin Matthews)
- Avoid `gemmsup` thread barriers when not packing A or B. This boosts performance for many small multithreaded problems. (Field Van Zee, AMD)
- Allow the 1m method to operate normally when single and double real-domain microkernels mix row and column I/O preference. (Field Van Zee, Devin Matthews, RuQing Xu)
- Removed support for execution of complex-domain level-3 operations via the 3m and 4m methods.
- Refactored `herk`, `her2k`, `syrk`, `syr2k` in terms of `gemmt`. (Devin Matthews)
- Defined `setijv` and `getijv` to set/get vector elements.
- Defined `eqsc`, `eqv`, and `eqm` operations to test equality between two scalars, vectors, or matrices.
- Added new bounds checking to `setijm` and `getijm` to prevent use of negative indices.
- Renamed `membrk` files/variables/functions to `pba`.
- Store error-checking level as a thread-local variable. (Devin Matthews)
- Add `err_t*` "return" parameter to `bli_malloc_*()` and friends.
- Switched internal mutexes of the `sba` and `pba` to static initialization.
- Changed return value method of `bli_pack_get_pack_a()`, `bli_pack_get_pack_b()`.
- Fixed a bug so that `bli_init()` can be called more than once without segfaulting. (@lschork2, Minh Quan Ho, Devin Matthews)
- Removed a sanity check in `bli_pool_finalize()` that prevented BLIS from being re-initialized. (AMD)
- Fixed insufficient `pool_t`-growing logic in `bli_pool.c`, and always allocate at least one element in `.block_ptrs` array. (Minh Quan Ho)
- Cleanups related to the error message array in `bli_error.c`. (Minh Quan Ho)
- Moved language-related definitions from `bli_macro_defs.h` to a new header, `bli_lang_defs.h`.
- Renamed `BLIS_SIMD_NUM_REGISTERS` to `BLIS_SIMD_MAX_NUM_REGISTERS` and `BLIS_SIMD_SIZE` to `BLIS_SIMD_MAX_SIZE` for improved clarity. (Devin Matthews)
- Many minor bugfixes.
- Many cleanups, including removal of old and commented-out code.
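
To illustrate what moving edge-case handling into the microkernel means, here is a minimal sketch in plain C. It is not the BLIS microkernel API: `toy_dgemm_ukr`, `TOY_MR`, and `TOY_NR` are hypothetical names, and the sketch assumes the packed micropanels are zero-padded out to the full tile size. The kernel receives the actual `m` and `n` of the tile and stores only that leading part of the result, so the caller no longer needs a temporary buffer for partial tiles.

```c
#include <stddef.h>

/* Hypothetical register-tile dimensions; real kernels pick these per architecture. */
enum { TOY_MR = 4, TOY_NR = 4 };

/*
 * Toy reference microkernel: C(0:m, 0:n) := beta * C + alpha * A * B.
 * a is a packed micropanel of k columns, each TOY_MR elements long.
 * b is a packed micropanel of k rows, each TOY_NR elements long.
 * The caller passes m <= TOY_MR and n <= TOY_NR.
 */
void toy_dgemm_ukr(size_t m, size_t n, size_t k,
                   double alpha, const double *a, const double *b,
                   double beta, double *c, size_t rs_c, size_t cs_c)
{
    double ab[TOY_MR][TOY_NR] = {{0.0}};

    /* Accumulate the full tile; assumes a and b are zero-padded to full size. */
    for (size_t p = 0; p < k; ++p)
        for (size_t i = 0; i < TOY_MR; ++i)
            for (size_t j = 0; j < TOY_NR; ++j)
                ab[i][j] += a[p * TOY_MR + i] * b[p * TOY_NR + j];

    /* Edge case handled here: only the leading m x n part is written to C. */
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j) {
            double *cij = &c[i * rs_c + j * cs_c];
            *cij = beta * (*cij) + alpha * ab[i][j];
        }
}
```

Calling the sketch with `m = 2`, `n = 2` on a 3x3 row-major `c` (`rs_c = 3`, `cs_c = 1`) updates the top-left 2x2 block and leaves the remaining entries untouched.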
Compatibility:
- Expanded BLAS layer to include support for `?axpby_()` and `?gemm_batch_()`. (Meghana Vankadari, AMD)
- Added `gemm3m` APIs to BLAS and CBLAS layers. (Bhaskar Nallani, AMD)
- Handle `?gemm_()` invocations where m or n is unit by calling `?gemv_()`. (Dipal M Zambare, AMD)
- Removed option to finalize BLIS after every BLAS call.
- Updated default definitions of `bli_slamch()` and `bli_dlamch()` to use constants from standard C library rather than values computed at runtime. (Devin Matthews)
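
As a rough illustration of the last item, the sketch below returns double-precision machine parameters from `<float.h>` constants instead of probing the hardware at runtime. It is not the BLIS source: `toy_dlamch` is a hypothetical name, the letter codes follow the LAPACK `dlamch` convention, and round-to-nearest is assumed (hence epsilon is half of `DBL_EPSILON`).

```c
#include <ctype.h>
#include <float.h>

/*
 * Toy stand-in for dlamch(): machine parameters taken from <float.h>.
 * Assumes round-to-nearest arithmetic. Unknown codes return 0.0.
 */
double toy_dlamch(char cmach)
{
    const double eps = DBL_EPSILON * 0.5; /* relative machine epsilon */

    switch (toupper((unsigned char)cmach)) {
    case 'E': return eps;                      /* epsilon */
    case 'S': {                                /* safe minimum: 1/sfmin does not overflow */
        double sfmin = DBL_MIN;
        double small = 1.0 / DBL_MAX;
        if (small >= sfmin) sfmin = small * (1.0 + eps);
        return sfmin;
    }
    case 'B': return (double)FLT_RADIX;        /* base */
    case 'P': return eps * (double)FLT_RADIX;  /* precision: eps * base */
    case 'N': return (double)DBL_MANT_DIG;     /* digits in the mantissa */
    case 'R': return 1.0;                      /* 1.0 when rounding occurs in addition */
    case 'M': return (double)DBL_MIN_EXP;      /* minimum exponent */
    case 'U': return DBL_MIN;                  /* underflow threshold */
    case 'L': return (double)DBL_MAX_EXP;      /* maximum exponent */
    case 'O': return DBL_MAX;                  /* overflow threshold */
    default:  return 0.0;
    }
}
```

Because every value is a compile-time constant, a function like this is deterministic and has no first-call setup cost.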
Kernels:
- Added 512-bit SVE-based `a64fx` subconfiguration that uses empirically-tuned blocksizes. (Stepan Nassyr, RuQing Xu)
- Added a vector-length agnostic `armsve` subconfig that computes blocksizes via an analytical model. (Stepan Nassyr)
- Added vector-length agnostic d/s/sh `gemm` kernels for Arm SVE. (Stepan Nassyr)
- Added `gemmsup` kernels to the `armv8a` kernel set for use in new Apple Firestorm subconfiguration. (RuQing Xu)
- Added 512-bit SVE `dpackm` kernels (16xk and 10xk) with in-register transpose. (RuQing Xu)
- Extended 256-bit SVE `dpackm` kernels by Linaro Ltd. to 512-bit for size 12xk. (RuQing Xu)
- Reorganized register usage in `bli_gemm_armv8a_asm_d6x8.c` to accommodate clang. (RuQing Xu)
- Added `saxpyf`/`daxpyf`/`caxpyf` kernels to `zen` kernel set. (Dipal M Zambare, AMD)
- Added `vzeroupper` instruction to `haswell` microkernels. (Devin Matthews)
- Added explicit `beta == 0` handling in s/d `armsve` and `armv7a` `gemm` microkernels. (Devin Matthews)
- Added a unique tag to branch labels to accommodate clang. (Devin Matthews, Jeff Hammond)
- Fixed a copy-paste bug in the loading of `kappa_i` in the two assembly `cpackm` kernels in `haswell` kernel set. (Devin Matthews)
- Fixed a bug in Mx1 `gemmsup` `haswell` kernels whereby the `vhaddpd` instruction is used with uninitialized registers. (Devin Matthews)
- Fixed a bug in the `power10` microkernel I/O. (Nicholai Tukanov)
- Many other Arm kernel updates and fixes. (RuQing Xu)
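
A short sketch of why explicit `beta == 0` handling matters in a `gemm` update step: under IEEE 754 arithmetic, `0 * NaN` and `0 * Inf` are both NaN, so multiplying an uninitialized C by a zero `beta` is not the same as overwriting it. The function below is a hypothetical plain-C illustration (`toy_update_c` and its argument layout are assumptions, not BLIS kernel code) that never reads C when `beta` is zero.

```c
#include <stddef.h>

/*
 * Toy update step for an m x n tile: C := beta * C + alpha * AB.
 * ab is stored row-major with leading dimension ld_ab (hypothetical layout).
 */
void toy_update_c(size_t m, size_t n,
                  double alpha, const double *ab, size_t ld_ab,
                  double beta, double *c, size_t rs_c, size_t cs_c)
{
    if (beta == 0.0) {
        /* Overwrite C without reading it, so stale NaN/Inf values cannot leak through. */
        for (size_t i = 0; i < m; ++i)
            for (size_t j = 0; j < n; ++j)
                c[i * rs_c + j * cs_c] = alpha * ab[i * ld_ab + j];
    } else {
        /* General case: scale the existing C and accumulate. */
        for (size_t i = 0; i < m; ++i)
            for (size_t j = 0; j < n; ++j) {
                double *cij = &c[i * rs_c + j * cs_c];
                *cij = beta * (*cij) + alpha * ab[i * ld_ab + j];
            }
    }
}
```

With this branch in place, a C buffer filled with NaN and `beta = 0` yields exactly `alpha * AB`.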
Extras:
- Added support for addons, which are similar to sandboxes but do not require the user to implement any particular operation.
- Added a new `gemmlike` sandbox to allow rapid prototyping of `gemm`-like operations.
- Various updates and improvements to the `power10` sandbox, including a new testsuite. (Nicholai Tukanov)
Build system:
- Added explicit support for AMD's Zen3 microarchitecture. (Dipal M Zambare, AMD, Field Van Zee)
- Added runtime microarchitecture detection for Arm. (Dave Love, RuQing Xu, Devin Matthews)
- Added a new `configure` option `--[en|dis]able-amd-frame-tweaks` that allows BLIS to compile certain framework files (each with the `_amd` suffix) that have been customized by AMD for improved performance (provided that the targeted configuration is eligible). By default, the more portable counterparts to these files are compiled. (Field Van Zee, AMD)
- Added an explicit compiler predicate (`is_win`) for Windows in `configure`. (Devin Matthews)
- Use `-march=haswell` instead of `-march=skylake-avx512` on Windows. (Devin Matthews, @h-vetinari)
- Fixed `configure` breakage on MacOSX by accepting either `clang` or `LLVM` in vendor string. (Devin Matthews)
- Blacklist clang10/gcc9 and older for `armsve` subconfig.
- Added a `configure` option to control whether or not to use `@rpath`. (Devin Matthews)
- Added armclang detection to `configure`. (Devin Matthews)
- Use `@path`-based install name on MacOSX and use relocatable `RPATH` entries for testsuite binaries. (Devin Matthews)
- For environment variables `CC`, `CXX`, `FC`, `PYTHON`, `AR`, and `RANLIB`, `configure` will now print an error message and abort if a user specifies a specific tool and that tool is not found. (Field Van Zee, Devin Matthews)
- Added symlink to `blis.pc.in` for out-of-tree builds. (Andrew Wildman)
- Register optimized real-domain `copyv`, `setv`, and `swapv` kernels in `zen` subconfig. (Dipal M Zambare, AMD)
- Added Apple Firestorm (A14/M1) subconfiguration, `firestorm`. (RuQing Xu)
- Added `armsve` subconfig to `arm64` configuration family. (RuQing Xu)
- Allow using clang with the `thunderx2` subconfiguration. (Devin Matthews)
- Fixed a subtle substitution bug in `configure`. (Chengguo Sun)
- Updated top-level Makefile to reflect a dependency on the "flat" `blis.h` file for the BLIS and BLAS testsuite objects. (Devin Matthews)
- Mark `xerbla_()` as a "weak" symbol on MacOSX. (Devin Matthews)
- Fixed a long-standing bug in `common.mk` whereby the header path to `cblas.h` was omitted from the compiler flags when compiling CBLAS files within BLIS.
- Added a custom-made recursive `sed` script to `build` directory.
- Minor cleanups and fixes to `configure`, `common.mk`, and others.
Testing:
- Fixed a race condition in the testsuite when the SALT option (simulate application-level threading) is enabled. (Devin Matthews)
- Test 1m method execution during `make check`. (Devin Matthews)
- Test `make install` in Travis CI. (Devin Matthews)
- Test C++ in Travis CI to make sure `blis.h` is C++-compatible. (Devin Matthews)
blis.his C++-compatible. (Devin Matthews) - Disabled SDE testing of pre-Zen microarchitectures via Travis CI.
- Added Travis CI support for testing Arm SVE. (RuQing Xu)
- Updated SDE usage so that it is downloaded from a separate repository (ci-utils) in our GitHub organization. (Field Van Zee, Devin Matthews)
- Updated octave scripts in `test/3` to be robust against missing datasets as well as to fix a few minor issues.
- Added `test_axpbyv.c` and `test_gemm_batch.c` test driver files to `test` directory. (Meghana Vankadari, AMD)
- Support all four datatypes in `her`, `her2`, `herk`, and `her2k` drivers in `test` directory. (Madan mohan Manokar, AMD)
Documentation:
- Added documentation for: `setijv`, `getijv`, `eqsc`, `eqv`, `eqm`.
- Added `docs/Addons.md`.
- Added dedicated "Performance" and "Example Code" sections to `README.md`.
- Updated `README.md`.
- Updated `docs/Sandboxes.md`.
- Updated `docs/Multithreading.md`. (Devin Matthews)
- Updated `docs/KernelHowTo.md`.
- Updated `docs/Performance.md` to report Fujitsu A64fx (512-bit SVE) results. (RuQing Xu)
- Updated `docs/Performance.md` to report Graviton2 Neoverse N1 results. (Nicholai Tukanov)
- Updated `docs/FAQ.md` with new questions.
- Fixed typos in `docs/FAQ.md`. (Gaëtan Cassiers)
- Various other minor fixes.