Merge pull request #5065 from martin-frbg/changelog0329

martin-frbg · web-flow · commit 7f5b703a800e · 2025-01-12T04:36:39.000-08:00
Update the Changelog for version 0.3.29
diff --git a/Changelog.txt b/Changelog.txt
@@ -1,4 +1,99 @@
 OpenBLAS ChangeLog
+====================================================================
+Version 0.3.29
+12-Jan-2025
+
+general:
+ - fixed a potential NULL pointer dereference in multithreaded builds
+ - added function aliases for GEMMT using its new name GEMMTR adopted by Reference-BLAS
+ - fixed a build failure when building without LAPACK_DEPRECATED functions
+ - the minimum required CMake version for CMake-based builds was raised to 3.16.0 in order 
+   to remove many compatibility and deprecation warnings
+ - added more detailed CMake rules for OpenMP builds (mainly to support recent LLVM)
+ - fixed the behavior of the recently added CBLAS_?GEMMT functions with row-major data
+ - improved thread scaling of multithreaded SBGEMV 
+ - improved thread scaling of multithreaded TRTRI 
+ - fixed compilation of the CBLAS testsuite with gcc14 (and no Fortran compiler)
+ - added support for option handling changes in flang-new from LLVM18 onwards 
+ - added support for recent calling convention changes in Cray and NVIDIA compilers
+ - added support for compilation with the NAG Fortran compiler
+ - fixed placement of the -fopenmp flag and libsuffix in the generated pkgconfig file
+ - improved the CMakeConfig file generated by the Makefile build
+ - fixed const-correctness of cblas_?geadd in cblas.h
+ - fixed a potential inaccuracy in multithreaded BLAS3 calls
+ - fixed empty implementations of get/set_affinity that print a warning in OpenMP builds
+ - fixed function signatures for TRTRS in the converted C version of LAPACK
+ - fixed omission of several single-precision LAPACK symbols in the shared library 
+ - improved build instructions for the provided "pybench" benchmarks
+ - improved documentation, including added build instructions for WoA and HarmonyOS
+   as well as descriptions of environment variables that affect build and runtime behavior
+ - added a separate "make install_tests" target for use with cross-compilations
+ - integrated improvements and corrections from Reference-LAPACK:
+   - removed a comparison in LAPACKE ?tpmqrt that is always false (LAPACK PR 1062)
+   - fixed the leading dimension for B in tests for GGEV (LAPACK PR 1064)
+   - replaced the ?LARFT functions with a recursive implementation (LAPACK PR 1080)
+
+arm:
+ - fixed build with recent versions of the NDK (missing .type declaration of symbols)
+
+arm64:
+ - fixed a long-standing bug in the (generic) c/zgemm_beta kernel that could lead to
+   reads and writes outside the array bounds in some circumstances
+ - rewrote cpu autodetection to scan all cores and return the highest performing type
+ - improved the DGEMM performance for SVE targets and small matrix sizes
+ - improved dimension criteria for forwarding from GEMM to GEMV kernels
+ - added SVE kernels for ROT and SWAP
+ - improved SVE kernels for SGEMV and DGEMV on A64FX and NEOVERSEV1
+ - added support for using the "small matrix" kernels with CMake as well
+ - fixed compilation on Windows on Arm
+ - improved compile-time detection of SVE capability
+ - added cpu autodetection and initial support for Apple M4
+ - added support for compilation on systems running iOS
+ - added support for compilation on NetBSD ("evbarm" architecture)
+ - fixed NRM2 implementations for generic SVE targets and the Neoverse N2
+ - fixed compilation for SVE-capable targets with the NVIDIA compiler
+
+x86_64:
+ - fixed a wrong storage size in the SBGEMV kernel for Cooper Lake
+ - added cpu autodetection for Intel Granite Rapids
+ - added cpu autodetection for AMD Ryzen 5 series
+ - added optimized SOMATCOPY_CT for AVX-capable targets
+ - fixed the fallback implementation of GEMM3M in GENERIC builds
+ - tentatively re-enabled builds with the EXPRECISION option
+ - worked around a miscompilation of tests with mingw32-gfortran14
+ - added support for compilation with the Intel oneAPI 2025.0 compiler on Windows
+
+power:
+ - fixed multithreaded SBGEMM
+ - fixed a CMake build problem on POWER10
+ - improved the performance of SGEMV
+ - added vectorized implementations of SBGEMV and support for forwarding 1xN SBGEMM to them
+ - fixed illegal instructions and potential memory overflow in SGEMM on PPCG4
+ - fixed handling of NaN and Inf arguments in SSCAL and DSCAL on PPC440, G4 and 970
+ - added improved CGEMM and ZGEMM kernels for POWER10
+ - added Makefile logic to remove all optimization flags in DEBUG builds
+
+mips64:
+ - fixed compilation with gcc14
+ - fixed GEMM parameter selection for the MIPS64_GENERIC target
+ - fixed a potential build failure when compiling with OpenMP
+
+loongarch64:
+ - fixed compilation for Loongson3 with recent versions of gmake
+ - fixed a potential loss of precision in Loongson3A GEMM
+ - fixed a potential build failure when compiling with OpenMP
+ - added optimized SOMATCOPY for LASX-capable targets
+ - introduced a new cpu naming scheme while retaining compatibility
+ - added support for cross-compiling LoongArch64 targets with CMake
+ - added support for compilation with LLVM
+
+riscv64:
+ - removed thread yielding overhead caused by sched_yield
+ - replaced some non-standard intrinsics with their official names
+ - fixed and sped up the implementations of CGEMM/ZGEMM TCOPY for vector lengths 128 and 256
+ - improved the performance of SNRM2/DNRM2 for RVV1.0 targets
+ - added optimized ?OMATCOPY_CN kernels for RVV1.0 targets
+
 ====================================================================
 Version 0.3.28
  8-Aug-2024