v1.4.4-rc1
Pre-release
Pre-release
·
1 commit
to v1.4.x
since this release
New Features and Enhancements
Core
- Implemented asymmetric memory support {PR #1000}
- Enhanced error handling and resource cleanup {PR #960, #951}
- Improved service team handling {PR #1046}
- Fixed triggered post for zero size collectives {PR #960}
CL/HIER
- Added allgatherv support {PR #1111}
- Implemented node subgroup unpacking {PR #1103}
- Added reduce to supported collectives {PR #997}
- Fixed integer overflow in alltoall {PR #944}
TL/UCP
- Split single and multithreaded send/receive operations {PR #1109}
- Added knomial allgather with CUDA memory support {PR #1095}
- Implemented reduce SRG knomial algorithm {PR #1058}
- Added radix selection to knomial operations {PR #1072}
- Added sliding window allreduce implementation {PR #958}
- Added knomial allgatherv support {PR #1008}
- Added sparbit algorithm for allgather {PR #940}
- Extended broadcast active set support for size > 2 {PR #926}
- Added knomial algorithm for reduce-scatter {PR #970}
TL/MLX5
- Added multicast-based zero-copy broadcast {PR #1087}
- Implemented mcast multi-group support {PR #1060}
- Added non-blocking CUDA memory copy support {PR #1040}
- Added device memory multicast broadcast {PR #989}
- Enhanced mcast allgather staging-based algorithm {PR #994}
- Improved one-sided mcast reliability initialization {PR #980}
- Various performance optimizations in alltoall {PR #1067}
- Fixed fences in all-to-all WQEs {PR #1069}
- Added context option to disable all-to-all operations {PR #1062}
- Improved error handling and device checks {PR #1102}
- Disabled mcast for thread multiple mode {PR #961}
TL/SHARP
- Added support for allgather operation {PR #1081}
- Enabled reduce-scatter with SAT support {PR #1084}
- Added SHARP multi-channel support {PR #1049}
- Fixed service team OOB handling {PR #1001}
- Improved internal OOB usage {PR #986}
CUDA
- Added linear broadcast implementation {PR #948}
- Batch CUDA stream memory operations, reduced CPU and GPU execution overhead {PR #1093}
- Enhanced error handling for CUDA context operations {PR #1025}
- Fixed context cleanup in CUDA operations {PR #954}
Build and Test
- Added support for specific GPU architectures with ROCM {PR #987}
- Added UCC pkg-config support {PR #1036}
- Fixed build compatibility with NVC compiler {PR #1052}
- Enhanced config parser functionality {PR #1092}
- Enhanced ASAN/LSAN memory leak detection {PR #1074}
- Added error checking and exit handling in gtests {PR #1083}
Documentation
- Updated README with UCC publication information {PR #1028}
- Added DOCA_UROM documentation {PR #999}
- Fixed Doxygen documentation issues {PR #1038}
- Enhanced code style consistency {PR #1020}
CL/DOCA_UROM
- Implemented new DOCA UROM plugin {PR #978}
- Added support for offloading collective operations to DPUs
- Implemented allreduce collective