v1.4.4-rc1

Pre-release

Sergei-Lebedev released this 15 Apr 08:06

d733d17

New Features and Enhancements

Core

Implemented asymmetric memory support {PR #1000}
Enhanced error handling and resource cleanup {PR #960, #951}
Improved service team handling {PR #1046}
Fixed triggered post for zero-size collectives {PR #960}

CL/HIER

Added allgatherv support {PR #1111}
Implemented node subgroup unpacking {PR #1103}
Added reduce to supported collectives {PR #997}
Fixed integer overflow in alltoall {PR #944}
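
The notes do not say where the alltoall overflow occurred, so the following is only a generic illustration of this class of bug, not the change made in PR #944. In an alltoall, buffer offsets are typically a product of a peer index, an element count, and a datatype size, and that product can exceed the 32-bit range when computed in `int`. The helper name `block_offset` and its parameters are hypothetical and are not part of the UCC API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a UCC function): byte offset of the block
 * destined for `peer` when every rank exchanges `count` elements of
 * `dt_size` bytes with every other rank. */
static size_t block_offset(size_t peer, size_t count, size_t dt_size)
{
    /* Guard the multiplications so a wrap is caught rather than silently
     * producing a small, wrong offset. */
    assert(dt_size == 0 || count <= SIZE_MAX / dt_size);
    size_t block = count * dt_size;

    assert(block == 0 || peer <= SIZE_MAX / block);
    /* All operands are size_t, so on a 64-bit platform the product is
     * computed in 64 bits instead of in 32-bit int arithmetic. */
    return peer * block;
}
```

With `int` operands, a call equivalent to `block_offset(3, 1 << 30, 4)` would overflow, because the block size alone is 2^32 bytes. Widening the operands before multiplying avoids that on 64-bit platforms.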

TL/UCP

Split single and multithreaded send/receive operations {PR #1109}
Added knomial allgather with CUDA memory support {PR #1095}
Implemented reduce SRG knomial algorithm {PR #1058}
Added radix selection to knomial operations {PR #1072}
Added sliding window allreduce implementation {PR #958}
Added knomial allgatherv support {PR #1008}
Added sparbit algorithm for allgather {PR #940}
Extended broadcast active set support for size > 2 {PR #926}
Added knomial algorithm for reduce-scatter {PR #970}
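
Several of the entries above refer to knomial algorithms and radix selection. As background, here is a minimal sketch of how exchange peers can be derived in a generic recursive k-ing pattern, assuming the team size is an exact power of the radix. It is not taken from the UCC sources. The function name `knomial_peers` is hypothetical, and a real implementation must also handle team sizes that are not a power of the radix.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a UCC function): compute the peers of `rank`
 * at exchange step `step` for a knomial pattern with the given radix.
 * Writes radix - 1 peers into `peers` and returns that count.
 * Assumes the team size is an exact power of `radix`. */
static size_t knomial_peers(size_t rank, size_t radix, size_t step,
                            size_t *peers)
{
    assert(radix >= 2);

    /* Distance between peers at this step is radix^step. */
    size_t dist = 1;
    for (size_t i = 0; i < step; i++) {
        dist *= radix;
    }

    /* Treat the rank as a base-radix number and vary only the digit
     * at position `step`; every other digit stays fixed. */
    size_t digit = (rank / dist) % radix;
    size_t base  = rank - digit * dist;
    size_t n     = 0;

    for (size_t j = 1; j < radix; j++) {
        peers[n++] = base + ((digit + j) % radix) * dist;
    }
    return n;
}
```

With radix 2 this reduces to the familiar recursive-doubling pattern, where the single peer at step `s` is `rank XOR 2^s`. A larger radix trades fewer steps for more peers per step, which is one reason an implementation might expose the radix as a tunable.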

TL/MLX5

Added multicast-based zero-copy broadcast {PR #1087}
Implemented mcast multi-group support {PR #1060}
Added non-blocking CUDA memory copy support {PR #1040}
Added device memory multicast broadcast {PR #989}
Enhanced mcast allgather staging-based algorithm {PR #994}
Improved one-sided mcast reliability initialization {PR #980}
Applied various performance optimizations to alltoall {PR #1067}
Fixed fences in alltoall WQEs {PR #1069}
Added context option to disable alltoall operations {PR #1062}
Improved error handling and device checks {PR #1102}
Disabled mcast in thread-multiple mode {PR #961}

TL/SHARP

Added support for allgather operation {PR #1081}
Enabled reduce-scatter with SAT support {PR #1084}
Added SHARP multi-channel support {PR #1049}
Fixed service team OOB handling {PR #1001}
Improved internal OOB usage {PR #986}

CUDA

Added linear broadcast implementation {PR #948}
Batched CUDA stream memory operations to reduce CPU and GPU execution overhead {PR #1093}
Enhanced error handling for CUDA context operations {PR #1025}
Fixed context cleanup in CUDA operations {PR #954}

Build and Test

Added support for specific GPU architectures with ROCm {PR #987}
Added UCC pkg-config support {PR #1036}
Fixed build compatibility with NVC compiler {PR #1052}
Enhanced config parser functionality {PR #1092}
Enhanced ASAN/LSAN memory leak detection {PR #1074}
Added error checking and exit handling in gtests {PR #1083}

Documentation

Updated README with UCC publication information {PR #1028}
Added DOCA_UROM documentation {PR #999}
Fixed Doxygen documentation issues {PR #1038}
Enhanced code style consistency {PR #1020}

CL/DOCA_UROM

Implemented new DOCA UROM plugin {PR #978}
Added support for offloading collective operations to DPUs
Implemented allreduce collective
