v1.4.4-rc1

Pre-release
@Sergei-Lebedev Sergei-Lebedev released this 15 Apr 08:06
· 1 commit to v1.4.x since this release

New Features and Enhancements

Core

  • Implemented asymmetric memory support {PR #1000}
  • Enhanced error handling and resource cleanup {PR #960, #951}
  • Improved service team handling {PR #1046}
  • Fixed triggered post for zero-size collectives {PR #960}

CL/HIER

  • Added allgatherv support {PR #1111}
  • Implemented node subgroup unpacking {PR #1103}
  • Added reduce to supported collectives {PR #997}
  • Fixed integer overflow in alltoall {PR #944}
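
The notes do not say where the alltoall overflow occurred. As a generic illustration of this class of bug (not the actual UCC code), the sketch below shows how a per-rank byte offset wraps when it is held in a signed 32-bit integer, and how a 64-bit type avoids it. The function names and parameters are hypothetical.

```python
import ctypes


def offset_32bit(rank, count, elem_size):
    """Byte offset as a signed 32-bit C int would store it (wraps on overflow)."""
    return ctypes.c_int32(rank * count * elem_size).value


def offset_64bit(rank, count, elem_size):
    """Same offset stored in an unsigned 64-bit integer."""
    return ctypes.c_uint64(rank * count * elem_size).value


if __name__ == "__main__":
    # Hypothetical case: rank 3, 200M elements per peer, 4 bytes per element.
    rank, count, elem_size = 3, 200_000_000, 4
    print("32-bit offset:", offset_32bit(rank, count, elem_size))
    print("64-bit offset:", offset_64bit(rank, count, elem_size))
```

With these inputs the true offset is 2,400,000,000 bytes, which exceeds the signed 32-bit maximum of 2,147,483,647, so the 32-bit variant yields a negative value.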

TL/UCP

  • Split single and multithreaded send/receive operations {PR #1109}
  • Added knomial allgather with CUDA memory support {PR #1095}
  • Implemented reduce SRG knomial algorithm {PR #1058}
  • Added radix selection to knomial operations {PR #1072}
  • Added sliding window allreduce implementation {PR #958}
  • Added knomial allgatherv support {PR #1008}
  • Added sparbit algorithm for allgather {PR #940}
  • Extended broadcast active set support for size > 2 {PR #926}
  • Added knomial algorithm for reduce-scatter {PR #970}
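
For readers unfamiliar with the term, the sketch below shows what a radix means in a generic k-nomial tree, using a broadcast from rank 0 as the simplest case. It is a textbook-style illustration, not the TL/UCP implementation; the function name and schedule format are assumptions made for this example.

```python
def knomial_bcast_schedule(n, radix):
    """Return a list of rounds; each round is a list of (src, dst) sends.

    Generic k-nomial tree broadcast rooted at rank 0. In each round, every
    rank that already holds the data sends to up to (radix - 1) new peers.
    """
    if n <= 1:
        return []
    if radix < 2:
        raise ValueError("radix must be at least 2")

    # Largest power of radix that is strictly less than n.
    dist = 1
    while dist * radix < n:
        dist *= radix

    rounds = []
    while dist >= 1:
        sends = []
        # Ranks that are multiples of dist * radix hold the data this round.
        for src in range(0, n, dist * radix):
            for j in range(1, radix):
                dst = src + j * dist
                if dst < n:
                    sends.append((src, dst))
        rounds.append(sends)
        dist //= radix
    return rounds


if __name__ == "__main__":
    for radix in (2, 4):
        schedule = knomial_bcast_schedule(16, radix)
        print(f"radix={radix}: {len(schedule)} rounds")
        for i, sends in enumerate(schedule):
            print(f"  round {i}: {sends}")
```

A larger radix finishes in fewer rounds but makes each sender contact more peers per round, which is the trade-off a radix selection has to balance.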

TL/MLX5

  • Added multicast-based zero-copy broadcast {PR #1087}
  • Implemented mcast multi-group support {PR #1060}
  • Added non-blocking CUDA memory copy support {PR #1040}
  • Added device memory multicast broadcast {PR #989}
  • Enhanced mcast allgather staging-based algorithm {PR #994}
  • Improved one-sided mcast reliability initialization {PR #980}
  • Applied various performance optimizations to alltoall {PR #1067}
  • Fixed fences in all-to-all WQEs {PR #1069}
  • Added context option to disable all-to-all operations {PR #1062}
  • Improved error handling and device checks {PR #1102}
  • Disabled mcast for thread multiple mode {PR #961}

TL/SHARP

  • Added support for allgather operation {PR #1081}
  • Enabled reduce-scatter with SAT support {PR #1084}
  • Added SHARP multi-channel support {PR #1049}
  • Fixed service team OOB handling {PR #1001}
  • Improved internal OOB usage {PR #986}

CUDA

  • Added linear broadcast implementation {PR #948}
  • Batched CUDA stream memory operations to reduce CPU and GPU execution overhead {PR #1093}
  • Enhanced error handling for CUDA context operations {PR #1025}
  • Fixed context cleanup in CUDA operations {PR #954}

Build and Test

  • Added support for specific GPU architectures with ROCM {PR #987}
  • Added UCC pkg-config support {PR #1036}
  • Fixed build compatibility with NVC compiler {PR #1052}
  • Enhanced config parser functionality {PR #1092}
  • Enhanced ASAN/LSAN memory leak detection {PR #1074}
  • Added error checking and exit handling in gtests {PR #1083}

Documentation

  • Updated README with UCC publication information {PR #1028}
  • Added DOCA_UROM documentation {PR #999}
  • Fixed Doxygen documentation issues {PR #1038}
  • Enhanced code style consistency {PR #1020}

CL/DOCA_UROM

  • Implemented new DOCA UROM plugin {PR #978}
  • Added support for offloading collective operations to DPUs
  • Implemented allreduce collective