Skip to content

Conversation

@saifhhasan
Copy link

Background

Bootstrap operations enable Communicator to initialize. It happens as part of startup of a Job when PG/Communicators are created.

Bootstrap offers APIs to perform I/O over Sockets (e.g. AllGathers, Barrier, Send/Recv). For large scale jobs the order of bootstrap operations would have higher skew (e.g. late joiners) as well as overall I/O (number of messages & data-size) is large.

In our large scale training, we observed that Bootstrap chokes CPU at 100% (one core per process) and lasts much longer as job scaled. This competes with the other CPU intensive resources especially during startup phase of the job.

We found this by tracing CPU for the duration of Job's initialization. And socket.cc being the highest contributor for the same.

Screenshot 2025-07-01 at 11 06 42 AM

Fix

For non-blocking Socket Read/Writes, the function socketProgress which in-turn calls socketProgressOpt will return if there are no bytes to READ or WRITE. And socketWait will loop the socketProgress again causing the busy read/write loop.

The fix is to use poll for socketWait, that'll wait until the socket becomes readable / writable to avoid busy loop, and also permit us to abort the operation within 1ms duration. This is similar to what the existing code do for socketPollConnect.

@sjeaugey
Copy link
Member

sjeaugey commented Jul 2, 2025

Thanks Saif. I think I'd simplify / rework this a little bit to make it less intrusive in the main code:

  • Extract the poll call to a separate function.
  • Remove the op == NCCL_SOCKET_SEND || op == NCCL_SOCKET_RECV given those are the only two options.
  • Also add a NCCL_PARAM to control whether we want socket wait operations to be blocking or actively polling.

So something like:

NCCL_PARAM(SockWaitSleep, "SOCK_WAIT_SLEEP", 1);

static void sleepOnFd(int fd, int op) {
 struct pollfd pfd;
 int timeout = 1;
 pfd.fd = fd;
 pfd.events =  op == NCCL_SOCKET_SEND ? POLLOUT : POLLIN;
 poll(&pfd, 1, timeout);
}

static ncclResult_t socketWait(int op, struct ncclSocket* sock, void* ptr, int size, int* offset) {
 while (*offset < size) {
   NCCLCHECK(socketProgress(op, sock, ptr, size, offset));
   if (*offset < size && ncclParamSockWaitSleep()) sleepOnFd(sock->fd);
 }
 return ncclSuccess;
}

Let me know what you think. I didn't test the code (it may be broken), I only wrote it to illustrate the idea.

@saifhhasan
Copy link
Author

Hi @sjeaugey, thank you for the feedback. Let me iterate the pull request to address your comments. I'll name param as SOCKET_POLL_TIMEOUT_MS just to be semantically correct as it won't really sleep but rather wait until that timeout before returning.

p.s. Pardon me for delayed response as I've been on a leave for a while.

@saifhhasan saifhhasan force-pushed the meta-saif-socket-poll branch from 218f618 to 7b90715 Compare July 14, 2025 18:46
@saifhhasan
Copy link
Author

Hi @sjeaugey, I have updated the PR as per your suggestion.

  • Introduced a new param
  • Refactored out the socket poll in a helper function
  • Retested by running AllReduce perf exercising this code path

@sjeaugey
Copy link
Member

That patch looks pretty clean, plus the parameter to set the sleep time is nice. I don't think we'd enable it by default right away, but other than that, it seems like a good idea.

Build and run nccl_allreduce_perf ensuring Initialization path is
exercised for testing.
@saifhhasan saifhhasan force-pushed the meta-saif-socket-poll branch from 7b90715 to 3f5073b Compare July 15, 2025 17:52
@saifhhasan
Copy link
Author

Sounds good @sjeaugey .. Set the param 0 to disable by default. Thank you for helping review the diffs.

@saifhhasan
Copy link
Author

Hi @sjeaugey - Wondering if you can help take a look at this patch and get it going. Happy to iterate to ensure it adheres to NCCL team's practices and expectations.

@marksantesson
Copy link
Collaborator

marksantesson commented Jul 21, 2025

Hi @saifhhasan , I'm on the NCCL team and this is on my list of tasks to take up soon.

@marksantesson
Copy link
Collaborator

@saifhhasan, this has been merged into the v2.29 branch. Unfortunately, I did not get it approved in time for it to go through the v2.28 testing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants