Description
Is your feature request related to a problem? Please describe.
The braket_container.py script, which the CUDA-Q BYOC image uses to launch the user-provided algorithm script, is not thread safe. When several processes run it concurrently, race conditions can occur, in particular in the step that downloads, extracts, and makes available the customer code to be executed in the job. This is a problem specifically for multi-GPU workflows (single- and multi-instance), an area where CUDA-Q can provide acceleration. The original script in the amazon-braket-containers repository does not account for multiple processes running in an MPI context at all. The script in this repository at least performs some basic handling of the MPI ranks here:
# Add wait time to resolve race condition
import time
rank = int(os.getenv("OMPI_COMM_WORLD_NODE_RANK", "0"))
time.sleep(rank)
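However, this handling is both inefficient and ultimately not bulletproof: every process sleeps for as many seconds as its node rank whether or not the wait is needed, and the race can still occur if, for example, downloading the user-provided algorithm code from S3 takes longer than expected.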
Describe the solution you'd like
The script should be refactored for real thread safety.
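One possible direction, as a minimal sketch rather than a proposed patch: serialize the download-and-extract step with an exclusive file lock, so that exactly one process per node performs it and the others block until it has finished. The helper name `run_once_per_node`, the lock and marker paths, and the `action` callback are hypothetical and do not exist in braket_container.py. The sketch assumes that all ranks on a node share a local filesystem on which `fcntl.flock` works.

```python
import fcntl
import os
from pathlib import Path


def run_once_per_node(lock_path, marker_path, action):
    """Run `action` in exactly one process per node.

    Hypothetical helper: every process that calls it blocks on the same
    lock file. The first one to acquire the lock runs `action` and writes
    a marker file; later ones see the marker and skip the work.
    Returns True if this process ran `action`, False otherwise.
    """
    # Open in append mode so an existing lock file is never truncated.
    with open(lock_path, "a") as lock_file:
        # Blocks until this process holds the exclusive lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if os.path.exists(marker_path):
                # Another process already completed the work.
                return False
            action()
            # Written only after `action` succeeded, so waiting
            # processes never observe a half-finished extraction.
            Path(marker_path).touch()
            return True
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

In the container script, `action` would wrap the existing download and extraction logic, and every rank would call the helper before importing the customer code. Unlike `time.sleep(rank)`, the wait lasts only as long as the download actually takes. The marker file would need a job-specific path (or cleanup at startup) so that a stale marker from an earlier run is not mistaken for a completed download.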
Describe alternatives you've considered
It would be even better, IMO, to improve the original script (https://github.com/amazon-braket/amazon-braket-containers/blob/main/src/braket_container.py) and copy it into the image directly in the Dockerfile, rather than keeping a duplicate in this repository, e.g.:
FROM ...
# other instructions...
RUN git clone --depth=1 https://github.com/amazon-braket/amazon-braket-containers.git
RUN cp amazon-braket-containers/src/braket_container.py /opt/ml/code/braket_container.py
ENV SAGEMAKER_PROGRAM=braket_container.py