Description
Hello,

I have noticed multiple issues within the `step()` function of `ray/tuner.py`, some of which prevent me from having an uninterrupted hyperparameter tuning session with Ray. Here are the issues with possible workarounds:
- There is the following loop to idle until the incoming data is updated (`IsaacLab/scripts/reinforcement_learning/ray/tuner.py`, lines 115 to 117 in `7de6d6f`):
  However, due to the `"done"` keyword we insert into `self.data` at each loop, `data` and `self.data` can never be equal, even if the underlying data are equal (`"done"` will be absent from `data`). I suggest we change this part to something like the following (see also the first sketch after this list):
  ```python
  data_ = {k: v for k, v in data.items() if k != "done"}
  self_data_ = {k: v for k, v in self.data.items() if k != "done"}
  while util._dicts_equal(data_, self_data_):
      data = util.load_tensorboard_logs(self.tensorboard_logdir)
      data_ = {k: v for k, v in data.items() if k != "done"}
      sleep(2)  # Lazy report metrics to avoid performance overhead
  ```
- The update to `self.data["done"]` that marks the run as finished currently happens here: `IsaacLab/scripts/reinforcement_learning/ray/tuner.py`, lines 104 to 105 in `7de6d6f`.
  However, from time to time I notice that the process that executes the training takes a while to return after the end of the training, and we end up inside the following loop (`IsaacLab/scripts/reinforcement_learning/ray/tuner.py`, lines 115 to 117 in `7de6d6f`, after the fix from bullet 1), from which we can never exit: the data is not updated anymore, and we don't check whether the process has returned. Consequently, Ray gets stuck there. I suggest we change both of the while loops as follows (see also the second sketch after this list):
  ```python
  while data is None:
      data = util.load_tensorboard_logs(self.tensorboard_logdir)
      sleep(2)  # Lazy report metrics to avoid performance overhead
      proc_status = self.proc.poll()
      if proc_status is not None:
          break
  if self.data is not None:
      data_ = {k: v for k, v in data.items() if k != "done"}
      self_data_ = {k: v for k, v in self.data.items() if k != "done"}
      while util._dicts_equal(data_, self_data_):
          data = util.load_tensorboard_logs(self.tensorboard_logdir)
          data_ = {k: v for k, v in data.items() if k != "done"}
          sleep(2)  # Lazy report metrics to avoid performance overhead
          proc_status = self.proc.poll()
          if proc_status is not None:
              break
  ```
- Finally, while this might not necessarily be an issue directly related to IsaacLab, I noticed that sometimes the process executing the training hangs forever right after the end of the training (maybe at `simulation_app.close()`?), halting the whole Ray session, since we can never mark the run as finished. While it might not be the best solution, I applied the following patch as a workaround, and it seems to work for me (see also the termination sketch after this list):
  ```python
  if self.data is not None:
      data_ = {k: v for k, v in data.items() if k != "done"}
      self_data_ = {k: v for k, v in self.data.items() if k != "done"}
      time_start = time.time()
      while util._dicts_equal(data_, self_data_):
          self.data_freeze_duration = time.time() - time_start
          data = util.load_tensorboard_logs(self.tensorboard_logdir)
          data_ = {k: v for k, v in data.items() if k != "done"}
          sleep(2)  # Lazy report metrics to avoid performance overhead
          proc_status = self.proc.poll()
          if proc_status is not None:
              break
          if self.data_freeze_duration > SOME_THRESHOLD:
              self.data_freeze_duration = 0.0
              self.proc.terminate()
              try:
                  retcode = self.proc.wait(timeout=20)
              except Exception as e:
                  raise ValueError("The frozen process did not terminate within timeout duration.") from e
              self.data = data
              self.data["done"] = True
              return self.data
  ```
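For reference, here is a minimal, standalone sketch of the `"done"`-agnostic comparison used in the first bullet. The helper name `_dicts_equal_ignoring` is made up for illustration, and it assumes the log values are plain Python objects that compare with `==`; inside `tuner.py`, the repo's own `util._dicts_equal` could simply be called on the filtered dicts instead.

```python
def _dicts_equal_ignoring(a: dict, b: dict, ignore: tuple = ("done",)) -> bool:
    """Return True if the two dicts are equal after dropping the ignored keys."""
    a_filtered = {k: v for k, v in a.items() if k not in ignore}
    b_filtered = {k: v for k, v in b.items() if k not in ignore}
    return a_filtered == b_filtered


# Example: only the extra "done" flag differs, so the snapshots count as equal.
print(_dicts_equal_ignoring({"reward": [1.0, 2.0], "done": False}, {"reward": [1.0, 2.0]}))  # True
```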
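Similarly, a standalone sketch of the combined wait loop from the second bullet. The function name and signature are hypothetical: `load_logs` stands in for a callable wrapping `util.load_tensorboard_logs(self.tensorboard_logdir)`, and `previous` stands in for `self.data`.

```python
import subprocess
from time import sleep
from typing import Callable, Optional


def wait_for_fresh_logs(
    load_logs: Callable[[], Optional[dict]],
    previous: Optional[dict],
    proc: subprocess.Popen,
    poll_interval: float = 2.0,
) -> Optional[dict]:
    """Wait until load_logs() returns data different from `previous`, or until `proc` exits."""

    def strip_done(d: dict) -> dict:
        # Ignore the "done" flag we add ourselves when comparing snapshots.
        return {k: v for k, v in d.items() if k != "done"}

    data = load_logs()
    while data is None or (previous is not None and strip_done(data) == strip_done(previous)):
        if proc.poll() is not None:  # the training process has returned, stop waiting
            break
        sleep(poll_interval)  # lazy polling to avoid performance overhead
        data = load_logs()
    return data
```

In `step()` this could replace both while loops, e.g. `data = wait_for_fresh_logs(lambda: util.load_tensorboard_logs(self.tensorboard_logdir), self.data, self.proc)`.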
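Finally, a sketch of the terminate-on-freeze fallback from the third bullet. `FREEZE_THRESHOLD_S`, the helper name, and the escalation to `kill()` are my own additions on top of the patch above.

```python
import subprocess

FREEZE_THRESHOLD_S = 600.0  # placeholder: how long unchanged logs are tolerated


def terminate_frozen(proc: subprocess.Popen, grace_period: float = 20.0) -> int:
    """Ask the training process to exit; escalate to kill() if it ignores the first request."""
    proc.terminate()
    try:
        return proc.wait(timeout=grace_period)
    except subprocess.TimeoutExpired:
        # e.g. a process stuck in simulation_app.close() that never reacts to terminate()
        proc.kill()
        return proc.wait()
```

The workaround above could then call `terminate_frozen(self.proc)` once the freeze duration exceeds `FREEZE_THRESHOLD_S`, and mark the trial as done afterwards.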
Additional context
I have tested these only on a single GPU (4090 RTX) and with the rsl_rl library.
System Info
Commit: bc7c9f5
Isaac Sim Version: 4.5
OS: Ubuntu 22.04
GPU: 4090 RTX
CUDA: 12.2
GPU Driver: 535.129.03
Checklist
- [x] I have checked that there is no similar issue in the repo (required)
- [x] I have checked that the issue is not in running Isaac Sim itself and is related to the repo
Acceptance Criteria
- Ray runs without interruptions or recovers from interruptions