
[Bug Report] Multiple issues within ray/tuner.py #2328

Open
2 of 3 tasks
ozhanozen opened this issue Apr 17, 2025 · 4 comments · May be fixed by #2346

Comments

@ozhanozen
Contributor

Hello,

I have noticed multiple issues within the step() function of ray/tuner.py, some of which prevent me from having an uninterrupted hyperparameter tuning session with Ray. Here are the issues, with possible workarounds:

  1. There is the following loop, which idles until the incoming data is updated:

    while util._dicts_equal(data, self.data):
        data = util.load_tensorboard_logs(self.tensorboard_logdir)
        sleep(2)  # Lazy report metrics to avoid performance overhead

    However, due to the "done" key we insert into self.data on each iteration, data and self.data can never be equal, even when the underlying data are equal ("done" is absent from data). A small standalone illustration of this is included after this list.

    I suggest we change this part to something like:

    data_ = {k: v for k, v in data.items() if k != "done"}
    self_data_ = {k: v for k, v in self.data.items() if k != "done"}
    while util._dicts_equal(data_, self_data_):
        data = util.load_tensorboard_logs(self.tensorboard_logdir)
        data_ = {k: v for k, v in data.items() if k != "done"}
        sleep(2)  # Lazy report metrics to avoid performance overhead
    
  2. The update to self.data["done"] that marks the run as finished currently happens here:

    if proc_status is not None:  # process finished, signal finish
        self.data["done"] = True

    However, from time to time, I notice that the process that executes the training takes a while to return after the end of the training, and we end up inside the following loop (after the fix from bullet 1):
    while util._dicts_equal(data, self.data):
        data = util.load_tensorboard_logs(self.tensorboard_logdir)
        sleep(2)  # Lazy report metrics to avoid performance overhead

    from which we can never exit (since the data is not updated anymore, and we do not check whether the process has returned). Consequently, Ray gets stuck there.

    I suggest we change both of the while loops as follows:

    while data is None:
        data = util.load_tensorboard_logs(self.tensorboard_logdir)
        sleep(2)  # Lazy report metrics to avoid performance overhead
        proc_status = self.proc.poll()
        if proc_status is not None:
            break

    if self.data is not None:
        data_ = {k: v for k, v in data.items() if k != "done"}
        self_data_ = {k: v for k, v in self.data.items() if k != "done"}
        while util._dicts_equal(data_, self_data_):
            data = util.load_tensorboard_logs(self.tensorboard_logdir)
            data_ = {k: v for k, v in data.items() if k != "done"}
            sleep(2)  # Lazy report metrics to avoid performance overhead
            proc_status = self.proc.poll()
            if proc_status is not None:
                break
    
  3. Finally, while this might not necessarily be an issue directly related to IsaacLab, I noticed that the process executing the training sometimes hangs forever right after the end of the training (maybe at simulation_app.close()?), halting the whole Ray run since we can never mark it as finished.

    While it might not be the best solution, I applied the following patch as a workaround, and it seems to work for me (a short sketch of the subprocess poll/terminate/wait semantics it relies on is included after this list):

    if self.data is not None:
        data_ = {k: v for k, v in data.items() if k != "done"}
        self_data_ = {k: v for k, v in self.data.items() if k != "done"}
        time_start = time.time()
        while util._dicts_equal(data_, self_data_):
            self.data_freeze_duration = time.time() - time_start
            data = util.load_tensorboard_logs(self.tensorboard_logdir)
            data_ = {k: v for k, v in data.items() if k != "done"}
            sleep(2)  # Lazy report metrics to avoid performance overhead
            proc_status = self.proc.poll()
            if proc_status is not None:
                break
            if self.data_freeze_duration > SOME_THRESHOLD:
                self.data_freeze_duration = 0.0
                self.proc.terminate()
                try:
                    retcode = self.proc.wait(timeout=20)
                except Exception as e:
                    raise ValueError("The frozen process did not terminate within the timeout duration.") from e
                self.data = data
                self.data["done"] = True
                return self.data
    
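To make point 1 above concrete, here is a minimal, standalone sketch. The dicts_equal helper and the metric names are hypothetical stand-ins; util._dicts_equal is assumed to behave like a plain key/value comparison:

    # Minimal sketch of point 1. `dicts_equal` is a hypothetical stand-in for
    # util._dicts_equal, assumed to be a plain key/value comparison.
    def dicts_equal(a: dict, b: dict) -> bool:
        return a == b

    def strip_done(d: dict) -> dict:
        return {k: v for k, v in d.items() if k != "done"}

    data = {"Episode/reward": 1.0}                   # freshly loaded TensorBoard logs
    cached = {"Episode/reward": 1.0, "done": False}  # self.data, with the injected "done" key

    # Because of the extra "done" key the comparison is always False, so the
    # "idle until the data changes" loop exits immediately instead of waiting:
    print(dicts_equal(data, cached))                          # False
    # Stripping "done" before comparing restores the intended behavior:
    print(dicts_equal(strip_done(data), strip_done(cached)))  # True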

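As background for the poll/terminate logic suggested in points 2 and 3, here is a small self-contained sketch of the standard subprocess semantics it relies on. The command and threshold below are illustrative placeholders, not values from ray/tuner.py:

    import subprocess
    import time

    # Illustrative placeholders only.
    proc = subprocess.Popen(["sleep", "5"])
    freeze_threshold = 60.0  # seconds; stands in for SOME_THRESHOLD above

    start = time.time()
    while True:
        status = proc.poll()  # None while the process runs, its return code once it has exited
        if status is not None:
            print(f"process exited with code {status}")
            break
        if time.time() - start > freeze_threshold:
            proc.terminate()  # ask the (presumably frozen) process to stop
            try:
                proc.wait(timeout=20)  # raises subprocess.TimeoutExpired if it ignores the signal
            except subprocess.TimeoutExpired as exc:
                raise ValueError("The frozen process did not terminate within the timeout.") from exc
            break
        time.sleep(2)
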
Additional context

I have tested these only on a single GPU (4090 RTX) and with the rsl_rl library.

System Info

Commit: bc7c9f5
Isaac Sim Version: 4.5
OS: Ubuntu 22.04
GPU: 4090 RTX
CUDA: 12.2
GPU Driver: 535.129.03

Checklist

  • I have checked that there is no similar issue in the repo (required)
  • I have checked that the issue is not in running Isaac Sim itself and is related to the repo

Acceptance Criteria

  • Ray runs without interruptions, or recovers from interruptions
@RandomOakForest
Collaborator

Thanks for posting this. The team will review.

@kellyguo11
Contributor

Thanks for the report! Tagging @glvov-bdai to get his feedback.

@garylvov
Contributor

I'd really appreciate it if you could PR these changes; they look good to me! Sorry about these issues, @ozhanozen, and thank you for looking into how to fix this!

If you have these on a branch/PR, I can test this on my multi-GPU machine and confirm that these changes work.

@kellyguo11 just FYI, my internship at the RAI Institute ends at the end of the month, so I'll no longer have access to or get notified on the @glvov-bdai account. Feel free to ping @garylvov instead moving forward.

@ozhanozen linked a pull request Apr 21, 2025 that will close this issue
@ozhanozen
Contributor Author

Hi @garylvov, no worries. Thank you for integrating Ray; it is quite useful for us.

I have created a PR for these changes as you have suggested.
