PPO outputs NaNs action values #321

giangdao1402 · 2025-05-19T11:43:54Z


Hi @Toni-SM
Thanks for your good work!
I am currently training a differential-drive mobile robot (2 wheels) using PPO in Isaac Lab.
But there is a problem: my network outputs NaN action values: acts : tensor([[nan, nan]], device='cuda:0')
When I use the default training template with the .yaml config, everything works well, but when I implement my own network to clip the actions, errors occur.
My simple network is shown below:

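import torch
import torch.nn as nn
from skrl.models.torch import DeterministicMixin, GaussianMixin, Model


class Policy(GaussianMixin, Model):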
    def __init__(self, observation_space, action_space, device, clip_actions=False,
                 clip_log_std=True, min_log_std=-20, max_log_std=2, reduction="sum"):
        Model.__init__(self, observation_space, action_space, device)
        GaussianMixin.__init__(self, clip_actions, clip_log_std, min_log_std, max_log_std, reduction)

        self.net = nn.Sequential(nn.Linear(self.num_observations, 512),
                                 nn.ELU(),
                                 nn.Linear(512, 256),
                                 nn.ELU(),
                                 nn.Linear(256, 64),
                                 nn.ELU(),
                                 nn.Linear(64, self.num_actions))
        self.log_std_parameter = nn.Parameter(torch.zeros(self.num_actions))

    def compute(self, inputs, role):
        raw = torch.tanh(self.net(inputs["states"]))   # (B, 2)  in (-1, +1)

        # scale factors
        low  = torch.tensor([0.0, -1.0], device=raw.device)
        high = torch.tensor([1.0,  1.0], device=raw.device)

        mean  = 0.5 * (high + low) + 0.5 * (high - low) * raw   # (B*L, 2)
        log_std = self.log_std_parameter.expand_as(mean)        # (B*L, 2)

        return mean, log_std, {}
    
        # return 2 * torch.tanh(self.net(inputs["states"])), self.log_std_parameter, {}


class Value(DeterministicMixin, Model):
    def __init__(self, observation_space, action_space, device, clip_actions=False):
        Model.__init__(self, observation_space, action_space, device)
        DeterministicMixin.__init__(self, clip_actions)
        self.net = nn.Sequential(nn.Linear(self.num_observations, 512),
                                 nn.ELU(),
                                 nn.Linear(512, 256),
                                 nn.ELU(),
                                 nn.Linear(256, 64),
                                 nn.ELU(),
                                 nn.Linear(64, 1))

    def compute(self, inputs, role):
        return self.net(inputs["states"]), {}
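
As a sanity check on the rescaling used for the mean in Policy.compute, here is a minimal standalone sketch in plain Python (no torch or skrl). The helper names rescale and policy_mean are made up for illustration and are not part of skrl; the bounds are the low/high values from the code above. It only shows that tanh followed by the affine map keeps finite inputs inside the bounds, while a NaN input passes straight through.

```python
import math

# Per-action bounds taken from the post: first action in [0, 1], second in [-1, 1].
LOW = (0.0, -1.0)
HIGH = (1.0, 1.0)


def rescale(raw, low, high):
    """Map raw in [-1, 1] affinely onto [low, high] (same formula as the mean above)."""
    return 0.5 * (high + low) + 0.5 * (high - low) * raw


def policy_mean(pre_activations):
    """Hypothetical stand-in for the mean computation: tanh, then per-action rescaling."""
    return [
        rescale(math.tanh(x), lo, hi)
        for x, lo, hi in zip(pre_activations, LOW, HIGH)
    ]


print(policy_mean([0.0, 0.0]))           # midpoint of each range: [0.5, 0.0]
print(policy_mean([50.0, -50.0]))        # saturated tanh hits the bounds: [1.0, -1.0]
print(policy_mean([float("nan"), 0.0]))  # NaN in, NaN out for the first action
```

If this reasoning carries over to the torch version, the rescaling cannot turn a finite network output into NaN, so a NaN mean would suggest that NaN is already present upstream, for example in the observations or in the network weights. That is an inference from the formula, not something confirmed against skrl itself.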

I have already tried using the default skrl network, without modifying the compute function of the Policy class, but it does not work. Can you give me some suggestions for solving this problem?
