Continuous Action Space outputs are unbounded #329

Sanjay1911 · 2025-06-28T14:39:24Z

Hello everyone,

I'm currently using Isaac Lab and SKRL for a continuous action space task where the agent moves my tooltip by a delta based on sensor collision feedback. I use SKRL with a GaussianMixin model and observed that my action output is always unbounded.

Action Space Definition:
action_space = spaces.Box(low=-0.0005, high=0.0005, shape=(2,))

SKRL GaussianMixin:

class ContinouosActionPolicy(GaussianMixin, Model):
    def __init__(self, observation_space, action_space, device,
                 clip_actions=False, clip_log_std=True, min_log_std=-20, max_log_std=2):
        Model.__init__(self, observation_space, action_space, device)
        GaussianMixin.__init__(self, clip_actions, clip_log_std, min_log_std, max_log_std)

        self.pointnet = PointNetExtractor(point_channel=3, output_dim=256)  # your observation must be [B, N, 3]

        self.actor = nn.Sequential(
            nn.Linear(264, 128),
            nn.ELU(),
            nn.Linear(128, 64),
            nn.ELU(),
        )
        self.mean_layer = nn.Linear(64, self.num_actions)
        self.log_std_parameter = nn.Parameter(0.5 * torch.ones(self.num_actions))
    
    def compute(self, inputs, role):
        states = inputs["states"]
        states = unflatten_tensorized_space(self.observation_space, states)  # https://github.com/Toni-SM/skrl/discussions/205
        pcd = states["raycaster"]
        if pcd.ndim == 2:
            pcd = pcd.unsqueeze(0)  # Ensure batch dimension exists

        pointnet_features = self.pointnet(pcd)  # -> [B, 256]

        # Other inputs (make sure all are [B, D])
        depth_tumor = states["depth_tumor"]
        if depth_tumor.ndim == 1:
            depth_tumor = depth_tumor.unsqueeze(0)

        tooltip_pos = states["tooltip_position"]
        if tooltip_pos.ndim == 1:
            tooltip_pos = tooltip_pos.unsqueeze(0)

        tooltip_quat = states["tooltip_quaternion"]
        if tooltip_quat.ndim == 1:
            tooltip_quat = tooltip_quat.unsqueeze(0)

        # Concatenate all features
        other_features = torch.cat([depth_tumor, tooltip_pos, tooltip_quat], dim=-1)  # [B, 1 + 7 = 8]
        x = torch.cat([pointnet_features, other_features], dim=-1)  # [B, 256 + 8 = 264]
        x = self.actor(x)
        raw_mean = self.mean_layer(x)
        mean = torch.tanh(raw_mean)
        noise = torch.randn_like(raw_mean)
        std = torch.exp(self.log_std_parameter)
        sampled_action = raw_mean + std * noise
        squashed_action = torch.tanh(sampled_action)  # squashing the action to be in [-1, 1] https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html , https://www.reddit.com/r/reinforcementlearning/comments/1hwau8q/clipping_vs_squashed_tanh_for_rescaling_actions/- Why should I normalize actions?
        if DEBUG:
            print("Policy raw output before tanh:", raw_mean)
            #print("Policy mean (after tanh):", mean)
            print("Policy sampled action (before squashing):", sampled_action)
            print("Policy squashed action (after tanh):", squashed_action)
        return mean, self.log_std_parameter, {}

Main Script in Isaac Lab:

def _pre_physics_step(self, actions: torch.Tensor):
        """
        Process agent actions:
        - actions[:, :2]: delta position (x, y)
        - orientation remains fixed to preop base orientation
        """
        self.actions = actions.clone()  # Store the actions for later use
        low = torch.tensor(self.single_action_space.low, device=self.device)
        high = torch.tensor(self.single_action_space.high, device=self.device)
        print(f"Picked actions: {self.actions}")
        # Scale action values to the range of the action space from [-1, 1] to [low, high] using the formula:
        # scaled_action = ((x-a)/(b-a)) * (d-c) + c where x belongs to [a, b] and scaled_action belongs to [c, d]
        # Here, a = -1, b = 1, c = low, d = high
        a,b = -1, 1  # Action space range
        # This scales the actions to the range [low, high]
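        self.actions = (((self.actions - a)/(b - a))*(high-low))+low  # Rescale actions from [-1, 1] to [low, high]
        # Note: self.actions is already in [low, high] at this point, so this second mapping rescales it again
        self.new_actions = 0.5 * (self.actions + 1.0) * (high - low) + low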
        #self.actions = torch.clamp(self.actions, low, high)
        print(f"Scaled actions: {self.actions}, {self.new_actions}")

Output:

Policy raw output before tanh: tensor([[-0.6752, -0.3823]], device='cuda:0')
Policy sampled action (before squashing): tensor([[-2.5800,  1.2784]], device='cuda:0')
Policy squashed action (after tanh): tensor([[-0.9886,  0.8561]], device='cuda:0')
Picked actions: tensor([[-0.2809, -1.1995]], device='cuda:0')
Scaled actions: tensor([[-0.0001, -0.0005]], device='cuda:0'), tensor([[-7.0198e-08, -2.9989e-07]], device='cuda:0')

When I print the picked actions, they are not bounded. Since I already apply a tanh squash, I then map the result to the low and high of my action space. I would expect the picked actions to lie in [-1, 1], but that's not the case. I'm not an RL expert and am unsure whether this would harm my training.

Is there any other approach to do this cleanly?
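
As a worked check of the rescaling in _pre_physics_step, the minimal sketch below applies the same affine map in plain Python to the first printed "Picked actions" value. The helper name rescale is hypothetical, and scalars stand in for the torch tensors. Under these assumptions, applying the map once gives roughly -1.4e-04, and applying it a second time (as the new_actions line does) collapses the value to roughly -7.0e-08.

```python
def rescale(x, low, high, a=-1.0, b=1.0):
    """Map x from [a, b] to [low, high] with an affine transform (hypothetical helper)."""
    return (x - a) / (b - a) * (high - low) + low


LOW, HIGH = -0.0005, 0.0005  # bounds of the Box action space in the post

picked = -0.2809                    # first "Picked actions" value from the output
once = rescale(picked, LOW, HIGH)   # intended scaling, roughly -1.4e-04
twice = rescale(once, LOW, HIGH)    # scaling applied again, roughly -7.0e-08

# An input outside [-1, 1] lands outside [LOW, HIGH]: the map does not bound anything.
out_of_range = rescale(-1.1995, LOW, HIGH)

print(once, twice, out_of_range)
```

The last line illustrates that the map only relocates values: a picked action such as -1.1995 ends up below low unless it is clipped first.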

Answered by Toni-SM

Toni-SM (Maintainer) · 2025-06-30T15:55:31Z

Hi @Sanjay1911

The skrl Gaussian model concept describes the data flow when using such a mixin:

The mean (in the interval [-1, 1] due to the tanh output) feeds a Gaussian distribution. Because the standard deviation, exp(log_std), is greater than zero, samples from that distribution are not bounded to [-1, 1], as discussed in #63 (comment) (I recommend taking a look at that discussion).

If you want to limit the final output of the Gaussian model, you need to enable clip_actions while defining a bounded Gymnasium space (action_space = spaces.Box(low=-1, high=1, shape=(2,))), leaving the application of the scale (0.0005) to the task implementation.
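
To make the data flow above concrete, here is a minimal sketch in plain Python (standard library only, scalars instead of tensors). It is not skrl's implementation: sample_action, clip, and ACTION_SCALE are hypothetical names, and the clipping step only imitates what enabling clip_actions with a [-1, 1] space is described as doing. The mean and log_std values are taken from the question.

```python
import math
import random

ACTION_SCALE = 0.0005  # physical step size, applied in the task (assumed name)


def sample_action(mean, log_std, rng):
    """Draw one Gaussian sample around `mean` with std = exp(log_std)."""
    return rng.gauss(mean, math.exp(log_std))


def clip(x, low=-1.0, high=1.0):
    """Clamp x to [low, high], imitating clipping to a Box(-1, 1) space."""
    return max(low, min(high, x))


rng = random.Random(0)
mean = math.tanh(-0.3823)  # bounded mean, as returned by compute()
log_std = 0.5              # initial value of log_std_parameter in the question

raw = [sample_action(mean, log_std, rng) for _ in range(1000)]
clipped = [clip(a) for a in raw]
deltas = [ACTION_SCALE * a for a in clipped]  # scaling done on the task side

print(min(raw), max(raw))        # samples fall outside [-1, 1]
print(min(deltas), max(deltas))  # deltas stay within +/- ACTION_SCALE
```

With a bounded mean but a standard deviation of about 1.65, many raw samples leave [-1, 1]. Only the clipping step guarantees that the scaled delta stays within the intended range.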

0 replies

Sanjay1911 (Author) · 2025-07-01T08:25:57Z

Thank you @Toni-SM for your detailed answer. I already tried reducing log_std and was able to get decent values in my desired range. I also wanted to ask another question which is not specific to SKRL but RL in general as I couldn't find the information elsewhere.

Is it generally advisable in RL to pause action updates with a flag while an observation cycle is still running? I have a task where the agent's action is a delta offset from my initial position. Once this offset is known, I insert a needle, gather sensor data, and calculate rewards. During this cycle I don't want my actions to be updated, since that would move my tool away from its new position and prevent me from completing the insertion.

# pre-physics step calls
#   |-- _pre_physics_step(action)
#   |-- _apply_action()
# post-physics step calls
#   |-- _get_dones()
#   |-- _get_rewards()
#   |-- _reset_idx(env_ids)
#   |-- _get_observations()

In this flow I basically set a flag that stops the _apply_action update (which takes the action outputs from _pre_physics_step), but my post-physics calls are still used in training. From what I understand, especially for actor-critic networks where an action is taken every sim step, my task doesn't suit this setup.

Any advice/suggestion would be of immense help.

1 reply

Toni-SM Jul 1, 2025
Maintainer

Hi @Sanjay1911

One way to go is to modify the Isaac Lab DirectRLEnv class so that it steps the physics as many times as you need.
Physics stepping occurs between the pre-physics step call (_pre_physics_step(action)) and the post-physics step calls, so you have time to run your entire physics simulation (using the same action taken by the RL agent).

https://github.com/isaac-sim/IsaacLab/blob/752be19bade88c1f1e2a06a1ca6519baafbba216/source/isaaclab/isaaclab/envs/direct_rl_env.py#L341-L356
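
The idea of holding one agent action across many physics steps can be sketched with a toy class. This is a hypothetical stand-in and not the DirectRLEnv code linked above: the class, its attributes, and the per-step insertion depth are assumptions made only for illustration. The method names mirror the call order listed in the question.

```python
class ToyInsertionEnv:
    """Hypothetical toy environment; not Isaac Lab's DirectRLEnv."""

    def __init__(self, substeps):
        self.substeps = substeps   # physics steps per agent action
        self.offset = 0.0          # tooltip offset chosen by the agent
        self.depth_steps = 0       # how far the insertion has progressed
        self.physics_steps = 0
        self.agent_steps = 0

    def _pre_physics_step(self, action):
        # Latch the action once per agent step; it is not updated mid-cycle.
        self.offset = action
        self.depth_steps = 0

    def _apply_action(self):
        # Advance the insertion using the latched offset.
        self.depth_steps += 1

    def _physics_step(self):
        self.physics_steps += 1

    def step(self, action):
        self._pre_physics_step(action)
        for _ in range(self.substeps):  # same action for the whole cycle
            self._apply_action()
            self._physics_step()
        # Post-physics calls run once, after the insertion has finished.
        self.agent_steps += 1
        reward = float(self.depth_steps == self.substeps)
        observation = (self.offset, self.depth_steps)
        return observation, reward


env = ToyInsertionEnv(substeps=50)
obs, reward = env.step(0.0005)
print(obs, reward, env.physics_steps, env.agent_steps)
```

In this sketch the agent sees one transition per completed insertion, so rewards and observations are computed once per action, after the whole cycle has run.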
