Description
The following code takes the probability of the argmax action from the current actor's logits and then diffs it against the log prob of the action recorded from the last actor model iteration. Should we instead gather the probability of the action stored in old_actions rather than just taking the max, so that we are comparing the probability of the same action under the two policy iterations?
```python
# get action log prob
actions_prob = (
    torch.softmax(actions_logits, dim=-1).max(dim=-1).values
)
```
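A minimal sketch of the suggested alternative, assuming actions_logits has shape (batch, num_actions), old_actions holds the token/action ids sampled by the previous policy, and old_action_log_prob is the log prob recorded at sampling time (tensor names and shapes here are illustrative, not the repo's actual variables):

```python
import torch

# Hypothetical shapes for illustration: batch of 4, 10 possible actions.
actions_logits = torch.randn(4, 10)           # current policy logits
old_actions = torch.randint(0, 10, (4,))      # actions sampled by the previous policy
old_action_log_prob = torch.randn(4)          # log probs recorded when sampling

# Gather the log prob of the *same* action the old policy took,
# instead of taking the max over the current policy's distribution.
log_probs = torch.log_softmax(actions_logits, dim=-1)
action_log_prob = log_probs.gather(-1, old_actions.unsqueeze(-1)).squeeze(-1)

# PPO importance ratio pi_new(a|s) / pi_old(a|s) for the identical action a.
ratio = (action_log_prob - old_action_log_prob).exp()
```

With the max-based version, the numerator and denominator of the ratio can refer to different actions, which breaks the importance-sampling interpretation of the PPO objective; gathering by old_actions keeps both terms tied to the same action.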