This project enhances the Soft Actor-Critic (SAC) algorithm by incorporating adversarial heuristics to improve policy performance in reinforcement learning tasks. It is built on top of stable_baselines3.
The AdversarialSAC algorithm introduces an adversarial component to the standard SAC framework. By periodically comparing the agent's policy against a heuristic or baseline policy, the algorithm adjusts the rewards to favor actions that outperform the baseline. This approach aims to enhance the probability of selecting superior actions by rewarding transitions leading to better outcomes.
- Adversarial Heuristic Integration: Periodically evaluates the agent's policy against a baseline heuristic to adjust rewards based on performance comparisons.
- Dynamic Reward Scaling: Modifies rewards during training to emphasize transitions that result in better performance relative to the baseline.
- Enhanced Policy Improvement: Aims to increase the likelihood of selecting actions that outperform the baseline policy.
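Since the project builds on stable_baselines3, a minimal usage sketch might look like the following. This is only an illustration: the import path, and the `heuristic_policy` and `adversarial_interval` keyword arguments, are assumed names rather than the project's actual API.

```python
import gymnasium as gym

from adversarial_sac import AdversarialSAC  # hypothetical import path

env = gym.make("Pendulum-v1")

# Illustrative baseline heuristic: a random policy standing in for a real domain heuristic.
def heuristic_policy(obs):
    return env.action_space.sample()

model = AdversarialSAC(
    "MlpPolicy",
    env,
    heuristic_policy=heuristic_policy,  # assumed keyword argument
    adversarial_interval=10,            # assumed: compare against the heuristic every N episodes
    verbose=1,
)
model.learn(total_timesteps=100_000)
```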
The Adversarial Soft Actor-Critic (AdversarialSAC) algorithm enhances policy learning by periodically comparing the agent's performance against an adversarial heuristic policy. The goal is to refine the reward signal, increasing the likelihood of selecting actions that outperform the baseline.
Every N episodes, the algorithm performs a rollout using:
- Policy B (Trained Policy): The agent’s learned policy.
- Heuristic A (Baseline Policy): A predefined or adversarial heuristic policy.
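As a sketch of that periodic evaluation step, the snippet below runs one episode with each policy and records the cumulative rewards. It reuses the `model`, `env`, and `heuristic_policy` names from the usage sketch above; the helper `rollout_return` is a hypothetical name, and a Gymnasium-style environment API is assumed.

```python
def rollout_return(env, act_fn, max_steps=1000):
    """Run one episode with the given action function and return the cumulative reward."""
    obs, _ = env.reset()
    total = 0.0
    for _ in range(max_steps):
        action = act_fn(obs)
        obs, reward, terminated, truncated, _ = env.step(action)
        total += reward
        if terminated or truncated:
            break
    return total

# Every N episodes: one rollout with the trained policy (B) and one with the heuristic (A).
r_b = rollout_return(env, lambda obs: model.predict(obs, deterministic=True)[0])
r_a = rollout_return(env, heuristic_policy)
```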
At the end of each episode, the cumulative rewards (R_B for Policy B and R_A for Heuristic A) are compared:
- If R_B > R_A → Increase the rewards for all transitions in the buffer, reinforcing actions that led to better outcomes.
- If R_B < R_A → Decrease the rewards for all transitions in the buffer, discouraging actions that underperformed.
- If R_B = R_A → Rewards remain unchanged, as both policies performed equally.
- If R_A = 0 → Rewards remain unchanged to avoid division errors.
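One way this comparison could translate into buffer updates is to scale every stored reward by the ratio R_B / R_A, which is also consistent with the R_A = 0 guard above. The sketch below assumes direct access to the `rewards` array of a stable_baselines3 `ReplayBuffer`; it is an illustration of the rule, not the project's exact implementation.

```python
import numpy as np

def adjust_buffer_rewards(replay_buffer, r_b, r_a):
    """Scale stored rewards according to how Policy B compared against Heuristic A.

    Sketch only: assumes stable_baselines3's ReplayBuffer layout (`rewards`, `pos`,
    `full`, `buffer_size`) and episode returns of consistent sign, so that the
    ratio behaves as an increase/decrease factor.
    """
    if r_a == 0 or r_b == r_a:
        return  # unchanged: equal performance, or guard against division by zero
    scale = r_b / r_a  # > 1 boosts stored rewards, < 1 dampens them
    filled = replay_buffer.buffer_size if replay_buffer.full else replay_buffer.pos
    replay_buffer.rewards[:filled] = replay_buffer.rewards[:filled] * scale
```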
- In continuous action spaces, directly modifying action logits (as one could in discrete action spaces) is impractical, which is why the adjustment is applied to rewards instead.
- Assigning a reward bonus to every transition helps with the credit assignment problem, where it’s difficult to determine which specific actions contributed to a performance increase.
- Action-Based Scaling: Instead of uniformly adjusting all rewards, compare sequences of actions between Policy B and Heuristic A. Scale the reward bonus for each transition based on how much the action deviates from the adversarial baseline.
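A hedged sketch of that action-based scaling idea is shown below: per-transition bonuses are weighted by how far the agent's action strays from what the heuristic would have done for the same observation. The function name, the L2 distance, and the max-normalization are illustrative choices, not part of the project.

```python
import numpy as np

def action_based_bonuses(agent_actions, heuristic_actions, base_bonus):
    """Scale a per-transition reward bonus by the deviation from the heuristic.

    agent_actions, heuristic_actions: arrays of shape (n_steps, action_dim),
    holding the actions Policy B took and the actions Heuristic A would have
    chosen for the same observations. Illustrative sketch only.
    """
    deviation = np.linalg.norm(agent_actions - heuristic_actions, axis=-1)
    # Normalize deviations to [0, 1] so the largest disagreement receives the full bonus.
    max_dev = deviation.max()
    weights = deviation / max_dev if max_dev > 0 else np.zeros_like(deviation)
    return base_bonus * weights
```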