In this work, we systematically evaluate a Latent Dynamics Model trained on offline human-generated trajectories in the Overcooked-AI cooperative environment. We combine latent dynamics modeling with Model Predictive Path Integral (MPPI)-style planning and compare it against a planning-free baseline. Our experiments reveal that although our approach improves on the greedy baseline, it significantly trails expert-level performance. We identify critical challenges, including sparse reward feedback and cumulative prediction errors during long-horizon planning. Our findings underscore that while latent dynamics-based planning shows promise for data-efficient generalization and improved cooperative decision-making from limited offline data, substantial opportunities remain for enhancing model accuracy and planning efficacy in highly cooperative, sparse-reward scenarios.
A key challenge in cooperative multi-agent reinforcement learning (MARL) is developing agents capable of coordinating effectively in complex environments from limited offline data. Two primary approaches to address this are:
- model-free methods, which learn directly from experience using value-based policies;
- model-based planning methods, which leverage learned latent dynamics to predict future states and optimize decisions.
However, the relative effectiveness of these approaches, particularly when learning from offline expert trajectories in sparse-reward, cooperative scenarios, remains under-explored.
- JEPA-style latent dynamics model trained on offline trajectories.
- The learned representation is intended to capture environment dynamics from sparse, discrete state-action data.
- Trained a Q-value predictor to evaluate state-action pairs within the learned latent space.
- Used MPPI-based planning to generate trajectories by simulating actions based on the learned dynamics model.
- Although MPPI is typically used for continuous control, we adapt it to the discrete action space (a minimal sketch follows this list).
- Randomly sampled multiple candidate action sequences.
- Weighted each rollout by its cumulative predicted Q-value.
- At each timestep, aggregated the weighted values per action and selected the action with the highest score.
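To make the discrete-action adaptation concrete, the following is a minimal sketch of the planner loop under stated assumptions: `encoder`, `dynamics`, and `q_value` stand in for our trained modules, and their interfaces, the uniform action sampling, and the softmax `temperature` are illustrative choices rather than the exact implementation. The defaults mirror the best configuration reported in the results (`num_samples = 100`, `planning_horizon = 10`).

```python
import torch

def mppi_discrete_plan(encoder, dynamics, q_value, obs, num_actions,
                       num_samples=100, planning_horizon=10, temperature=1.0):
    """MPPI-style planning adapted to a discrete action space.

    Samples random action sequences, rolls them out through the learned
    latent dynamics, scores each rollout by its cumulative predicted
    Q-value, and returns the first action with the largest
    softmax-weighted support across rollouts.
    """
    with torch.no_grad():
        z = encoder(obs.unsqueeze(0)).repeat(num_samples, 1)   # (N, latent_dim)

        # Uniformly sample N candidate action sequences of length H: (N, H)
        actions = torch.randint(num_actions, (num_samples, planning_horizon))

        returns = torch.zeros(num_samples)
        for t in range(planning_horizon):
            a_t = actions[:, t]
            returns += q_value(z, a_t).squeeze(-1)   # accumulate predicted Q-values
            z = dynamics(z, a_t)                     # advance the latent state

        # MPPI-style exponential weighting of rollouts by their return
        weights = torch.softmax(returns / temperature, dim=0)

        # Aggregate weight mass per candidate first action; execute the best one
        scores = torch.zeros(num_actions).scatter_add_(0, actions[:, 0], weights)
        return int(scores.argmax())
```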
- Environment: Overcooked-AI — a cooperative multi-agent simulation where two agents coordinate to prepare and deliver dishes in a shared kitchen.
- Dataset: Expert trajectories from human-human interactions.
- Stored as DataFrames containing (see the sketch after this list):
- State transitions
- Joint actions
- Rewards
Each trajectory represents a sequence of cooperative actions culminating in high rewards.
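For concreteness, the snippet below sketches the kind of per-trajectory DataFrame we assume; the column names and placeholder values are illustrative, not the canonical schema of the Overcooked-AI human-human dataset.

```python
import pandas as pd

# Illustrative per-trajectory layout; column names and values are placeholders,
# not the canonical schema of the Overcooked-AI human-human dataset.
trajectory = pd.DataFrame({
    # Featurized environment state at each timestep.
    "state": [[0.0] * 4, [0.1] * 4, [0.2] * 4],
    # Joint action: one discrete action index per agent per timestep.
    "joint_action": [(0, 3), (2, 2), (5, 1)],
    # Sparse reward: nonzero only when a dish is delivered.
    "reward": [0, 0, 20],
})

# Consecutive rows form (state, joint_action, reward, next_state) transitions
# used to train the latent dynamics model and the Q-value predictor.
transitions = list(zip(trajectory["state"][:-1],
                       trajectory["joint_action"][:-1],
                       trajectory["reward"][:-1],
                       trajectory["state"][1:]))
```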
- Implemented a VICReg-style loss to prevent representation collapse (a minimal sketch follows this list).
- Fine-tuned the VICReg parameters via random grid search (100 trials).
- MPPI-based planning hyperparameters:
- Explored various combinations of:
- Planning horizon (length of rollout)
- Number of samples (rollouts per step)
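The sketch below shows a minimal version of the VICReg-style regularizer applied to predicted and target latents; the coefficients shown are the standard VICReg defaults and stand in for the values we tuned via the random grid search.

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z_pred, z_target, sim_coef=25.0, var_coef=25.0, cov_coef=1.0):
    """VICReg-style loss: invariance + variance + covariance terms.

    z_pred   -- latent predicted by the dynamics model, shape (batch, dim)
    z_target -- encoder embedding of the observed next state, shape (batch, dim)
    The coefficients are the standard VICReg defaults, not our tuned values.
    """
    # Invariance: predicted latent should match the target embedding.
    sim_loss = F.mse_loss(z_pred, z_target)

    # Variance: keep per-dimension std above 1 to prevent collapse.
    def variance_term(z):
        std = torch.sqrt(z.var(dim=0) + 1e-4)
        return torch.relu(1.0 - std).mean()
    var_loss = variance_term(z_pred) + variance_term(z_target)

    # Covariance: decorrelate latent dimensions (push off-diagonals to zero).
    def covariance_term(z):
        n, d = z.shape
        z = z - z.mean(dim=0)
        cov = (z.T @ z) / (n - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return (off_diag ** 2).sum() / d
    cov_loss = covariance_term(z_pred) + covariance_term(z_target)

    return sim_coef * sim_loss + var_coef * var_loss + cov_coef * cov_loss
```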
| Model / Method | Avg. Reward |
|---|---|
| Expert Performance | ~150 |
| Greedy Baseline | 13 |
| PLDM + MPPI Planning | 26 |

- Configuration: `num_samples = 100`, `planning_horizon = 10`
Although the latent dynamics model improves on the greedy baseline, it still lags well behind human-level coordination.
While PLDM-based latent-state modeling combined with MPPI-style planning demonstrates potential in cooperative MARL, a significant gap to human-level performance remains.
- Sparse Rewards Challenge: Sparse rewards severely limit planning feedback; denser reward signals or reward shaping are needed (see the sketch after this list).
- Model Prediction Errors: Inaccuracies in the dynamics model compound into cumulative error over multi-step rollouts.
- Planning Horizon and Sampling: Shorter, more accurate horizons may outperform longer ones, and effective sampling is critical.
- Enhanced Exploration Strategies: Techniques such as intrinsic motivation or curiosity-driven rewards could help in sparse-reward settings.
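One standard way to densify feedback without changing the optimal policy is potential-based reward shaping; the sketch below illustrates the idea, with `potential` standing for a hypothetical progress heuristic (e.g., ingredients placed, soup started) that is not part of the current implementation.

```python
def shaped_reward(base_reward, prev_state, next_state, potential, gamma=0.99):
    """Potential-based reward shaping: add gamma * phi(s') - phi(s) to the
    sparse environment reward. `potential` is a hypothetical heuristic that
    scores progress toward a delivery; this is a possible direction, not
    something used in the reported experiments.
    """
    return base_reward + gamma * potential(next_state) - potential(prev_state)
```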
- Improve latent dynamics model accuracy
- Incorporate denser reward feedback
- Optimize MPPI sampling and rollout strategy
- Explore representation regularization for generalization
- Consider multi-goal inference and attention-based planning