RL agent learns to “shop smart” in a grid world, balancing rewards and dirtiness.
A compact reinforcement‑learning project.
An agent moves on a 2D grid with multiple stores. Each store yields a multi‑attribute reward (bread, milk, eggs) and a dirt penalty.
Using Q‑learning with decaying ε‑greedy exploration, the agent learns a policy that trades off product quality and store cleanliness. The training loop tracks the Reward Prediction Error (RPE) and visualizes learning as a live state‑value heatmap.
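Concretely, the update is the standard tabular Q‑learning rule, and the RPE here corresponds to the TD error of that update. A minimal sketch of the idea (the dict-backed table and these exact names are assumptions, not necessarily how `bandit_game.py` structures it):

```python
# Tabular Q-learning on a dict keyed by (state, action).
# The TD error `delta` is the Reward Prediction Error (RPE) the project tracks.
ACTIONS = ("UP", "DOWN", "LEFT", "RIGHT")

def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    best_next = max(Q.get((next_state, a), 0.0) for a in ACTIONS)
    delta = reward + gamma * best_next - Q.get((state, action), 0.0)  # RPE
    Q[(state, action)] = Q.get((state, action), 0.0) + alpha * delta
    return delta

Q = {}
delta = q_update(Q, (0, 0), "RIGHT", reward=1.0, next_state=(1, 0))
```

`alpha` and `gamma` correspond to the `ALPHA` and `GAMMA` keys in the configuration tables below.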
- Grid world with randomly placed stores (default: 5)
- Hidden average rewards per product (bread, milk, eggs)
- Dirt level per store (penalty)
- Discrete actions: `UP`, `DOWN`, `LEFT`, `RIGHT`
- Q‑learning with ε‑greedy decay
- RPE tracking and per‑episode averages
- Live heatmap of learned state values; final view after training
- Saves `results.csv` (metrics) and `q_table.pkl` (learned values)
A clear, compact example of multi‑attribute decision‑making in RL.
It showcases core ideas—Q‑learning, exploration vs. exploitation, and RPE—while integrating multiple utilities (products) and a cost (dirt) in a single value signal.
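As a sketch of what folding several utilities and a cost into one scalar reward might look like (the Gaussian sampling and the field names are assumptions, not the script's exact formula; see the non‑Gaussian item on the roadmap below):

```python
import random

DIRT_PENALTY_SCALE = 2.0  # weight of the dirt cost (matches the config key below)

def store_reward(store):
    # One scalar reward: noisy per-product utilities minus a scaled dirt cost.
    products = sum(random.gauss(mu, 0.5) for mu in store["means"].values())
    return products - DIRT_PENALTY_SCALE * store["dirt"]

shop = {"means": {"bread": 1.0, "milk": 0.8, "eggs": 1.2}, "dirt": 0.3}
print(round(store_reward(shop), 2))
```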
Install dependencies in a virtual environment:

```bash
python -m venv .venv
# Windows: .venv\Scripts\activate
source .venv/bin/activate
pip install -r requirements.txt
```

Requires Python 3.9+.
Run the training script:

```bash
python bandit_game.py
```

A `pygame` window opens and training runs for the configured number of episodes. At the end, the script shows a final view of the value heatmap and then an RPE plot.
Tune parameters near the top of `bandit_game.py` (grouped and commented).
**Training**

| Key | Default |
|---|---|
| `EPISODES` | 730 |
| `MAX_STEPS_PER_EPISODE` | 100 |
| `ALPHA` | 0.1 |
| `GAMMA` | 0.9 |
**Exploration**

| Key | Default |
|---|---|
| `EPSILON_START` | 0.9 |
| `EPSILON_MIN` | 0.05 |
| `EPSILON_DECAY_RATE` | 0.005 |
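One plausible reading of these keys is exponential per‑episode decay floored at `EPSILON_MIN` (a sketch; the script's exact schedule may differ):

```python
import math

EPSILON_START, EPSILON_MIN, EPSILON_DECAY_RATE = 0.9, 0.05, 0.005

def epsilon_at(episode):
    # Decay exploration exponentially, never dropping below the floor.
    return max(EPSILON_MIN, EPSILON_START * math.exp(-EPSILON_DECAY_RATE * episode))
```

Under this assumed rule, ε reaches the 0.05 floor around episode 580 of the default 730.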
**Environment**

| Key | Default |
|---|---|
| `GRID_W`, `GRID_H` | 20, 15 |
| `N_STORES` | 5 |
| `DIRT_PENALTY_SCALE` | 2.0 |
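For intuition, scattering stores on the grid under these defaults could look like the following (the placement logic is an assumption):

```python
import random

GRID_W, GRID_H, N_STORES = 20, 15, 5

def place_stores(rng=random):
    # Pick N_STORES distinct cells on the GRID_W x GRID_H grid.
    cells = [(x, y) for x in range(GRID_W) for y in range(GRID_H)]
    return rng.sample(cells, N_STORES)

print(place_stores())
```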
Outputs:

- `results.csv` – episode metrics (e.g., average RPE, episode reward)
- `q_table.pkl` – pickled dict of `(state, action) -> Q‑value`
- Matplotlib window – RPE learning curve
Quick‑look (Python):

```python
import csv
import pickle

with open('q_table.pkl', 'rb') as f:
    Q = pickle.load(f)

with open('results.csv') as f:
    print(next(csv.reader(f)))  # header row
```
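Building on the quick look, the same `Q` dict can be collapsed into state values (the heatmap's input) and a greedy policy (this assumes keys are `(state, action)` pairs, as listed above):

```python
from collections import defaultdict

# Group Q-values by state, then take maxima per state.
by_state = defaultdict(dict)
for (state, action), q in Q.items():
    by_state[state][action] = q

V = {s: max(qs.values()) for s, qs in by_state.items()}          # state values
policy = {s: max(qs, key=qs.get) for s, qs in by_state.items()}  # greedy actions
```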
Project layout:

```
bandit_game.py    # main script (training, visualization, saving)
requirements.txt  # pygame, matplotlib
LICENSE           # MIT
README.md         # this file
```
Roadmap:

- CLI flags (`argparse`) for episode/epsilon grid search
- Deterministic seeding option for reproducibility
- Per‑product store distributions (non‑Gaussian)
- Export of policy/value heatmaps as images
- Unit tests for reward and transition functions
MIT — see `LICENSE`.