Store RL Agent — Gridworld Q‑learning

An RL agent learns to “shop smart” in a grid world, balancing product rewards against store dirtiness.


Overview

A compact reinforcement‑learning project. An agent moves on a 2D grid containing multiple stores. Each store yields a multi‑attribute reward (bread, milk, eggs) and a dirt penalty.
Using Q‑learning with decaying ε‑greedy exploration, the agent learns a policy that trades off product quality against store cleanliness. The training loop tracks the Reward Prediction Error (RPE) and visualizes learning as a state‑value heatmap.
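The per‑step update behind this loop can be sketched as follows. This is a minimal illustration, not the script's actual internals: the function name, the `alpha`/`gamma` defaults, and the assumption that `Q` is a dict keyed by `(state, action)` all mirror the README's description but are otherwise hypothetical.

```python
def q_update(Q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """One Q-learning step; returns the RPE (temporal-difference error)."""
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    # RPE: how much better/worse the outcome was than predicted.
    rpe = reward + gamma * best_next - Q.get((state, action), 0.0)
    Q[(state, action)] = Q.get((state, action), 0.0) + alpha * rpe
    return rpe
```

Tracking the returned `rpe` per step and averaging per episode gives the learning curve the script plots.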


Features

  • Grid world with randomly placed stores (default: 5)
  • Hidden average rewards per product (bread, milk, eggs)
  • Dirt level per store (penalty)
  • Discrete actions: UP, DOWN, LEFT, RIGHT
  • Q‑learning with ε‑greedy decay
  • RPE tracking and per‑episode averages
  • Live heatmap of learned state values; final view after training
  • Saves results.csv (metrics) and q_table.pkl (learned values)

Why this matters

A clear, compact example of multi‑attribute decision‑making in RL.
It showcases core ideas—Q‑learning, exploration vs. exploitation, and RPE—while integrating multiple utilities (products) and a cost (dirt) in a single value signal.


Install

python -m venv .venv
# Windows: .venv\Scripts\activate
source .venv/bin/activate
pip install -r requirements.txt

Requires Python 3.9+.


Run

python bandit_game.py

A pygame window opens and training runs for the configured number of episodes.
At the end, the script shows a final view and then an RPE plot.


Configuration

Tune parameters near the top of bandit_game.py (grouped and commented).

Training

Key                     Default
EPISODES                730
MAX_STEPS_PER_EPISODE   100
ALPHA                   0.1
GAMMA                   0.9

Exploration

Key                     Default
EPSILON_START           0.9
EPSILON_MIN             0.05
EPSILON_DECAY_RATE      0.005
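The README lists only the constants, not the schedule itself, so here is one plausible reading: an exponential decay floored at EPSILON_MIN. The function name and the exact decay formula are assumptions.

```python
import math

EPSILON_START, EPSILON_MIN, EPSILON_DECAY_RATE = 0.9, 0.05, 0.005

def epsilon_at(episode):
    """Exponentially decayed exploration rate, floored at EPSILON_MIN."""
    return max(EPSILON_MIN, EPSILON_START * math.exp(-EPSILON_DECAY_RATE * episode))
```

With these defaults, exploration starts near-random and bottoms out at 5% well before the 730-episode run ends.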

Environment

Key                     Default
GRID_W, GRID_H          20, 15
N_STORES                5
DIRT_PENALTY_SCALE      2.0
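Per the feature list, a store visit combines noisy per-product rewards with a scaled dirt penalty. A sketch of that combination, assuming Gaussian noise around hidden per-product means (the function name, signature, and distribution choice are illustrative, not taken from the script):

```python
import random

DIRT_PENALTY_SCALE = 2.0

def visit_reward(avg_rewards, dirt_level, noise=0.1):
    """Sample a multi-attribute reward minus a dirt penalty.

    avg_rewards: hidden per-product means, e.g. {'bread': 1.0, 'milk': 0.5, 'eggs': 0.8}
    dirt_level:  store dirtiness in [0, 1]
    """
    products = sum(random.gauss(mu, noise) for mu in avg_rewards.values())
    return products - DIRT_PENALTY_SCALE * dirt_level
```

Folding the penalty into the same scalar reward is what lets plain Q-learning trade quality against cleanliness without any multi-objective machinery.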

Output

  • results.csv – episode metrics (e.g., average RPE, episode reward)
  • q_table.pkl – pickled dict of (state, action) -> Q‑value
  • Matplotlib window – RPE learning curve

Quick‑look (Python):

import csv, pickle

with open('q_table.pkl', 'rb') as f:
    Q = pickle.load(f)

with open('results.csv', newline='') as f:
    print(next(csv.reader(f)))  # header row
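Since q_table.pkl stores a (state, action) -> Q-value dict, the learned greedy policy can be recovered directly. A small helper for that (the function name is ours; only the dict layout comes from the README):

```python
import pickle

def greedy_policy(Q):
    """Map each state to its highest-valued action in a (state, action) -> value dict."""
    best = {}
    for (state, action), value in Q.items():
        if state not in best or value > best[state][1]:
            best[state] = (action, value)
    return {state: action for state, (action, _) in best.items()}

# After a training run:
# with open('q_table.pkl', 'rb') as f:
#     print(greedy_policy(pickle.load(f)))
```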

Repository Layout

bandit_game.py        # main script (training, visualization, saving)
requirements.txt      # pygame, matplotlib
LICENSE               # MIT
README.md             # this file

Roadmap

  • CLI flags (argparse) for episodes/epsilon grid search
  • Deterministic seeding option for reproducibility
  • Per‑store reward distributions for each product (non‑Gaussian options)
  • Export policy/value heatmaps as images
  • Unit tests for reward and transition functions

License

MIT — see LICENSE.
