# Hexbot

Hexbot is a small implementation of the Proximal Policy Optimization (PPO) algorithm for training a deep neural network to play the board game Hex. The code leverages several standard `pytorch` functionalities but otherwise implements PPO "from scratch" (see the acknowledgements for our sources of inspiration). Hexbot also includes an implementation of an environment for the Hex board game (without the swap rule), with rendering functionality based on the `pygame` library.
- Installation
- Usage examples
- Overview of the Hexbot evolution loop
- Background: The Hex board game
- Background: PPO
- References and acknowledgements
## Installation

- Clone or download the repository files to your machine.
- Ensure the required dependencies are installed using:

  ```
  pip install torch numpy pygame
  ```
## Usage examples

- **Hexbot evolution (training multiple policies).** The bot evolution library can be called from your Python code as follows:

  ```python
  from botevolution import BotEvolution

  bot_evo = BotEvolution(monitoring=True)
  bot_evo.evolve()
  ```

  The argument `monitoring=True` will render a single game between the two best-performing bots at the end of each generation (see below). Several further arguments can be passed to the `BotEvolution` class, controlling, for instance, the hex board size or the number of agent policies to be trained per generation; the code files contain extensive docstrings listing these arguments. Bots will be saved in a folder specified by the `rootfolder` argument (defaulting to `"./output/"`). Bots can be loaded using `torch.load`. See, for instance, the `make_them_play` function, which loads two specified bots and renders a game (or $n$ games) in which these bots play against one another:

  ```python
  import botevolution

  botevolution.make_them_play("PATH-BOT1", "PATH-BOT2", NUMBER_OF_GAMES)
  ```
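  A saved bot can also be loaded directly, e.g. for inspection. The path below is a placeholder, and we assume (as suggested by the use of `torch.load`) that the saved file contains the pickled agent module itself rather than only a state dict.

  ```python
  import torch

  # Placeholder path: saved bots live under the folder given by `rootfolder`.
  bot = torch.load("./output/PATH-BOT1")
  print(bot)  # for an nn.Module this prints the network architecture
  ```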
- **Hexbot training (training a single policy).** The bot training library can be used as follows:

  ```python
  from bottrainer import BotTrainer
  from hexbot import HexBot

  hex_size = BOARD_SIZE  # e.g. 8
  device = DEVICE        # e.g. torch.device("cpu")

  agent_nn = HexBot(hex_size=BOARD_SIZE).to(device)
  agent_trainer = BotTrainer(game_size=BOARD_SIZE, bot_brain=agent_nn)
  agent_trainer.train()
  ```
- **Hex game (using the game environment).** The hexgame library can be used as follows:

  ```python
  from hexgame import HexGame

  game_env = HexGame(BOARD_SIZE, auto_reset=False, opponent_policy=SOME_POLICY)

  terminated = False
  while not terminated:
      game_state_array = game_env.flat_state()
      _, _, terminated = game_env.step(SOME_VALID_ACTION)

  input("Press Enter to quit.")
  ```

  Here, `SOME_POLICY` is a `Callable` that makes a valid move (the action space is the discrete range of numbers `0` to `BOARD_SIZE**2`), and `SOME_VALID_ACTION` has to be a valid move, which may, for instance, be computed using `game_state_array`. Several other arguments can be passed to the `HexGame` class; see the `hexgame.py` file. Other functionalities of the `HexGame` class include `reset()` (which resets the game state to the initial state) and `render()` (which renders the current game state using the `pygame` library).
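  As a minimal illustration, the hypothetical policy below picks a uniformly random empty cell from the flattened state. It assumes that `flat_state()` returns a NumPy-compatible array in which empty cells are encoded as `0`, and that the policy receives this array as its argument; both assumptions may differ from the actual `HexGame` interface.

  ```python
  import numpy as np

  def random_empty_cell_policy(flat_state) -> int:
      """Return the index of a uniformly random empty cell.

      Assumption: empty cells are encoded as 0 in the flattened board;
      adapt the mask below to the actual encoding of HexGame.flat_state().
      """
      empty_cells = np.flatnonzero(np.asarray(flat_state) == 0)
      return int(np.random.choice(empty_cells))
  ```

  A function of this kind could serve as `SOME_POLICY`, or be used to compute `SOME_VALID_ACTION` from `game_state_array` in the loop above.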
- **Hex bot module (modifying the neural network architecture).** The NN architecture is a simple two-layer network with fewer than 1000 parameters. The architecture can be modified in `hexbot.py`. Note that the neural network architecture is globally fixed (i.e. the same for all agent policies) when training multiple policies via `BotEvolution`. A hypothetical sketch of such a network is shown after this list.
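As an illustration of the kind of small two-layer policy/value network referred to in the last item, the hypothetical sketch below maps a flattened board to a pair $(\pi, v)$. The layer sizes, input encoding, and forward signature are assumptions and do not reproduce the actual `HexBot` class in `hexbot.py`.

```python
import torch
import torch.nn as nn

class TinyHexNet(nn.Module):
    """Hypothetical two-layer policy/value network (not the actual HexBot).

    The hidden size below is a made-up choice; the real HexBot keeps the
    total parameter count under 1000.
    """

    def __init__(self, hex_size: int, hidden: int = 16):
        super().__init__()
        n_cells = hex_size ** 2
        self.body = nn.Sequential(nn.Linear(n_cells, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_cells)  # one logit per board cell
        self.value_head = nn.Linear(hidden, 1)         # scalar value estimate

    def forward(self, flat_board: torch.Tensor):
        h = self.body(flat_board)
        pi = torch.softmax(self.policy_head(h), dim=-1)  # distribution over moves
        v = self.value_head(h)                           # expected-reward estimate
        return pi, v
```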
## Overview of the Hexbot evolution loop

`BotEvolution` trains neural networks (= agent policies) to play the game Hex. Agent policies output a pair $(\pi, v)$ consisting of a probability distribution over the action space and a value function which estimates the expected reward of the current state. Training is performed via self-play in the following nested stages (listed from outer to inner):
- Generations: in each generation, a set of agent policies is trained. At the end of the generation, the best performing policies are selected.
- Sample updates: within each generation, we compute gameplay samples by letting current agents play against previously saved policies.
- Training epochs: within each sample update, we use the computed sample to train our neural network via several gradient updates.
- Gradient updates: within each epoch, we update our agent policies using the Adam optimizer with a loss function (sketched schematically after this list) that combines
  - the clipped PPO loss (which amplifies advantageous actions),
  - a value-function mean-square loss (which measures actual reward against the estimate),
  - an entropy loss (which encourages making definitive choices).
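The following sketch shows schematically how these three terms might be combined into a single loss; the function, its signature, and the coefficients are illustrative and are not taken from the Hexbot code.

```python
import torch

def combined_ppo_loss(new_log_probs, old_log_probs, advantages, values, returns,
                      entropies, clip_eps=0.2, value_coef=0.5, entropy_coef=0.01):
    """Schematic combination of the three loss terms listed above.

    All arguments are 1-D tensors over a batch of sampled moves; the
    coefficients are common defaults, not the values used by Hexbot.
    """
    # Clipped PPO loss: amplify actions with positive advantage, while keeping
    # the new policy close to the one that generated the samples.
    ratio = torch.exp(new_log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()

    # Value-function mean-square loss: actual returns vs. the value estimate.
    value_loss = ((values - returns) ** 2).mean()

    # Entropy term, added with a positive sign so that minimizing the loss
    # lowers entropy, i.e. encourages definitive choices (many PPO variants
    # instead subtract an entropy bonus to encourage exploration).
    entropy_term = entropies.mean()

    return policy_loss + value_coef * value_loss + entropy_coef * entropy_term
```

Minimizing a combined loss of this kind with Adam over several epochs per sample corresponds to the gradient-update stage above.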
## Background: The Hex board game

See here for an explanation of the rules of Hex. Note that our implementation omits the swap rule. Interestingly, the game became algorithmically tractable on larger boards only recently, using modern machine learning approaches (see this 2019 paper by Meta AI).
## Background: PPO

A good introduction to PPO may be found here. The basic idea (shared by most RL algorithms) is to play the game, learn what to expect, observe which moves turn out better than expected, and then amplify those moves.
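For reference, PPO (Schulman et al., 2017) maximizes the clipped surrogate objective

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\; \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\Big)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},$$

where $\hat{A}_t$ is the advantage estimate: moves that turn out better than expected have positive advantage and are made more likely, while the clipping keeps each policy update close to the previous one.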
## References and acknowledgements

We took inspiration from existing small projects on the topic, in particular: