The goal of the YAMS Reinforcement Learning (RL) project is to develop an agent capable of playing Yahtzee optimally using RL algorithms. The agent aims to maximize its total score over 10 games by learning which dice to keep and which to re-roll. The RL model is evaluated on its ability to choose optimal actions given the game's state and on its learning progress.
Each game of Yahtzee is divided into thirteen rounds. During each round, a player rolls five dice. The player can re-roll the dice up to two more times and may choose which specific dice to re-roll. After completing the rolls, the player must assign the final roll to one of the thirteen categories on their score-sheet. The score for each round is determined by how well the dice match the chosen category. Once a category is selected, it cannot be used again for the remainder of the game. The game ends when all categories are filled, and the player with the highest total score wins.
For example, imagine a player rolls 1, 2, 2, 3, and 5 on the first roll. The player decides to re-roll the 3 and 5, obtaining a 2 and a 4. On the final re-roll, the player re-rolls the 4 and gets another 2, resulting in a final roll of 1, 2, 2, 2, 2. The player then assigns this roll to the "Twos" category, where the score is the sum of all dice showing a 2. In this case, the score would be 2 + 2 + 2 + 2 = 8 points for that round.
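To make this scoring rule concrete in code, here is a minimal sketch of an upper-section scorer; the function name `score_upper_category` and its signature are illustrative and not taken from the project code:

```python
def score_upper_category(roll, face):
    """Score an upper-section category: the sum of all dice showing `face`.

    `roll` is a list of die values, e.g. [1, 2, 2, 2, 2];
    `face` is the category's target value (1 for Ones, 2 for Twos, ...).
    """
    return sum(d for d in roll if d == face)

# Final roll from the example above: the "Twos" category scores 2 + 2 + 2 + 2 = 8.
assert score_upper_category([1, 2, 2, 2, 2], face=2) == 8
```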
The agent’s objective is to maximize the total score after 10 games. The performance measures include:
- Total Score: The cumulative score obtained over the 10 games.
- Efficiency: The ability to make optimal decisions after each roll.
- Number of Dice Kept: Indicates the strategy adopted by the agent.
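A minimal evaluation loop matching these measures could look like the sketch below; `agent.play_game()` is a hypothetical method standing in for one complete game played by the agent under evaluation:

```python
import statistics

def evaluate(agent, n_games=10):
    """Play `n_games` complete games and report the performance measures above."""
    scores = [agent.play_game() for _ in range(n_games)]  # one final score per game
    return {
        "total": sum(scores),
        "max": max(scores),
        "min": min(scores),
        "mean": statistics.mean(scores),
    }
```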
The environment consists of two parts: the full game environment and the turn environment.
The game environment defines the rules of the Yahtzee game, manages turns and episodes, and assigns rewards based on the actions taken.
Main attributes:
- `self.dices`, `self.faces`, `self.turns`: parameters of the game.
- `self.S`: describes the point categories in the game (sum of faces, three of a kind, pair, chance, etc.).
Key methods:
- `get_action`: calculates the possible rewards for a given state.
- `play_episode`: simulates a full game episode, including dice rolls and selected actions.
- `choose_action` and `choose_random_action`: select actions either strategically (via Q-values) or randomly.
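Putting these attributes and methods together, the game environment can be pictured roughly as the skeleton below; the constructor arguments and default values are assumptions for illustration, and only the attribute and method names come from the project:

```python
class GameEnvironment:
    """Full Yahtzee game: rules, turns, episodes, and rewards."""

    def __init__(self, dices=5, faces=6, turns=13):
        self.dices = dices   # number of dice rolled each turn
        self.faces = faces   # sides per die
        self.turns = turns   # number of rounds (category choices) per game
        self.S = []          # point categories (sum of faces, three of a kind, pair, chance, ...)

    def get_action(self, state):
        """Calculate the possible rewards for a given state."""
        raise NotImplementedError

    def play_episode(self, agent):
        """Simulate a full game episode: dice rolls plus the agent's selected actions."""
        raise NotImplementedError

    def choose_action(self, state, Q):
        """Select an action strategically, via its Q-value."""
        raise NotImplementedError

    def choose_random_action(self, state):
        """Select a legal action uniformly at random."""
        raise NotImplementedError
```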
The turn environment models a single game round, managing the state of the dice, possible actions, and transitions between states.
Main attributes:
- `self.dices`, `self.faces`: number of dice and sides per die.
- `self.Roll`, `self.Roll_P`: possible dice states and their probabilities.
- `self.S`: list of initial dice states.
- `self.Aa`: list of all possible actions.
Key methods:
- `get_states`: generates all possible dice states and their probabilities.
- `get_actions_from_state`: determines the possible actions for a given state.
- `get_actions_list`: generates a global list of all possible actions for all states.
- `One_step_backward`: updates state values and Q-values through a dynamic learning step.
- `choose_best_action`: selects the action with the highest Q-value for a given state.
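At its core, `get_actions_from_state` enumerates which dice can be kept (the rest being re-rolled). The sketch below captures that idea independently of the project's exact data structures; the function name `keep_subsets` is illustrative:

```python
from itertools import combinations

def keep_subsets(roll):
    """All distinct 'keep' actions for a roll: every sub-multiset of the dice.

    Each action is the sorted tuple of dice kept; the remaining dice are re-rolled.
    Keeping every die corresponds to stopping the turn early.
    """
    actions = set()
    for k in range(len(roll) + 1):
        for combo in combinations(roll, k):
            actions.add(tuple(sorted(combo)))
    return sorted(actions)

# For the roll [1, 2, 2] the keep-actions are (), (1,), (2,), (1, 2), (2, 2), (1, 2, 2).
print(keep_subsets([1, 2, 2]))
```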
The project features several agents that employ different reinforcement learning techniques, including:
- Random Agent: Chooses actions randomly.
- Greedy Agent: Selects the action with the highest immediate reward.
- Monte Carlo Agent: Learns from complete episodes by averaging returns from different states.
- Q-learning Agent: Uses Q-values to make decisions based on past experiences.
- SARSA Agent: Similar to Q-learning but updates Q-values using the action taken in the next step.
- Perceptron Q-learning Agent: Combines Q-learning with a perceptron model for function approximation.
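Several of these agents rely on an epsilon-greedy action-selection rule: explore with probability epsilon, otherwise exploit the current Q-values. A minimal sketch, assuming Q is a dictionary keyed by (state, action) pairs with unseen pairs defaulting to zero:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.2):
    """Return a random action with probability `epsilon`, else the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)                               # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))       # exploit
```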
| Property | Description |
|---|---|
| Observability | The environment is fully observable. The agent can see the value of all dice after each roll. |
| Determinism | The environment is stochastic due to the random nature of dice rolls. |
| Dynamics | The environment is sequential, as decisions (such as which dice to keep or re-roll) affect future outcomes. |
| Discreteness | The game is discrete, with each action occurring in defined steps (roll, keep, re-roll). |
| Autonomy | The agent is autonomous, making decisions based on its observations, without external interference. |
| Multi-agent | This is a single-agent environment, where only one player interacts with the environment. |
- Maximum score: 186
- Minimum score: 82
- Average score: 119.01

- Greedy Agent Level 1: It never re-rolls the dice after the first roll and chooses the action that immediately maximizes its score.
- Greedy Agent Level 2: It can re-roll once and chooses the action that maximizes the expected score after one re-roll.
- Greedy Agent Level 3: It can re-roll twice and chooses the action that maximizes the expected score after two re-rolls.
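The "expected score after one re-roll" used by Level 2 can be obtained by averaging over the equally likely outcomes of the re-rolled dice. The sketch below illustrates the idea; `best_immediate_score` is a hypothetical scoring function that returns the best category score available for a final roll:

```python
from itertools import product

def expected_score_after_reroll(kept, n_reroll, best_immediate_score, faces=6):
    """Expected best score when `kept` dice are held and `n_reroll` dice are re-rolled.

    Each re-rolled die lands on 1..faces with equal probability, so the expectation
    is the average of `best_immediate_score` over all faces**n_reroll outcomes.
    """
    outcomes = list(product(range(1, faces + 1), repeat=n_reroll))
    total = sum(best_immediate_score(list(kept) + list(extra)) for extra in outcomes)
    return total / len(outcomes)
```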
- SARSA Update:
  Q(s, a) ← Q(s, a) + α [ r + γ Q(s', a') − Q(s, a) ]
  where a' is the next action chosen by the policy in the next state s'.
- Q-Learning Update:
  Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]
  Q-Learning uses the best possible Q-value without considering the agent's current policy.
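The two update rules differ only in their bootstrap target, which the sketch below makes explicit; Q is assumed to be a dictionary keyed by (state, action) pairs, with alpha and gamma as listed in the hyperparameters further down:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.01, gamma=0.9):
    """On-policy target: bootstrap on the action a' actually chosen in s'."""
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

def q_learning_update(Q, s, a, r, s_next, next_actions, alpha=0.01, gamma=0.9):
    """Off-policy target: bootstrap on the best Q-value available in s'."""
    best_next = max((Q.get((s_next, a2), 0.0) for a2 in next_actions), default=0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))
```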
Reference: "Reinforcement Learning for Solving Yahtzee", https://web.stanford.edu/class/aa228/reports/2018/final75.pdf

- Number of dice: 5
- Number of faces per die: 6
- Epsilon (exploration): 0.2
- Alpha (learning rate): 0.01
- Gamma (discount factor): 0.9
- Maximum number of turns per game: 7
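Expressed as a simple configuration dictionary (the variable and key names are illustrative), these settings read:

```python
CONFIG = {
    "n_dice": 5,      # number of dice
    "n_faces": 6,     # faces per die
    "epsilon": 0.2,   # exploration rate
    "alpha": 0.01,    # learning rate
    "gamma": 0.9,     # discount factor
    "max_turns": 7,   # maximum number of turns per game
}
```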

This project is licensed under the MIT License - see the LICENSE file for details.