by Victor I. Afolabi
- state
- action
- reward
- transition function
- policy
- the Bellman equation
 
State (s): Any possible situation or position the reinforcement learning agent could be in at any given time.
Action (a): What the agent can do in the environment, i.e. how it interacts with the environment at each step.
Reward (r): What the agent receives for performing an action, R(a). It can also be written as a function of the action a taken in a given state s, denoted R(s,a), or of performing an action in a state and ending up in another state s´, denoted R(s´|s,a).
Transition function T(s,a,s´) or P(s´|s,a): The probability that the agent, being in state s and performing action a, ends up in state s´ (which could be the same state, i.e. s´ = s).
Policy (π): This is the solution to the MDP, i.e. a function that takes in a state and returns an action, π(s) -> a.
The Bellman Equation:
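In the notation above, with γ the discount factor, a standard statement of the Bellman optimality equation for the state value function is:

V*(s) = max_a Σ_s´ P(s´|s,a) · [ R(s´|s,a) + γ·V*(s´) ]

and the optimal policy follows by replacing the max with an argmax:

π*(s) = argmax_a Σ_s´ P(s´|s,a) · [ R(s´|s,a) + γ·V*(s´) ]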
Navigating OpenAI's FrozenLake environment using two model-based reinforcement learning techniques:
- Value Iteration &
- Policy Iteration
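Before running either algorithm, it helps to see the MDP pieces defined above concretely. A minimal sketch, assuming the gymnasium package (the maintained fork of OpenAI Gym; classic gym exposes the same attributes), that inspects FrozenLake's state space, action space, and transition/reward model:

import gymnasium as gym

env = gym.make("FrozenLake-v1")
print("Number of states:", env.observation_space.n)   # 16 states on the default 4x4 map
print("Number of actions:", env.action_space.n)       # 4 actions: left, down, right, up

# env.unwrapped.P[s][a] is a list of (probability, next_state, reward, done) tuples,
# i.e. the transition function P(s´|s,a) and the reward packaged together.
for prob, s_next, reward, done in env.unwrapped.P[0][2]:   # state 0, action 2 ("right")
    print(f"P(s´={s_next} | s=0, a=right) = {prob:.2f}, reward = {reward}")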
Value iteration is a model-based technique that computes the optimal state value function directly, by repeatedly applying the Bellman update to an estimate of V(s) until the values stop changing.
Pseudocode:
choose an initial estimate of the optimal value function
repeat until the change in values is sufficiently small {
  for each state {
    calculate the expected value of the
    neighboring states for each possible action
    use the maximum of these values to update the estimate of the optimal value function
  } each state
} convergence
derive the optimal policy from the Bellman Equation (argmax over actions)
- Value iteration computes the optimal state value function by iteratively improving the estimate of V(s).
- The algorithm initializes V(s) to arbitrary random values.
- It repeatedly updates the Q(s,a) & V(s) values until they converge.
- Value iteration is guaranteed to converge to optimal values.
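A runnable sketch of value iteration on FrozenLake, under the same gymnasium assumption as above; the one_step_lookahead helper, γ = 0.99 and the 1e-8 stopping threshold are illustrative choices rather than fixed parts of the algorithm:

import numpy as np
import gymnasium as gym

def one_step_lookahead(P, s, V, gamma, n_actions):
    """Expected return of each action from state s, i.e. the Q(s,a) values."""
    return [sum(p * (r + gamma * V[s_next]) for p, s_next, r, _ in P[s][a])
            for a in range(n_actions)]

def value_iteration(env, gamma=0.99, theta=1e-8):
    """Apply the Bellman optimality update until V(s) stops changing."""
    P = env.unwrapped.P                       # P[s][a] -> [(prob, s´, reward, done), ...]
    n_states, n_actions = env.observation_space.n, env.action_space.n
    V = np.zeros(n_states)                    # arbitrary initial estimate of V*(s)
    while True:
        delta = 0.0
        for s in range(n_states):
            best = max(one_step_lookahead(P, s, V, gamma, n_actions))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:                     # change in values is sufficiently small
            break
    # Derive the optimal policy from the converged V(s) with a simple argmax over actions.
    policy = np.array([int(np.argmax(one_step_lookahead(P, s, V, gamma, n_actions)))
                       for s in range(n_states)])
    return V, policy

env = gym.make("FrozenLake-v1")
V, policy = value_iteration(env)
print("V*(s):", V.round(3))
print("Greedy policy:", policy)

Note that the policy is only extracted once, after the value estimates have converged.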
Policy iteration is a model-based technique that alternates between evaluating the current policy and improving it greedily, repeating until the policy no longer changes.
Pseudocode:
choose initial policy & value function
repeat until policy is stable {
  1. Policy evaluation:
  repeat until change in values is sufficiently small {
    for each state {
      calculate the value of the neighboring states
      when taking actions according to the current policy.
      update the estimate of the current policy's value function.
    } each state
  } convergence
  2. Policy improvement:
  derive a new policy according to the Bellman Equation,
  assuming V^* ≈ current V^π
} policy
- Policy iteration, instead of repeatedly improving the value-function estimate, re-defines the policy at each step and computes the value function according to this new policy until the policy converges.
- Policy iteration is also guaranteed to converge to the optimal policy, and it often takes fewer iterations to converge than the value-iteration algorithm.
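A matching sketch of policy iteration on FrozenLake, again assuming gymnasium; the policy_evaluation and policy_improvement helpers mirror the two steps in the pseudocode above, and the γ and θ values are illustrative:

import numpy as np
import gymnasium as gym

def policy_evaluation(P, policy, V, gamma=0.99, theta=1e-8):
    """Policy evaluation: iterate until the change in the policy's values is sufficiently small."""
    while True:
        delta = 0.0
        for s in range(len(P)):
            v = sum(p * (r + gamma * V[s_next]) for p, s_next, r, _ in P[s][policy[s]])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

def policy_improvement(P, V, n_actions, gamma=0.99):
    """Policy improvement: act greedily with respect to the current value estimates."""
    return np.array([
        int(np.argmax([sum(p * (r + gamma * V[s_next]) for p, s_next, r, _ in P[s][a])
                       for a in range(n_actions)]))
        for s in range(len(P))])

def policy_iteration(env, gamma=0.99):
    P = env.unwrapped.P
    n_states, n_actions = env.observation_space.n, env.action_space.n
    policy = np.zeros(n_states, dtype=int)       # start from an arbitrary (all-left) policy
    V = np.zeros(n_states)
    while True:
        V = policy_evaluation(P, policy, V, gamma)
        new_policy = policy_improvement(P, V, n_actions, gamma)
        if np.array_equal(new_policy, policy):   # policy is stable, so it is optimal
            return V, policy
        policy = new_policy

env = gym.make("FrozenLake-v1")
V, policy = policy_iteration(env)
print("V(s) under the final policy:", V.round(3))
print("Policy:", policy)

The loop ends as soon as policy improvement leaves the policy unchanged, which is the "repeat until policy is stable" condition from the pseudocode.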
In Policy iteration algorithms, you start with a random policy, then find the value function of that policy (the policy evaluation step), then find a new (improved) policy based on the previous value function, and so on. In this process, each policy is guaranteed to be a strict improvement over the previous one (unless it is already optimal).
In Value iteration algorithms, you start with a random value function and then find a new (improved) value function in an iterative process, until reaching the optimal value function. Notice that you can easily derive the optimal policy from the optimal value function by a simple argmax operation.
Policy iteration is generally faster than value iteration, as the policy typically converges in fewer iterations than the value function.