Description
I am struggling to understand your reasoning here:
Issue: The paper states that the number of possible action sequences should be 2^N. However, I could only find the single sequence of all right actions plus N other sequences that each terminate with the wrong action, which puts (N(N+1)/2 + N) transitions in the replay memory.
Can you show how this holds for a simple case such as N = 3?
Here is mine:
This will form our replay memory. In total, there will be (N*(N+1)/2 + N) transitions in the list.
This also doesn't match what the paper reports. According to the paper:
The replay memory contains all the relevant experience (the total number of transitions is 2^(n+1) - 2)
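To make the mismatch concrete, here is a minimal sketch of the counting argument from this issue. It assumes that a wrong action ends the episode immediately, so the only episodes are k right actions followed by one wrong action (k = 0..N-1), plus the single all-right episode. The function names `enumerate_episodes` and `count_transitions` are hypothetical and are not taken from the repository or the paper.

```python
def enumerate_episodes(n):
    """List the episodes described in this issue for a chain of n states.

    Assumption: a wrong action terminates the episode, so each episode is
    k right actions followed by one wrong action (k = 0..n-1), plus the
    single episode made of n right actions.
    """
    episodes = [["right"] * k + ["wrong"] for k in range(n)]
    episodes.append(["right"] * n)
    return episodes


def count_transitions(n):
    """Total transitions stored if every episode above is replayed once."""
    return sum(len(episode) for episode in enumerate_episodes(n))


for n in (3, 4, 5):
    enumerated = count_transitions(n)
    closed_form = n * (n + 1) // 2 + n   # count derived in this issue
    paper_count = 2 ** (n + 1) - 2       # count quoted from the paper
    print(n, enumerated, closed_form, paper_count)
```

For N = 3 this enumeration yields 4 episodes and 9 transitions, while the paper's formula gives 14, so the two counts already disagree in the smallest interesting case.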
In the paper they show that returning from state N to state 1 can give a reward of either 1 (green arrow) or 0 (dashed red arrow). How did you decide to implement this?

