Description
While working on my package, I noticed that the EpsilonGreedyExplorer behaves strangely with respect to its output.
Here is the relevant function:
function (s::EpsilonGreedyExplorer{<:Any,false})(values, mask)
    ϵ = get_ϵ(s)
    s.is_training && (s.step += 1)
    rand(s.rng) >= ϵ ? findmax(values, mask)[2] : rand(s.rng, findall(mask))
end
It seems that, depending on whether the explorer returns the greedy choice (left side of the ternary) or a random choice (right side), the output is respectively:
- for the greedy choice: the index of the selected value within the subset of authorized values.
- for the random choice: the index of the selected value within the original set of values.
Let me illustrate the problem with a small example:
using Random  # provides MersenneTwister
values = Float32[-0.48240864, 0.07573502, -0.19618785, 0.25742468]
mask = Bool[1, 1, 0, 1]
rng = MersenneTwister()
It's clear that the highest value is at index 4. Let's simulate the output of the explorer:
- if the explorer decides to return the index of the highest authorized value (greedy choice), it returns 3, which is the position of the selected value within the subset of authorized indices, not its index in the full set of values:
julia> findmax(values, mask)[2]
3
This is exactly the behavior one would expect from the definition of the RLCore function:
Base.findmax(A::AbstractVector, mask::AbstractVector{Bool}) = findmax(i -> A[i], view(keys(A), mask))
- if the explorer decides to return a random index, it returns the index of the selected value in the original set of values (here 1, 2 or 4):
julia> rand(rng, findall(mask))
4
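The two cases can be reproduced without loading the package. This is only a minimal sketch: `masked_findmax` is a hypothetical stand-in that copies the body of the RLCore method quoted above (to avoid redefining `Base.findmax`), the seed is arbitrary, and it assumes Julia 1.7 or later for `findmax(f, domain)`.

```julia
using Random

values = Float32[-0.48240864, 0.07573502, -0.19618785, 0.25742468]
mask = Bool[1, 1, 0, 1]

# Hypothetical stand-in with the same body as the RLCore method quoted above.
masked_findmax(A, mask) = findmax(i -> A[i], view(keys(A), mask))

# Original indices that the mask allows.
allowed = findall(mask)

# Greedy branch: `idx` is a position inside `allowed`, not inside `values`.
val, idx = masked_findmax(values, mask)

# Mapping the position back through `allowed` recovers the original index.
original_idx = allowed[idx]

# Random branch: the sample is already an index into `values`.
rng = MersenneTwister(1234)  # arbitrary seed, for reproducibility only
random_idx = rand(rng, allowed)
```

Here `idx` is 3 while `original_idx` is 4, and `random_idx` is always one of the entries of `allowed`.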
The meaning of the output is thus inconsistent between the two branches. I am still discovering the package, so please let me know if I made a mistake. If this behavior turns out to be a bug, I can propose a simple fix for it.
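As a rough idea of what such a fix could look like, here is a sketch in which both branches return an index into the original `values`. The function `select_action` and its signature are hypothetical and not part of the package. It assumes Julia 1.7 or later, where `argmax(f, domain)` returns the maximizing element of `domain` rather than its position.

```julia
using Random

# Hypothetical helper, not the package's API: ϵ-greedy selection where
# both branches return an index into the original `values`.
function select_action(rng::AbstractRNG, ϵ, values, mask)
    allowed = findall(mask)  # original indices permitted by the mask
    if rand(rng) >= ϵ
        # Greedy branch: argmax returns the element of `allowed` (an original
        # index) whose value is largest.
        return argmax(i -> values[i], allowed)
    else
        # Random branch: sample directly among the allowed original indices.
        return rand(rng, allowed)
    end
end

values = Float32[-0.48240864, 0.07573502, -0.19618785, 0.25742468]
mask = Bool[1, 1, 0, 1]
rng = MersenneTwister(1234)  # arbitrary seed

greedy = select_action(rng, 0.0, values, mask)    # ϵ = 0: always greedy
explore = select_action(rng, 1.0, values, mask)   # ϵ = 1: always random
```

With ϵ = 0 the result is 4, the index of the highest authorized value in the full vector, and with ϵ = 1 the result is one of 1, 2 or 4.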