Hi,
First I have to say: great work! Without your repository I wouldn't have been able to get first results so quickly. =)
I took your algorithm together with your maze implementation. Unfortunately I had to change a few things, since the maze code is a bit older and does not include all of your newer changes. After getting rid of all errors, I wasn't able to train anything for a long time, until I realized that a negative reward of <= -1 does not work at all.
To verify this, I initialized the network randomly and forced a single agent to always run directly into the wall during training. What I expected is that the probability of every action converges to 0.33, except for the one that leads the agent into the wall, which should go to 0.
But I observed a different behavior: the action that leads the agent into the wall converges to zero as expected, but one of the other actions converges to one and the remaining two go to zero as well.
As far as I can tell, it is always the action with the highest probability right after initialization that gets emphasized during this training process.
I can't imagine that this is intended behavior, or am I wrong?
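For reference, here is a toy sketch of what I mean (a plain softmax policy over the four maze actions with an unbaselined REINFORCE-style update; this is not your actual code and ignores any value baseline or entropy term your implementation may use). With the wall action forced and a constant negative reward, the remaining action with the largest initial probability keeps growing fastest:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def forced_wall_run(reward, steps=20000, lr=0.01, seed=0):
    rng = np.random.default_rng(seed)
    logits = rng.normal(scale=0.5, size=4)   # random init, 4 maze actions
    wall_action = 0                          # the agent is forced to take this action
    for _ in range(steps):
        probs = softmax(logits)
        # gradient of log pi(wall_action) w.r.t. the logits: one_hot - probs
        grad_log_pi = -probs
        grad_log_pi[wall_action] += 1.0
        # REINFORCE-style ascent on reward * log pi(a); a negative reward pushes
        # the forced action down and every other logit up in proportion to its
        # current probability ("rich get richer")
        logits += lr * reward * grad_log_pi
    return softmax(logits)

print(forced_wall_run(reward=-1.0))   # one of the non-wall actions ends up near 1
```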
When I clip the reward to a minimum of -0.9, it works fine!
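Roughly what I mean by clipping (a minimal sketch with hypothetical names, not the repository's actual code):

```python
import numpy as np

def clip_reward(reward, low=-0.9, high=1.0):
    # keep the reward strictly above -1 before it enters the training update
    return float(np.clip(reward, low, high))

# e.g. right after the environment step:
# reward = clip_reward(raw_reward)
```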
With that change, my agent finds a path with minimal cost, even though the learning diverges again after several thousand steps.
I guess this could also be an issue related to the same problem.
Regards, Martin