Q-learning is a model-free reinforcement learning algorithm that learns the quality of actions, telling an agent which action to take under which circumstances. This study simulates the Cart Pole environment from OpenAI Gym and trains a Q-learning agent to maximize its reward. A set of baseline hyperparameters serves as a benchmark against which the effect of tuning is measured. We examine the impact of changing the learning rate, discount rate, epsilon value, and epsilon decay rate, using average score metrics and a rendered environment to assess how well the agent performs. Certain parameter changes produce a dramatic difference in performance. The findings of this study indicate that hyperparameter tuning is important for a reinforcement learning environment.
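As a minimal sketch of the setup the abstract describes, the snippet below creates the Cart Pole environment from OpenAI Gym, discretizes its continuous observations so a tabular Q-learning agent can index a Q-table, and defines illustrative baseline hyperparameters. The specific values (ALPHA, GAMMA, EPSILON, the bin counts and ranges) are assumptions for illustration, not the paper's reported settings.

```python
import gym
import numpy as np

# Hypothetical baseline hyperparameters (illustrative; not the paper's exact values)
ALPHA = 0.1     # learning rate
GAMMA = 0.99    # discount factor
EPSILON = 1.0   # initial exploration rate
BINS = (6, 12)  # discretization bins for pole angle and angular velocity

env = gym.make("CartPole-v1")

# CartPole observations are continuous, so a tabular agent must
# discretize them before they can index a Q-table.
def discretise(obs):
    angle, ang_vel = obs[2], obs[3]
    angle_bin = int(np.digitize(angle, np.linspace(-0.21, 0.21, BINS[0] - 1)))
    vel_bin = int(np.digitize(ang_vel, np.linspace(-2.0, 2.0, BINS[1] - 1)))
    return angle_bin, vel_bin

# Q-table indexed by (angle bin, angular-velocity bin, action)
q_table = np.zeros(BINS + (env.action_space.n,))
```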
We can conclude that Q-learning is an effective way to tackle the problem of balancing the Cart Pole. The agent achieved good rewards, but there is clear room for improvement. A low learning rate works well because it avoids overly aggressive updates to the Q-table. A high discount factor performs best in this setting: the intermediate states, where the pole begins to tip, carry important information that should not be discounted away. Decaying the epsilon value also gives a good exploration-exploitation tradeoff. The graphs show that stability is a major factor. Experience Replay and Deep Learning could be used to obtain better and more stable results.
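To make the hyperparameter roles concrete, the sketch below shows one training episode with the standard Q-learning update and a per-episode epsilon decay schedule. It reuses the hypothetical discretise() helper, q_table, env, ALPHA, and GAMMA from the earlier sketch, assumes the Gym >= 0.26 reset/step API, and uses illustrative decay values rather than the paper's reported ones.

```python
def run_episode(env, q_table, alpha, gamma, epsilon):
    obs, _ = env.reset()
    state, total_reward, done = discretise(obs), 0.0, False
    while not done:
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit.
        if np.random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q_table[state]))
        obs, reward, terminated, truncated, _ = env.step(action)
        next_state, done = discretise(obs), terminated or truncated
        # Q-learning update: move Q(s, a) toward the bootstrapped target;
        # a low alpha keeps individual updates small, a high gamma preserves
        # credit for the intermediate "tipping" states.
        target = reward + gamma * np.max(q_table[next_state]) * (not terminated)
        q_table[state + (action,)] += alpha * (target - q_table[state + (action,)])
        state, total_reward = next_state, total_reward + reward
    return total_reward

# Decaying epsilon trades early exploration for later exploitation
# (decay rate and floor are illustrative assumptions).
epsilon, decay, min_epsilon = 1.0, 0.995, 0.05
for episode in range(500):
    score = run_episode(env, q_table, ALPHA, GAMMA, epsilon)
    epsilon = max(min_epsilon, epsilon * decay)
```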