this project was created to familiarize myself with reinforcement learning (deep-q, double-deep-q, deep-q-from-demonstrations) and to have a reference point for future investigations. the code is mostly annotated with source material taken from the related books, various sources around the web, and my own explanations aimed at building intuition.
for a more detailed and easier-to-follow explanation of the algorithm, look at pipe-riders.ipynb and dqfd_agent.py
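as a quick taste of what sets dqfd apart from vanilla dqn, here is a minimal sketch of the large-margin supervised loss it applies to demonstration transitions (tensor names and the margin value are illustrative, not copied from dqfd_agent.py):

```python
import torch

def large_margin_loss(q_values, expert_actions, margin=0.8):
    """DQfD supervised loss: J_E(Q) = max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E).

    q_values:       (batch, n_actions) Q-values for demonstration states
    expert_actions: (batch,) actions the expert actually took
    margin:         l(a_E, a), positive for a != a_E and zero for a == a_E
    """
    margins = torch.full_like(q_values, margin)
    margins.scatter_(1, expert_actions.unsqueeze(1), 0.0)  # no margin at the expert action
    augmented_max = (q_values + margins).max(dim=1).values
    expert_q = q_values.gather(1, expert_actions.unsqueeze(1)).squeeze(1)
    return (augmented_max - expert_q).mean()
```

in dqfd this term is only computed on demonstration samples and is added to the usual 1-step and n-step td losses (plus l2 regularization).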
python .\main-dqfd-record.py # record demonstration
python .\main-dqfd.py # train
python .\main-dqfd-test.py # inference - live environment - remember to set correct checkpoint paths
starting with an exploration rate (
- average episode length: 1434 steps
the average episode length suggested there was still room for improvement, so the agent was left running for an additional 1m steps
starting with an exploration rate (
expert demonstration data can be found here: https://mega.nz/folder/W15TQCLJ#6kABEerwaZyU9nqHxR8pyQ
the trained agent can be found here: https://mega.nz/folder/ipZDGaIb#Bn1NXlWoSvRhUp0IjnCjTw
for a more detailed and easier-to-follow explanation of the algorithm, look at pipe-riders.ipynb and ddq_agent.py
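the core double deep q-learning trick in one place: the online network picks the next action, the target network evaluates it, which reduces the overestimation bias of plain dqn. a minimal sketch (names are illustrative, not the ones used in ddq_agent.py):

```python
import torch

@torch.no_grad()
def double_dqn_targets(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    """y = r + gamma * Q_target(s', argmax_a Q_online(s', a)) for non-terminal s'."""
    # action selection with the online network ...
    next_actions = online_net(next_states).argmax(dim=1, keepdim=True)
    # ... but evaluation with the slower-moving target network
    next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
    return rewards + gamma * next_q * (1.0 - dones.float())
```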
python .\main-ddq.py # train
python .\main-ddq-test.py # inference - live environment - remember to set correct checkpoint paths
the basic deep q-learning agent could not make any noticeable progress with this (small) number of training steps due to the complexity of the game
- address the reward signal's flaws
- add a reward for actually avoiding a barrier
- or provide a slight penalty when the agent is close to a barrier to reinforce the need for caution (see the reward-shaping sketch after this list)
- this might be tricky due to the choice of emulation environment (can't extract too much from a browser-based emulator)
- add a reward for collecting time bonuses (green boxes)
- tracking a health increase can be tricky when the health bar is already full
- experiment with different parameters for the cnn (kernel size, stride, etc.)
- since we don't have exact time-steps, the input images have large variance; specific values can make the cnn too sensitive and keep it from learning effectively (e.g. a small kernel could fail to pick up the underlying pattern when input variance is high). another way to go about this is to add max-pooling layers so the network is less sensitive to the exact positions of barriers (see the cnn sketch after this list)
- try he initialization instead of xavier-glorot (also shown in the cnn sketch below)
- delayed learning signal - if the agent only receives a penalty when it hits a barrier, it may struggle to connect its previous actions to that outcome. this can make it difficult for the agent to understand which specific actions led to a negative result, hindering effective learning. the only signal it can go by is the features extracted by the convolutions, which might not be enough. [update] n-step learning should have completely addressed this issue (see the n-step return sketch after this list)
- irregular frame capture might interfere with q-value approximation / feature extraction
- because of the timeout-based frame-advancing, the captured frames are irregular, i.e. frames captured in one episode might not overlap with frames of another episode even when they should refer to the same position
- this might end up helping the neural network generalize better because it introduces variability to the cnn input as well as the output. one downside might be the need for a more complex fully-connected setup
- implementing better frame-advancing -> would require decompiling the game or hooking deeper into the emulator
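below is a minimal sketch of the proximity-penalty idea from the reward list above. the threshold and penalty scale are made up for illustration, and the distance to the nearest barrier would still have to be estimated from the captured frame (or from emulator internals), which is exactly the tricky part mentioned earlier:

```python
def shaped_reward(base_reward, distance_to_barrier, danger_zone=20.0, max_penalty=0.1):
    """Add a small penalty that grows as the agent gets closer to a barrier.

    distance_to_barrier (e.g. in pixels) has to be estimated from the captured
    frame or from emulator internals - that estimation is the hard part here.
    """
    if distance_to_barrier < danger_zone:
        closeness = (danger_zone - distance_to_barrier) / danger_zone  # 0 (far) .. 1 (touching)
        return base_reward - max_penalty * closeness
    return base_reward
```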
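a sketch of what the max-pooling and he-initialization ideas could look like in pytorch. the layer sizes assume 84x84 grayscale frame stacks and a 5-action space, which is an assumption for illustration rather than the setup used by the current agents:

```python
import torch.nn as nn

class PooledQNetwork(nn.Module):
    """Conv net where max-pooling adds some translation invariance,
    so the exact pixel position of a barrier matters less."""

    def __init__(self, in_channels=4, n_actions=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(2),   # 84x84 -> 42x42
            nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),   # 42x42 -> 21x21
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),   # 21x21 -> 10x10
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 10 * 10, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions),
        )
        # he (kaiming) initialization instead of xavier/glorot,
        # which tends to pair better with relu activations
        for m in self.modules():
            if isinstance(m, (nn.Conv2d, nn.Linear)):
                nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
                nn.init.zeros_(m.bias)

    def forward(self, x):
        return self.head(self.features(x))
```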
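and a minimal, standalone version of the n-step return mentioned in the delayed-learning-signal note - it shows how a barrier-hit penalty a few steps in the future still reaches the current update (this is not the buffer code used by the agents):

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99, done=False):
    """R = r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1} + gamma^n * bootstrap_value

    rewards:         the next n rewards following the transition
    bootstrap_value: max_a Q(s_{t+n}, a) from the target network
    done:            True if the episode ended within these n steps
    """
    ret = sum((gamma ** k) * r for k, r in enumerate(rewards))
    if not done:
        ret += (gamma ** len(rewards)) * bootstrap_value
    return ret

# a -1 barrier hit three steps ahead still shows up in the current target:
# n_step_return([0.0, 0.0, -1.0], bootstrap_value=0.0, done=True) -> -0.9801
```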
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu118
# or
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install scikit-image opencv-python
pip install numpy pandas plotly matplotlib torch-lucent torchinfo pillow
pip install keyboard selenium webdriver-manager pytest-playwright
python -m cProfile -o .\main-dqfd.generated.prof .\main-dqfd.py
python -m pip install snakeviz
python -m snakeviz .\main-dqfd.generated.prof