Why do we need the features of *all* rounds to predict the final reward? #19
wongsingfo asked this question in Q&A (unanswered)

Both Mortal and Suphx [1] use a global reward predictor to predict the final game reward at the point when the i-th round begins. The predictor uses the features (i.e. the scores of the 4 players, grand_kyoku, honba, and kyotaku) of not only the i-th round but also of all previous rounds.

I am wondering why we need the features from before the i-th round. I think the final reward should be conditionally independent of them given the current round's features: no matter how well or how poorly the player performed from the first round to the (i-1)-th round, the expected final ranking should be the same as long as the features of the i-th round are the same.

[1] Li et al. Suphx: Mastering Mahjong with Deep Reinforcement Learning. arXiv:2003.13590, 2020. Section 3.2.
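For concreteness, here is a minimal PyTorch sketch of a GRU-based global reward predictor of the kind described above. The class name, the 7-dimensional per-round feature layout, and the sizes are illustrative assumptions, not Mortal's or Suphx's actual implementation.

```python
import torch
import torch.nn as nn

class GlobalRewardPredictor(nn.Module):
    """Sketch of a GRU-based global reward predictor (illustrative only).

    Each round contributes one feature vector (4 player scores,
    grand_kyoku, honba, kyotaku -> 7 dims here); the GRU consumes the
    sequence of rounds 1..i and predicts the 4 players' final rewards.
    """

    def __init__(self, in_dim: int = 7, hidden: int = 64):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 4)  # one final-reward estimate per player

    def forward(self, rounds: torch.Tensor) -> torch.Tensor:
        # rounds: (batch, num_rounds_so_far, in_dim)
        _, h_n = self.gru(rounds)         # h_n: (1, batch, hidden)
        return self.head(h_n.squeeze(0))  # (batch, 4)

# Predicting at the start of the 3rd round: the model sees all 3 rounds' features.
model = GlobalRewardPredictor()
feats = torch.randn(1, 3, 7)  # batch=1, 3 rounds so far, 7 features per round
print(model(feats).shape)     # torch.Size([1, 4])
```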
Replies: 2 comments
- I'm not sure. I was following Suphx's method because it had been tested and shown to work. Maybe you could run an experiment that replaces the GRU part with a 2-layer MLP of the same number of parameters, and see whether the performance is the same (see the sketch after these replies).
- I think the assumption here is that a player will tend to use the same strategy in all rounds (this holds for both human players and AIs), so you can predict a player's behaviour from its actions in previous rounds.
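The ablation suggested in the first reply could look something like the following sketch: a 2-layer MLP that sees only the i-th round's features. Everything here (class name, feature layout, widths) is an illustrative assumption, not code from Mortal. Comparing this baseline's validation loss against the GRU predictor's would directly test the conditional-independence claim made in the question.

```python
import torch
import torch.nn as nn

class MlpRewardPredictor(nn.Module):
    """Ablation baseline: 2-layer MLP over the current round's features only.

    If the final reward really is conditionally independent of earlier
    rounds given the current round's features, this should match the GRU
    predictor. `hidden` is a free knob; widen it to match the GRU
    variant's parameter count for a fair comparison.
    """

    def __init__(self, in_dim: int = 7, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 4),  # one final-reward estimate per player
        )

    def forward(self, current_round: torch.Tensor) -> torch.Tensor:
        # current_round: (batch, in_dim) -- the i-th round's features only
        return self.net(current_round)

# Same prediction point as before, but only the current round's features are used.
model = MlpRewardPredictor()
feats = torch.randn(1, 7)
print(model(feats).shape)  # torch.Size([1, 4])
```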