The policy-based version of Mortal has been open-sourced #91
-
Excellent work! Policy-based methods were what I started with at the very beginning (pre-v1). I experimented with various common methods and practices and spent a lot of time (several months) tuning them. However, they all failed to outperform the baseline. In the end, surprisingly and somewhat counterintuitively, simple value-based methods turned out to be effective, so I went on to build Mortal on them.
-
Have you tried to compare the model with Akochan or Akagi? I think mjai.app doesn't provide enough samples to compare the performance of models. Also, what's the size of your model?
-
I believe RL algorithms matter more than model architecture. I've tested Transformer architectures, CNN architectures, and hybrid architectures combining CNN with self-attention. Personally, I favor distributional Q-value methods. They not only allow for easy policy switching, but also enable the assessment of action risk variability, thereby reinforcing more deterministic strategies. Moreover, they make it easier to simulate opponents with diverse playing styles in an online training environment.
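To make the point about risk assessment and policy switching concrete, here is a minimal sketch of a quantile-style distributional Q head. This is illustrative only, not Mortal's actual code; the `DistributionalQHead` class, `N_QUANTILES`, and the `risk_weight` knob are assumptions for the example.

```python
import torch
import torch.nn as nn

# Hypothetical sketch (not Mortal's actual code): a quantile-based
# distributional Q head. Each action gets N_QUANTILES estimates of the
# return distribution instead of a single scalar Q-value.
N_QUANTILES = 32

class DistributionalQHead(nn.Module):
    def __init__(self, feature_dim: int, num_actions: int):
        super().__init__()
        self.num_actions = num_actions
        self.fc = nn.Linear(feature_dim, num_actions * N_QUANTILES)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # -> (batch, num_actions, N_QUANTILES)
        return self.fc(features).view(-1, self.num_actions, N_QUANTILES)

def select_action(quantiles: torch.Tensor, risk_weight: float = 0.0) -> torch.Tensor:
    """Risk-adjusted greedy action selection.

    risk_weight = 0.0 recovers the usual mean-Q policy; a positive value
    penalizes actions whose return distribution is wide (risk-averse),
    a negative value rewards them (risk-seeking).
    """
    mean = quantiles.mean(dim=-1)    # expected return per action
    spread = quantiles.std(dim=-1)   # variability of the return, i.e. action risk
    score = mean - risk_weight * spread
    return score.argmax(dim=-1)
```

With a single scalar like `risk_weight`, the policy can be switched between risk-neutral, risk-averse, and risk-seeking play at inference time without retraining, which is the kind of policy switching described above.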
-
I have tested the policy-based version for several months, so I want to share some thoughts. The most evident improvement of the policy-based version, in my opinion, is the sampling. I never managed to figure out the optimal hyperparameters for the Boltzmann sampler of the value-based Mortal, but in the policy-based Mortal, exploration can be easily controlled by the entropy weight. It also seems to me that the policy-based Mortal has better numerical stability, though that may just come down to random seeds. As for performance, the policy-based Mortal seems to train more slowly than the value-based Mortal. However, due to the stability issues I encountered, I was not able to take the value-based Mortal's training any further, and in the end the policy-based Mortal outperformed it after many more training steps (~50M games for the policy-based version versus ~10M for the value-based one).
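As a rough illustration of the difference described above (a sketch under assumptions, not the actual Mortal training code): exploration in the value-based setup hinges on a Boltzmann temperature that has to be hand-tuned, while in the policy-based setup a single entropy weight in the loss controls how exploratory the policy stays.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch (not the actual Mortal training code): the two
# exploration mechanisms being compared in this comment.

def boltzmann_sample(q_values: torch.Tensor, temperature: float) -> torch.Tensor:
    # Value-based exploration: the temperature has to be tuned (and usually
    # annealed) by hand, which is the hyperparameter problem described above.
    probs = F.softmax(q_values / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)

def policy_loss(logits: torch.Tensor,
                actions: torch.Tensor,
                advantages: torch.Tensor,
                entropy_weight: float) -> torch.Tensor:
    # Policy-based exploration: a single entropy_weight added to the
    # policy-gradient loss directly controls how exploratory the policy stays.
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)
    pg = -(log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1) * advantages)
    return (pg - entropy_weight * entropy).mean()
```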
-
https://github.com/Nitasurin/Mortal-Policy
It is a complete, original implementation based on MortalV2 and has been used in certain scenarios since 2022.