MegaHan97K: A Large-Scale Dataset for Mega-Category Chinese Character Recognition with over 97K Categories

We introduce MegaHan97K, a mega-category, large-scale dataset covering 97,455 Chinese character categories, the largest category count among existing datasets.
MegaHan97K covers 97,455 character categories, at least six times more than existing datasets, and also holds the largest volume.
MegaHan97K is the first dataset to support the latest Chinese GB18030-2022 standard, ensuring the most comprehensive coverage and compatibility with modern Chinese processing systems.
MegaHan97K contains three distinct subsets: handwritten, historical, and synthetic. Each subset contains more character categories than existing datasets, giving MegaHan97K clear advantages in scale and diversity.
MegaHan97K effectively mitigates long-tail distribution issues by providing a balanced and sufficient number of samples for each category, which supports robust training and validation of Chinese character recognition (CCR) models.
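
To make the point about balanced per-category samples concrete, the sketch below shows one way to measure how evenly samples are spread across categories once labels are available. The function name `category_balance` and the toy labels are illustrative assumptions; they are not part of this repository and do not reflect actual MegaHan97K statistics.

```python
from collections import Counter
from typing import Dict, Iterable


def category_balance(labels: Iterable[str]) -> Dict[str, float]:
    """Summarize how evenly samples are spread across categories.

    Hypothetical helper for illustration; not part of the MegaHan97K repo.
    """
    counts = Counter(labels)
    if not counts:
        raise ValueError("no labels provided")
    sizes = sorted(counts.values())
    return {
        "categories": len(sizes),
        "min_samples": sizes[0],
        "max_samples": sizes[-1],
        # A ratio close to 1 means a balanced set; a large ratio means a long tail.
        "imbalance_ratio": sizes[-1] / sizes[0],
    }


# Toy labels, not taken from MegaHan97K.
toy_labels = ["一", "一", "丁", "丁", "七", "七", "七"]
print(category_balance(toy_labels))
```

An imbalance ratio near 1 indicates that the largest and smallest categories have similar sample counts, while a long-tailed dataset would show a much larger ratio.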

🔥 Download

Setting	Dataset	Status
General CCR	Baiduyun:k4ch/OneDrive	Released
Zero-Shot CCR	Baiduyun:bxde/OneDrive	Released

🛠️ Usage

Clone this repo:

git clone https://github.com/SCUT-DLVCLab/MegaHan97K.git

Execute the following command to obtain example samples from the MegaHan97K dataset.

python MegaHan_Dataloader.py

Note:

The MegaHan97K dataset can only be used for non-commercial research purposes. Scholars or organizations who want to use the MegaHan97K dataset should first fill in this Application Form, sign the Legal Commitment, and email both to us (eelwjin@scut.edu.cn, cc: lianwen.jin@gmail.com). When submitting the application form, please list or attach 1-2 of your publications from the past 6 years to show that you (or your team) do research in related fields such as handwriting analysis and recognition or document image processing.
We will give you the decompression password after your application has been received and approved.
All users must follow all use conditions; otherwise, the authorization will be revoked.

To access the entire dataset, first download it, update data_root in the MegaHan_Dataloader.py script, and then run:

python MegaHan_Dataloader.py
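
Before running the loader on the full download, it can help to confirm that data_root points at the extracted files. The sketch below is one way to do that. It assumes a layout of one subfolder per category containing image files, which this README does not confirm, and `index_images` and `IMAGE_SUFFIXES` are hypothetical names that do not come from MegaHan_Dataloader.py.

```python
import tempfile
from pathlib import Path
from typing import Dict, List

# Assumed image formats; adjust to whatever the extracted dataset contains.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp"}


def index_images(data_root: str) -> Dict[str, List[Path]]:
    """Map each parent folder name (assumed to be a category) to its image files.

    Hypothetical helper; the real directory layout may differ.
    """
    root = Path(data_root)
    if not root.is_dir():
        raise FileNotFoundError(f"data_root does not exist: {root}")
    index: Dict[str, List[Path]] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            index.setdefault(path.parent.name, []).append(path)
    return index


# Self-contained demo on a throwaway folder, so no real data is needed.
with tempfile.TemporaryDirectory() as tmp:
    fake_layout = {"cat_a": ["0.png", "1.png"], "cat_b": ["0.jpg", "notes.txt"]}
    for category, names in fake_layout.items():
        folder = Path(tmp) / category
        folder.mkdir()
        for name in names:
            (folder / name).touch()
    demo_index = index_images(tmp)
    demo_counts = {category: len(paths) for category, paths in demo_index.items()}

print(demo_counts)
```

If the call raises FileNotFoundError or returns an empty index for your own data_root, the path or the assumed layout likely needs adjusting before running MegaHan_Dataloader.py.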

☎️ Contact

If you have any questions, feel free to contact Yuyi Zhang at yuyi.zhang11@foxmail.com

🌄 Gallery

Illustration of the handwritten-original data in MegaHan97K
Illustration of the handwritten-augmented data in MegaHan97K
Illustration of the M⁵HisDoc data in MegaHan97K
Illustration of the Kangxi dictionary data in MegaHan97K
Illustration of the synthetic data in MegaHan97K

💙 Acknowledgement

License

MegaHan97K should be used and distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) License for non-commercial research purposes.

Copyright

This repository can only be used for non-commercial research purposes.
For commercial use, please contact Prof. Lianwen Jin (eelwjin@scut.edu.cn).
Copyright 2025, Deep Learning and Vision Computing Lab (DLVC-Lab), South China University of Technology.
