- We introduce MegaHan97K, a mega-category, large-scale dataset covering 97,455 Chinese character categories, the largest number of categories among existing datasets.
- MegaHan97K has at least six times more categories than existing datasets and also holds the largest data volume.
- MegaHan97K is the first dataset to support the latest Chinese GB18030-2022 standard, ensuring the most comprehensive coverage and compatibility with modern Chinese processing systems.
- MegaHan97K contains three distinct subsets: handwritten, historical, and synthetic. Each subset contains more character categories than existing datasets, giving it clear advantages in scale and diversity.
- MegaHan97K effectively mitigates long-tail distribution issues by providing a balanced and sufficient number of samples for each category, ensuring robust training and validation of Chinese character recognition (CCR) models.
Setting | Dataset | Status |
---|---|---|
General CCR | Baiduyun:k4ch/OneDrive | Released |
Zero-Shot CCR | Baiduyun:bxde/OneDrive | Released |
- Clone this repo:
git clone https://github.com/SCUT-DLVCLab/MegaHan97K.git
- Execute the following command to obtain example samples from the MegaHan97K dataset.
python MegaHan_Dataloader.py
Note:
- The MegaHan97K dataset can only be used for non-commercial research purposes. Scholars or organizations who want to use the MegaHan97K dataset should first fill in this Application Form, sign the Legal Commitment, and email both to us (eelwjin@scut.edu.cn, cc: lianwen.jin@gmail.com). When submitting the application form, please list or attach 1-2 of your publications from the past 6 years to show that you (or your team) do research in related fields such as handwriting analysis and recognition or document image processing.
- We will give you the decompression password after your application has been received and approved.
- All users must follow all use conditions; otherwise, the authorization will be revoked.
- To access the entire dataset, please first download it, update the `data_root` in the Python script `MegaHan_Dataloader.py`, and then execute `python MegaHan_Dataloader.py`.
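Before pointing `data_root` at the full dataset, it can help to check that the decompressed folder is readable and to see how many samples each category holds. The following is a minimal sketch, not the logic of `MegaHan_Dataloader.py`: it assumes a hypothetical layout of one sub-folder per character category containing image files, and the names `count_samples_per_category`, `summarize`, and `IMAGE_SUFFIXES` are illustrative only. Adjust it to the actual layout of the released archives.

```python
from collections import Counter
from pathlib import Path

# Assumption (not confirmed by the repository): the dataset decompresses to
#   data_root/<category_name>/<image files>
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp"}


def count_samples_per_category(data_root):
    """Return a Counter mapping each category folder name to its image count."""
    root = Path(data_root)
    if not root.is_dir():
        raise FileNotFoundError(f"data_root is not a directory: {root}")
    counts = Counter()
    for category_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Count only files whose extension looks like an image.
        counts[category_dir.name] = sum(
            1
            for f in category_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts


def summarize(counts):
    """Return (number of categories, fewest samples, most samples)."""
    if not counts:
        return 0, 0, 0
    values = list(counts.values())
    return len(values), min(values), max(values)
```

Calling `summarize(count_samples_per_category(data_root))` on a subset gives a quick view of how evenly samples are spread across categories; a large gap between the fewest and most samples would point to an incomplete download or a layout that differs from the assumed one.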
If you have any questions, feel free to contact Yuyi Zhang at yuyi.zhang11@foxmail.com
Illustration of the handwritten-augmented data in MegaHan97K
MegaHan97K should be used and distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) License for non-commercial research purposes.
- This repository can only be used for non-commercial research purposes.
- For commercial use, please contact Prof. Lianwen Jin (eelwjin@scut.edu.cn).
- Copyright 2025, Deep Learning and Vision Computing Lab (DLVC-Lab), South China University of Technology.