
[PR 2025] The official GitHub page of "MegaHan97K: A Large-Scale Dataset for Mega-Category Chinese Character Recognition with over 97K Categories"


SCUT-DLVCLab/MegaHan97K


MegaHan97K: A Large-Scale Dataset for Mega-Category Chinese Character Recognition with over 97K Categories


SCUT DLVC Lab | Pattern Recognition | arXiv preprint | Code

  • We introduce MegaHan97K, a mega-category, large-scale dataset covering 97,455 Chinese character categories, the largest category count of any such dataset.
  • With 97,455 categories, MegaHan97K has at least six times as many categories as existing datasets, and it also has the largest data volume.
  • MegaHan97K is the first dataset to support the latest Chinese GB18030-2022 standard, ensuring the most comprehensive coverage and compatibility with modern Chinese processing systems.
  • MegaHan97K contains three distinct subsets: handwritten, historical, and synthetic. Each subset contains more character categories than existing datasets, giving it clear advantages in scale and diversity.
  • MegaHan97K effectively mitigates long-tail distribution issues by providing a balanced and sufficient number of samples for each category, which supports robust training and validation of Chinese character recognition (CCR) models.


🔥 Download

| Setting       | Dataset                | Status   |
|---------------|------------------------|----------|
| General CCR   | Baiduyun:k4ch/OneDrive | Released |
| Zero-Shot CCR | Baiduyun:bxde/OneDrive | Released |

🛠️ Usage

  • Clone this repo:
git clone https://github.com/SCUT-DLVCLab/MegaHan97K.git
  • Execute the following command to obtain example samples from the MegaHan97K dataset.
python MegaHan_Dataloader.py

Note:

  • The MegaHan97K dataset can only be used for non-commercial research purposes. Scholars or organizations who want to use the MegaHan97K dataset should first fill in this Application Form, sign the Legal Commitment, and email both to us (eelwjin@scut.edu.cn, cc: lianwen.jin@gmail.com). When submitting the application form, please list or attach 1-2 of your publications from the past 6 years to show that you (or your team) do research in related fields such as handwriting analysis and recognition or document image processing.
  • We will give you the decompression password after your application has been received and approved.
  • All users must follow all use conditions; otherwise, the authorization will be revoked.
  • To access the entire dataset, first download it, update data_root in the MegaHan_Dataloader.py script, and then execute:
python MegaHan_Dataloader.py
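Before pointing data_root at the extracted dataset, it can help to sanity-check the folder. The following is a minimal sketch, not part of the official repository: it assumes a hypothetical layout in which data_root contains one subdirectory per character category with image files inside. The real layout of MegaHan97K may differ, so adapt the traversal to what you find after decompression. The function name count_samples_per_category and the suffix list are illustrative choices.

```python
from collections import Counter
from pathlib import Path

# Assumption: samples are stored as common image files. Adjust as needed.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp"}


def count_samples_per_category(data_root):
    """Count image files under each immediate subdirectory of data_root.

    Hypothetical layout (not confirmed by the dataset authors):
        data_root/<category>/<image files, possibly nested>
    """
    root = Path(data_root)
    if not root.is_dir():
        raise FileNotFoundError(f"data_root does not exist: {root}")

    counts = Counter()
    for category_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Count every image file below this category folder, recursively.
        counts[category_dir.name] = sum(
            1
            for f in category_dir.rglob("*")
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts


# Example usage (the path is a placeholder):
# counts = count_samples_per_category("/path/to/MegaHan97K")
# print(len(counts), "categories;", sum(counts.values()), "samples")
# print("smallest categories:", counts.most_common()[-5:])
```

Listing the smallest categories is a quick way to see whether the per-category sample counts look balanced on your local copy.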

☎️ Contact

If you have any questions, feel free to contact Yuyi Zhang at yuyi.zhang11@foxmail.com

🌄 Gallery

  • Illustration of the handwritten-original data in MegaHan97K (image: handwo)

  • Illustration of the handwritten-augmented data in MegaHan97K (image: handwa)

  • Illustration of the M5HisDoc data in MegaHan97K (image: m5)

  • Illustration of the Kangxi dictionary data in MegaHan97K (image: kx)

  • Illustration of the handwritten-original data in MegaHan97K (image: mwo)

  • Illustration of the handwritten-augmented data in MegaHan97K (image: mwa)

  • Illustration of the synthetic data in MegaHan97K (image: syn)

💙 Acknowledgement

License

MegaHan97K should be used and distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) License for non-commercial research purposes.

Copyright
