Datasets built from various Japanese language corpora
https://scriptin.github.io/kanji-frequency/ - see this website for the dataset description. This readme describes only technical aspects.
You can download the datasets here: https://github.com/scriptin/kanji-frequency/tree/master/data
You'll need Node.js 18 or later.
See scripts section in package.json.
Aozora:
aozora:download - use crawler/scraper to collect the data
aozora:gaiji:extract - extract gaiji notation data from scraped pages. Gaiji are kanji characters that are replaced with images in the documents because the Shift-JIS encoding cannot represent them
aozora:gaiji:replacements - build the gaiji replacements file. This produces only partial results, which may need to be completed manually
aozora:clean - clean the scraped pages (apply gaiji replacements)
aozora:count - create the dataset
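
As a rough illustration of what a counting step could look like, here is a minimal sketch that tallies kanji in a string. It is not the repository's actual implementation: the function name countKanji is hypothetical, and using the Unicode Han script class to decide what counts as a kanji is an assumption (that class also covers a few Han-script marks such as 々).

```javascript
// Hypothetical helper, not taken from the repository.
// Counts Han-script characters in `text` and returns [character, count]
// pairs sorted by descending count.
function countKanji(text) {
  const counts = new Map();
  // \p{Script=Han} requires the `u` flag; `g` is required by matchAll.
  for (const match of text.matchAll(/\p{Script=Han}/gu)) {
    const kanji = match[0];
    counts.set(kanji, (counts.get(kanji) ?? 0) + 1);
  }
  // Array.prototype.sort is stable, so ties keep first-seen order.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Hiragana such as の is ignored; 本 appears twice.
console.log(countKanji("日本語の本"));
// → [ [ '本', 2 ], [ '日', 1 ], [ '語', 1 ] ]
```

Iterating over regex matches rather than over UTF-16 code units keeps characters outside the Basic Multilingual Plane intact, which matters for rare kanji.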
Wikipedia:
wikipedia:fetch - fetch random pages using MediaWiki API
wikipedia:count - create the dataset
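
For readers unfamiliar with the MediaWiki API, the sketch below shows one way random pages could be requested. It assumes the list=random query module and the ja.wikipedia.org host; the actual script may use different parameters. The function names buildRandomPagesUrl and fetchRandomTitles are hypothetical.

```javascript
// Hypothetical helpers, not taken from the repository.
// Builds a MediaWiki API URL that asks for `limit` random pages
// from the main (article) namespace.
function buildRandomPagesUrl(limit = 10, host = "ja.wikipedia.org") {
  const url = new URL(`https://${host}/w/api.php`);
  url.search = new URLSearchParams({
    action: "query",
    list: "random",
    rnnamespace: "0", // 0 = main namespace (articles only)
    rnlimit: String(limit),
    format: "json",
  }).toString();
  return url.toString();
}

// Uses the global fetch available in Node.js 18 or later.
async function fetchRandomTitles(limit = 10, host = "ja.wikipedia.org") {
  const response = await fetch(buildRandomPagesUrl(limit, host));
  if (!response.ok) {
    throw new Error(`MediaWiki API request failed: HTTP ${response.status}`);
  }
  const data = await response.json();
  // list=random responds with { query: { random: [{ id, ns, title }, ...] } }
  return data.query.random.map((page) => page.title);
}
```

Passing a different host, for example "ja.wikinews.org", would point the same request at another MediaWiki site.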
News:
news:wikinews:fetch - fetch random pages from Wikinews using MediaWiki API
news:count - create the dataset
news:dates - create additional file with dates of articles
See Astro docs and the scripts section in package.json.