Kanji usage frequency

Datasets built from various Japanese language corpora

https://scriptin.github.io/kanji-frequency/ - see this website for the dataset description. This readme describes only technical aspects.

You can download the datasets here: https://github.com/scriptin/kanji-frequency/tree/master/data

Building the datasets

You'll need Node.js 18 or later.

See the scripts section in package.json. Each script listed below can be run with npm run followed by the script name.

Aozora:

aozora:download - use crawler/scraper to collect the data
aozora:gaiji:extract - extract gaiji notation data from scraped pages. Gaiji are kanji characters that are replaced with images in the documents because the Shift-JIS encoding cannot represent them
aozora:gaiji:replacements - build the gaiji replacements file. This produces only partial results, which may need to be completed manually
aozora:clean - clean the scraped pages (apply gaiji replacements)
aozora:count - create the dataset
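
The clean and count steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the placeholder format, the function names (applyReplacements, countKanji, toSortedRows), and the use of the Unicode Han script property to decide what counts as kanji are all assumptions.

```typescript
// Hypothetical sketch, not the project's implementation.

// Maps a gaiji placeholder (assumed format) to the kanji it stands for.
type Replacements = { [placeholder: string]: string };

// Replace every occurrence of each placeholder with its kanji.
function applyReplacements(text: string, replacements: Replacements): string {
  let result = text;
  for (const placeholder of Object.keys(replacements)) {
    result = result.split(placeholder).join(replacements[placeholder]);
  }
  return result;
}

// Count characters whose Unicode script is Han.
// Assumption: the real dataset may use a different definition of "kanji".
function countKanji(text: string): Map<string, number> {
  const counts = new Map<string, number>();
  const han = new RegExp("\\p{Script=Han}", "gu");
  for (const ch of text.match(han) ?? []) {
    counts.set(ch, (counts.get(ch) ?? 0) + 1);
  }
  return counts;
}

// Turn the counts into [kanji, count] rows, most frequent first.
function toSortedRows(counts: Map<string, number>): [string, number][] {
  return Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
}
```

For example, toSortedRows(countKanji(applyReplacements(pageText, replacements))) would give frequency rows for a single cleaned page, where pageText and replacements are whatever the earlier steps produced.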

Wikipedia:

wikipedia:fetch - fetch random pages using the MediaWiki API
wikipedia:count - create the dataset
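
Fetching random pages through the MediaWiki API might look like the sketch below. It uses the API's list=random query module; the endpoint (Japanese Wikipedia), the function names, and the choice to request only main-namespace pages are assumptions, and the actual script may work differently.

```typescript
// Hypothetical sketch, not the project's implementation.

// Assumed endpoint: the Japanese Wikipedia API.
const API_URL = "https://ja.wikipedia.org/w/api.php";

// Build a query URL asking for `limit` random pages.
function buildRandomPagesUrl(limit: number): string {
  const params = new URLSearchParams({
    action: "query",
    format: "json",
    list: "random",
    rnnamespace: "0", // main (article) namespace only
    rnlimit: String(limit),
  });
  return `${API_URL}?${params.toString()}`;
}

// Fetch the titles of random pages (uses the global fetch in Node.js 18+).
async function fetchRandomTitles(limit: number): Promise<string[]> {
  const response = await fetch(buildRandomPagesUrl(limit));
  if (!response.ok) {
    throw new Error(`MediaWiki API request failed: HTTP ${response.status}`);
  }
  const data = (await response.json()) as {
    query: { random: { id: number; title: string }[] };
  };
  return data.query.random.map((page) => page.title);
}
```

The same approach should apply to Wikinews by pointing API_URL at that wiki's api.php.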

News:

news:wikinews:fetch - fetch random pages from Wikinews using the MediaWiki API
news:count - create the dataset
news:dates - create an additional file with the dates of the articles

Building the website

See Astro docs and the scripts section in package.json.
