More (useless) code point normalization #2

@Artoria2e5

  • Big5-HKSCS EUDA→PUA: https://github.com/stanfordnlp/CoreNLP/wiki/Chinese-Private-Use-Area-code-points (crappy script indeed). A rough, table-driven sketch follows this list.
  • Reverse the Source Separation Rule, as if we were facing CJK extension blocks and doing some regular normalization (which gets rid of compatibility code points). Might be interesting for feeding the LM text from different CJK languages to obtain a blocky soup... (see the normalization sketch below)
    • Idea: we can let the bot generate pseudo-pseudo-Chinese (偽偽中国語, a.k.a. zh@face_white) by feeding it Japanese text without kana alongside normal Chinese. This normalization may help the LM stir things further by making character equivalence explicit, although some extra intervention to revert Han simplification on both scripts would probably help more. A kana-stripping sketch is also included below.
    • Somewhat similar to unifying zh-Hans and zh-Hant with OpenCC, but more ambitious and playful.
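
For the first bullet, a minimal table-driven sketch. The direction of the mapping, the tab-separated format, and the file it is loaded from are all assumptions; nothing here ships such a table — it would have to be scraped from the wiki page above or from the HKSCS mapping data.

```python
# Minimal sketch for the Big5-HKSCS PUA remapping bullet above.
# Assumes a two-column hex table (source code point, target code point)
# dumped into a text file; the file name and format are made up.

def load_pua_map(path: str) -> dict[int, int]:
    """Read tab-separated '<src-hex> <dst-hex>' lines into a str.translate() table."""
    table: dict[int, int] = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            src_hex, dst_hex = line.split("\t")
            table[int(src_hex, 16)] = int(dst_hex, 16)
    return table

def remap_pua(text: str, table: dict[int, int]) -> str:
    """Swap private-use (or EUDA-derived) code points per the table."""
    return text.translate(table)
```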
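For the second bullet, regular normalization already handles the compatibility code points, since CJK Compatibility Ideographs carry singleton canonical decompositions. Reversing the Source Separation Rule on top of that has no Unicode normalization form, so it would need a hand-curated table — left empty here as a placeholder:

```python
import unicodedata

def fold_compat(text: str) -> str:
    # NFC folds CJK Compatibility Ideographs onto their unified counterparts.
    return unicodedata.normalize("NFC", text)

# e.g. U+F900 becomes U+8C48 under NFC
assert fold_compat("\uF900") == "\u8C48"

# Undoing the Source Separation Rule is ours to define: a table of
# "duplicate" ideographs to merge, filled in entry by entry.
SSR_FOLD: dict[int, int] = {}

def fold_ssr(text: str) -> str:
    return fold_compat(text).translate(SSR_FOLD)
```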
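And for the pseudo-pseudo-Chinese idea, stripping kana from the Japanese side is just a range deletion; which ranges to include (e.g. halfwidth kana) is a matter of taste:

```python
import re

# Hiragana, Katakana, and Katakana Phonetic Extensions; halfwidth kana
# (U+FF66..U+FF9F) could be added if we want to be stricter.
KANA = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u31F0-\u31FF]+")

def strip_kana(text: str) -> str:
    """Drop kana so the Japanese input reads like 偽中国語."""
    return KANA.sub("", text)

print(strip_kana("私は中国語を勉強しています"))  # -> 私中国語勉強
```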
