More (useless) code point normalization #2

@Artoria2e5

  • Big5-HKSCS EUDA→PUA: https://github.com/stanfordnlp/CoreNLP/wiki/Chinese-Private-Use-Area-code-points (crappy script indeed). A rough, table-driven sketch follows this list.
  • Reverse the Source Separation Rule, as if we were facing CJK extension blocks and doing some regular normalization (which gets rid of compatibility code points). Might be interesting for feeding the LM text from different CJK languages to obtain a blocky soup... (see the normalization sketch below)
    • Idea: we can let the bot generate pseudo-pseudo-Chinese (偽偽中国語, a.k.a. zh@face_white) by feeding it Japanese text without kana alongside normal Chinese. This normalization may help the LM stir things further by making character equivalence explicit, although some extra intervention to revert Han simplification on both scripts would probably help more. A kana-stripping sketch is also included below.
    • Somewhat similar to unifying zh-Hans and zh-Hant with OpenCC, but more ambitious and playful.
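
For the first bullet, a minimal table-driven sketch. The direction of the mapping, the tab-separated format, and the file it is loaded from are all assumptions; nothing here ships such a table — it would have to be scraped from the wiki page above or from the HKSCS mapping data.

```python
# Minimal sketch for the Big5-HKSCS PUA remapping bullet above.
# Assumes a two-column hex table (source code point, target code point)
# dumped into a text file; the file name and format are made up.

def load_pua_map(path: str) -> dict[int, int]:
    """Read tab-separated '<src-hex> <dst-hex>' lines into a str.translate() table."""
    table: dict[int, int] = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            src_hex, dst_hex = line.split("\t")
            table[int(src_hex, 16)] = int(dst_hex, 16)
    return table

def remap_pua(text: str, table: dict[int, int]) -> str:
    """Swap private-use (or EUDA-derived) code points per the table."""
    return text.translate(table)
```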
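For the second bullet, regular normalization already handles the compatibility code points, since CJK Compatibility Ideographs carry singleton canonical decompositions. Reversing the Source Separation Rule on top of that has no Unicode normalization form, so it would need a hand-curated table — left empty here as a placeholder:

```python
import unicodedata

def fold_compat(text: str) -> str:
    # NFC folds CJK Compatibility Ideographs onto their unified counterparts.
    return unicodedata.normalize("NFC", text)

# e.g. U+F900 becomes U+8C48 under NFC
assert fold_compat("\uF900") == "\u8C48"

# Undoing the Source Separation Rule is ours to define: a table of
# "duplicate" ideographs to merge, filled in entry by entry.
SSR_FOLD: dict[int, int] = {}

def fold_ssr(text: str) -> str:
    return fold_compat(text).translate(SSR_FOLD)
```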
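And for the pseudo-pseudo-Chinese idea, stripping kana from the Japanese side is just a range deletion; which ranges to include (e.g. halfwidth kana) is a matter of taste:

```python
import re

# Hiragana, Katakana, and Katakana Phonetic Extensions; halfwidth kana
# (U+FF66..U+FF9F) could be added if we want to be stricter.
KANA = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u31F0-\u31FF]+")

def strip_kana(text: str) -> str:
    """Drop kana so the Japanese input reads like 偽中国語."""
    return KANA.sub("", text)

print(strip_kana("私は中国語を勉強しています"))  # -> 私中国語勉強
```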
