
Ranges in DerivedNames for Rugrep #2

@noraj

The issue is that DerivedName.txt contains several ranges:

➜ cat data/DerivedName.txt | grep '\.\.'
3400..4DBF    ; CJK UNIFIED IDEOGRAPH-*
4E00..9FFF    ; CJK UNIFIED IDEOGRAPH-*
F900..FA6D    ; CJK COMPATIBILITY IDEOGRAPH-*
FA70..FAD9    ; CJK COMPATIBILITY IDEOGRAPH-*
17000..187F7  ; TANGUT IDEOGRAPH-*
18B00..18CD5  ; KHITAN SMALL SCRIPT CHARACTER-*
18D00..18D08  ; TANGUT IDEOGRAPH-*
1B170..1B2FB  ; NUSHU CHARACTER-*
20000..2A6DF  ; CJK UNIFIED IDEOGRAPH-*
2A700..2B739  ; CJK UNIFIED IDEOGRAPH-*
2B740..2B81D  ; CJK UNIFIED IDEOGRAPH-*
2B820..2CEA1  ; CJK UNIFIED IDEOGRAPH-*
2CEB0..2EBE0  ; CJK UNIFIED IDEOGRAPH-*
2EBF0..2EE5D  ; CJK UNIFIED IDEOGRAPH-*
2F800..2FA1D  ; CJK COMPATIBILITY IDEOGRAPH-*
30000..3134A  ; CJK UNIFIED IDEOGRAPH-*
31350..323AF  ; CJK UNIFIED IDEOGRAPH-*

Currently, this code converts the hexadecimal code point string to an integer:

https://github.com/Acceis/unisec/blob/6ba37eaa22cefa1995dba8312d6cdbc4f1234904/lib/unisec/rugrep.rb#L41

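which ignores ranges: String#to_i stops at the first character that is not a valid hex digit, so everything from the `..` onward is silently dropped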

irb(main):001:0> '2CEB0..2EBE0'.to_i(16)
=> 183984
irb(main):002:0> '2CEB0'.to_i(16)
=> 183984

As a result, each range is displayed as a single code point (the first one of the range):

➜ unisec grep '' | grep 'NUSHU'
U+16FE1 𖿡    NUSHU ITERATION MARK
U+1B170 𛅰    NUSHU CHARACTER-*
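
One way to avoid the truncation shown above is to split the code point field on `..` before converting it. The following is only a minimal sketch: `parse_derived_name_line` is a hypothetical helper name, not something that exists in unisec, and it assumes the `code points ; name` line layout visible in the grep output above.

```ruby
# Hypothetical helper (not part of unisec): parse one line of DerivedName.txt.
# Returns [first, last, name] as [Integer, Integer, String], with
# first == last for a single code point, or nil for lines without data.
def parse_derived_name_line(line)
  # Skip blank lines and comment lines.
  return nil if line.strip.empty? || line.start_with?('#')

  code_points, name = line.split(';', 2).map(&:strip)
  return nil if name.nil? # no ';' separator: not a data line

  # '1B170..1B2FB' => ['1B170', '1B2FB'], '16FE1' => ['16FE1']
  first, last = code_points.split('..', 2)
  last ||= first
  [first.to_i(16), last.to_i(16), name]
end
```

With this, `parse_derived_name_line('1B170..1B2FB  ; NUSHU CHARACTER-*')` keeps both ends of the range instead of only `0x1B170`.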

Possible solutions:

  1. Parse the file properly and display ranges with a horizontal ellipsis
    • Pros: keeps a single command
    • Cons: adds code complexity; the output becomes inconsistent (bad for piping to other commands)
  2. Add a sub-command named ranges
    • Pros: keeps the output of the grep command consistent
    • Cons: splits the feature across several commands
  3. Append the range end to the name, e.g. U+1B170 𛅰 NUSHU CHARACTER-* (up to U+1B2FB)
    • Pros: keeps a single command; the code point column stays consistent
    • Cons: the name column becomes unreliable (information is appended to the name)
  4. Expand the names dynamically
    • Pros: no inconsistency, no unreliable column
    • Cons: the output for matching results becomes very large for little added value, and hard to read
  5. Add a third field for comments
    • Cons: new behavior for just a few exceptions
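
To compare solutions 1 and 3 concretely, here is a rough sketch of what each output line could look like. `format_entry` and its `style:` parameter are hypothetical names for illustration only, and the exact layout for solution 1 is an assumption, since only solution 3 comes with an example above.

```ruby
# Hypothetical formatter (not unisec's actual output code).
# first/last are Integer code points; name is the DerivedName.txt name field.
def format_entry(first, last, name, style: :suffix)
  start = format('U+%04X', first)
  char = [first].pack('U') # UTF-8 character for the first code point
  return "#{start} #{char}    #{name}" if first == last

  stop = format('U+%04X', last)
  case style
  when :ellipsis # solution 1: the range goes in the code point column
    "#{start}…#{stop}    #{name}"
  when :suffix   # solution 3: the range end is appended to the name
    "#{start} #{char}    #{name} (up to #{stop})"
  else
    raise ArgumentError, "unknown style: #{style}"
  end
end
```

The `:suffix` style leaves the first two columns parseable by existing pipelines, while the `:ellipsis` style changes the shape of the code point column, which is the inconsistency noted in the cons of solution 1.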

Example of name expansion for solution 4: http://www.unicode.org/charts/beta/nameslist/n_F900.html
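
Name expansion for solution 4 could be sketched as below. It assumes the `*` placeholder stands for the code point written in uppercase hexadecimal, which matches the linked names list for the F900 block. `expand_range` is a hypothetical helper, not an existing unisec method.

```ruby
# Hypothetical helper (not part of unisec): expand a DerivedName.txt range
# entry into one [code_point, name] pair per code point.
# Lazy, so large ranges are not materialized until they are consumed.
def expand_range(first, last, pattern)
  (first..last).lazy.map do |cp|
    hex = format('%04X', cp)       # e.g. 0x1B170 => "1B170"
    [cp, pattern.sub('*', hex)]    # "NUSHU CHARACTER-*" => "NUSHU CHARACTER-1B170"
  end
end
```

For example, `expand_range(0x1B170, 0x1B2FB, 'NUSHU CHARACTER-*').first(2)` yields the pairs for U+1B170 and U+1B171. The laziness matters because the largest ranges listed above span tens of thousands of code points, which is also why the expanded output gets so large.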
