Unicode support for word boundary \b #228

@gondalez

Is it possible to extend the Unicode support to the word boundary anchor (\b)?

For example, an English sentence splits as expected, but a Russian sentence cannot be split:

"hello there this is a test".split(XRegExp('\\b', 'A'))
(11) ["hello", " ", "there", " ", "this", " ", "is", " ", "a", " ", "test"]

"Сняли не первый раз изначальную и конечную сумму и начальную не вернули !!!".split(XRegExp('\\b', 'A'))
["Сняли не первый раз изначальную и конечную сумму и начальную не вернули !!!"]

^ Note that the split has no effect on the Russian text.

The desired behaviour, shown with the equivalent code in Ruby:

irb(main):001:0> "hello there this is a test".split(/\b/)
[
  "hello",
  " ",
  "there",
  " ",
  "this",
  " ",
  "is",
  " ",
  "a",
  " ",
  "test"
]
irb(main):002:0> "Сняли не первый раз изначальную и конечную сумму и начальную не вернули !!!".split(/\b/)
[
  "Сняли",
  " ",
  "не",
  " ",
  "первый",
  " ",
  "раз",
  " ",
  "изначальную",
  " ",
  "и",
  " ",
  "конечную",
  " ",
  "сумму",
  " ",
  "и",
  " ",
  "начальную",
  " ",
  "не",
  " ",
  "вернули",
  " !!!"
]
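
In the meantime, a possible workaround is to emulate the boundary with lookarounds and Unicode property escapes. This is only a sketch using the native RegExp `u` flag rather than any XRegExp feature, and it assumes an engine with ES2018 lookbehind support. The names `WORD`, `unicodeWordBoundary` and `splitOnWordBoundaries` are made up for illustration, and treating letters, marks, numbers and underscore as "word" characters is an approximation of Ruby's `\w`, not an exact match.

```javascript
// Assumption: a "word" character is a letter, combining mark, number or underscore.
const WORD = "[\\p{L}\\p{M}\\p{N}_]";

// Zero-width match wherever a word character meets a non-word character
// (or the start/end of the string), mirroring what \b does for ASCII text.
const unicodeWordBoundary = new RegExp(
  `(?<=${WORD})(?!${WORD})|(?<!${WORD})(?=${WORD})`,
  "u"
);

// Hypothetical helper: split a string at every Unicode-aware word boundary.
function splitOnWordBoundaries(text) {
  return text.split(unicodeWordBoundary);
}

const english = splitOnWordBoundaries("hello there this is a test");
const russian = splitOnWordBoundaries(
  "Сняли не первый раз изначальную и конечную сумму и начальную не вернули !!!"
);

console.log(english);
console.log(russian);
```

With this pattern the Russian sentence should split into words and separators the same way the Ruby example does, with the trailing " !!!" kept as a single element. It does not address the request itself, which is for `\b` in XRegExp to behave this way.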
