Unicode handling has been broken #36

@nmtake

Describe the bug

RedditExtractoR returns a broken comment body if the comment contains non-ASCII characters.

To Reproduce

$ R_LIBS_USER=lib R
> devtools::load_all()
ℹ Loading RedditExtractoR
> options(HTTPUserAgent = 'API Test (by /u/nmtake)')
> thread = get_thread_content('https://old.reddit.com/r/translator/comments/10wr3xg/')
> thread$comments$comment
[1] "ö\017f\n&Ø_\u008c"

Expected behavior

> thread = get_thread_content('https://old.reddit.com/r/translator/comments/10wr3xg/')
> thread$comments$comment
[1] "零式艦上戦闘機二一型 Type zero carrier fighter model 21\n\nhttps://en.wikipedia.org/wiki/Mitsubishi_A6M_Zero"

Desktop (please complete the following information):

  • Linux fedora 6.1.7-200.fc37.x86_64
  • RedditExtractoR commit ecb9a86
  • R 4.2.2
  • RJSONIO 1.3-1.8

Additional context

Here are the details. I tried to get this comment, which contains non-ASCII characters:

零式艦上戦闘機二一型 Type zero carrier fighter model 21

https://en.wikipedia.org/wiki/Mitsubishi_A6M_Zero

but RedditExtractoR returns a broken comment:

$ R_LIBS_USER=lib R
> devtools::load_all()
ℹ Loading RedditExtractoR
> options(HTTPUserAgent = 'API Test (by /u/nmtake)')
> thread = get_thread_content('https://old.reddit.com/r/translator/comments/10wr3xg/')
> thread$comments$comment
[1] "ö\017f\n&Ø_\u008c"
[...]

This is because Reddit's JSON escapes non-ASCII characters as \uXXXX sequences,

$ curl -A 'API Test (by /u/nmtake)' 'https://old.reddit.com/r/translator/comments/10wr3xg/.json' > japanese.json
$ cat japanese.json
[...]
"body": "\u96f6\u5f0f\u8266\u4e0a\u6226\u95d8\u6a5f\u4e8c\u4e00\u578b Type zero carrier fighter model 21\n\nhttps://en.wikipedia.org/wiki/Mitsubishi_A6M_Zero",

and RJSONIO doesn't seem to handle such Unicode escapes:

> ret = RJSONIO::fromJSON('["\\u96f6\\u5f0f\\u8266\\u4e0a\\u6226\\u95d8\\u6a5f\\u4e8c\\u4e00\\u578b Type zero carrier fighter model 21\n\nhttps://en.wikipedia.org/wiki/Mitsubishi_A6M_Zero"]', asText = TRUE)
> ret
[1] "\xf6\017f\n&\xd8_\x8c"
> iconv(ret, 'latin1', 'UTF-8')  # reproduce the original broken text
[1] "ö\017f\n&Ø_\u008c"

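Please note that everything after \\u4e00 is dropped, and that each byte that does survive is the low byte of the corresponding escape (f6 from \\u96f6, 0f from \\u5f0f, and so on).
I suspect RJSONIO keeps only the low byte of each escape and then treats the 00 from \\u4e00 as ASCII NUL (the C string terminator).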
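
The suspicion can be checked without RJSONIO at all. The following is a minimal sketch in base R that only mimics the suspected behaviour (keep the low byte of each code point, stop at the first zero byte); it is not RJSONIO's actual code, and the variable names are made up for illustration.

```r
# Illustration of the suspected truncation, NOT RJSONIO's implementation.
original <- "\u96f6\u5f0f\u8266\u4e0a\u6226\u95d8\u6a5f\u4e8c\u4e00\u578b Type zero"

# Unicode code points of the original text
code_points <- utf8ToInt(original)

# Keep only the low 8 bits of every code point
low_bytes <- bitwAnd(code_points, 0xFFL)

# A C string ends at the first zero byte, so drop it and everything after it
first_nul <- which(low_bytes == 0L)[1]
kept <- low_bytes[seq_len(first_nul - 1L)]

sprintf("%02x", kept)
# "f6" "0f" "66" "0a" "26" "d8" "5f" "8c"
```

These eight bytes are the same ones RJSONIO returned above ("\xf6\017f\n&\xd8_\x8c"), which is consistent with the low-byte and NUL-terminator explanation, though it does not prove it.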


FYI, with jsonlite::fromJSON(simplifyVector = FALSE), we get the correct text:

> jsonlite::fromJSON('["\\u96f6\\u5f0f\\u8266\\u4e0a\\u6226\\u95d8\\u6a5f\\u4e8c\\u4e00\\u578b Type zero carrier fighter model 21\\n\\nhttps://en.wikipedia.org/wiki/Mitsubishi_A6M_Zero"]', simplifyVector = FALSE)
[1] "零式艦上戦闘機二一型 Type zero carrier fighter model 21\n\nhttps://en.wikipedia.org/wiki/Mitsubishi_A6M_Zero"
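
As a parser-independent cross-check of the expected text, the escapes can also be decoded with base R alone. This is only a rough sketch: decode_u_escapes is a hypothetical helper (it is not part of RedditExtractoR, RJSONIO, or jsonlite), and it assumes the input contains only four-digit \uXXXX escapes for Basic Multilingual Plane characters.

```r
# Hypothetical helper for checking results; not a replacement for a JSON parser.
# Surrogate pairs and other JSON escapes such as \n are deliberately not handled.
decode_u_escapes <- function(x) {
  # Locate every literal \uXXXX sequence
  m <- gregexpr("\\\\u[0-9a-fA-F]{4}", x)
  # Replace each sequence with the character for its hexadecimal code point
  regmatches(x, m) <- lapply(regmatches(x, m), function(esc) {
    intToUtf8(strtoi(substring(esc, 3), base = 16L), multiple = TRUE)
  })
  x
}

escaped <- "\\u96f6\\u5f0f\\u8266\\u4e0a\\u6226\\u95d8\\u6a5f\\u4e8c\\u4e00\\u578b Type zero"
decoded <- decode_u_escapes(escaped)
print(decoded)
```

Comparing utf8ToInt(decoded) with the code points in the raw JSON gives a quick way to confirm that a parser's output is intact, including the character after \\u4e00.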
