Unicode handling has been broken #36

**Describe the bug**

RedditExtractoR returns a garbled comment body when the comment contains non-ASCII characters.

**To Reproduce**

```
$ R_LIBS_USER=lib R
> devtools::load_all()
ℹ Loading RedditExtractoR
> options(HTTPUserAgent = 'API Test (by /u/nmtake)')
> thread = get_thread_content('https://old.reddit.com/r/translator/comments/10wr3xg/')
> thread$comments$comment
[1] "ö\017f\n&Ø_\u008c"
```

**Expected behavior**

```
> thread = get_thread_content('https://old.reddit.com/r/translator/comments/10wr3xg/')
> thread$comments$comment
[1] "零式艦上戦闘機二一型 Type zero carrier fighter model 21\n\nhttps://en.wikipedia.org/wiki/Mitsubishi_A6M_Zero"
```

**Desktop (please complete the following information):**

* Linux fedora 6.1.7-200.fc37.x86_64
* RedditExtractoR commit ecb9a86e
* R 4.2.2
* RJSONIO 1.3-1.8

**Additional context**

Here are the details. I tried to fetch [this comment](https://old.reddit.com/r/translator/comments/10wr3xg/_/j7omapc/), which contains non-ASCII characters:

```
零式艦上戦闘機二一型 Type zero carrier fighter model 21

https://en.wikipedia.org/wiki/Mitsubishi_A6M_Zero
```

but RedditExtractoR returns a broken comment:

```
$ R_LIBS_USER=lib R
> devtools::load_all()
ℹ Loading RedditExtractoR
> options(HTTPUserAgent = 'API Test (by /u/nmtake)')
> thread = get_thread_content('https://old.reddit.com/r/translator/comments/10wr3xg/')
> thread$comments$comment
[1] "ö\017f\n&Ø_\u008c"
[...]
```
This happens because Reddit's JSON escapes non-ASCII characters,

```
$ curl -A 'API Test (by /u/nmtake)' 'https://old.reddit.com/r/translator/comments/10wr3xg/.json' > japanese.json
$ cat japanese.json
[...]
"body": "\u96f6\u5f0f\u8266\u4e0a\u6226\u95d8\u6a5f\u4e8c\u4e00\u578b Type zero carrier fighter model 21\n\nhttps://en.wikipedia.org/wiki/Mitsubishi_A6M_Zero",
```

and RJSONIO doesn't seem to be able to handle such Unicode escapes:

```
> ret = RJSONIO::fromJSON('["\\u96f6\\u5f0f\\u8266\\u4e0a\\u6226\\u95d8\\u6a5f\\u4e8c\\u4e00\\u578b Type zero carrier fighter model 21\n\nhttps://en.wikipedia.org/wiki/Mitsubishi_A6M_Zero"]', asText = TRUE)
> ret
[1] "\xf6\017f\n&\xd8_\x8c"
> iconv(ret, 'latin1', 'UTF-8')  # reproduce the original broken text
[1] "ö\017f\n&Ø_\u008c"
```

Please note that everything from `\\u4e00` onward is dropped.
I suspect RJSONIO treats the `00` byte as ASCII NUL (the C string terminator).
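
The suspicion above can be checked against the output without looking at RJSONIO's source. The following is only a sketch of a hypothesis, not a description of what RJSONIO actually does internally: it assumes each `\uXXXX` escape is reduced to its low byte and that the resulting C string ends at the first zero byte. The variable names are made up for illustration.

```r
# Code points copied from the escaped "body" string in the curl output.
code_points <- c(0x96f6, 0x5f0f, 0x8266, 0x4e0a, 0x6226,
                 0x95d8, 0x6a5f, 0x4e8c, 0x4e00, 0x578b)

# Hypothesis (an assumption): only the low byte of each code point is kept.
low_bytes <- bitwAnd(as.integer(code_points), 0xffL)

# A C string ends at the first zero byte, so nothing from there on survives.
first_nul <- which(low_bytes == 0L)[1]
surviving <- as.raw(low_bytes[seq_len(first_nul - 1L)])

# Bytes of the string that RJSONIO returned in the session above.
observed <- charToRaw("\xf6\017f\n&\xd8_\x8c")

print(first_nul)                       # position of the code point with low byte 0x00
print(identical(surviving, observed))  # TRUE if the hypothesis matches the output
```

If the comparison holds, the broken text is exactly the low bytes of the first eight code points, and the ninth (`\u4e00`) is where the string gets cut off. That fits the NUL-terminator explanation, though it does not prove it.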

---

FYI, with `jsonlite::fromJSON(simplifyVector = FALSE)`, we get the correct text:

```
> jsonlite::fromJSON('["\\u96f6\\u5f0f\\u8266\\u4e0a\\u6226\\u95d8\\u6a5f\\u4e8c\\u4e00\\u578b Type zero carrier fighter model 21\\n\\nhttps://en.wikipedia.org/wiki/Mitsubishi_A6M_Zero"]', simplifyVector = FALSE)
[1] "零式艦上戦闘機二一型 Type zero carrier fighter model 21\n\nhttps://en.wikipedia.org/wiki/Mitsubishi_A6M_Zero"
```





