"UTF8" to "CESU-8" mapping

In one our project multiple applications are sharing a single large Oracle 19c DB - this includes both Ruby and non-Ruby apps.
Recently, we were faced with a curious bug where emoji character persisted in DB from non-Ruby application would be unprocessable (with `"source sequence is illegal/malformed utf-8"` error) when read back in Ruby (on Rails) application.

After some debugging it was found that an emoji (🙈) that was saved from non-Ruby application is stored in DB as raw bytes `EDA0BDEDB988`. The same emoji when saved from Ruby application was written to DB instead as `F09F9988`.
Per helpful [table](https://www.fileformat.info/info/unicode/char/1f648/charset_support.htm) it was then understood that Ruby application writes data in "UTF-8" encoding, but non-Ruby apps are using a "CESU-8" encoding.

We took a look (`SELECT * FROM nls_database_parameters`) afterwards at a DB configuration of "CHARACTERSET" values:

|Parameter|Value|
|----------|------|
|NLS_NCHAR_CHARACTERSET|AL16UTF16|
|NLS_CHARACTERSET|UTF8|

Obviously, in Oracle's infinite knowledge the "UTF8"  here [doesn't mean "UTF-8", but rather a weird "CESU-8" encoding form](https://docs.oracle.com/en/database/oracle/oracle-database/19/nlspg/supporting-multilingual-databases-with-unicode.html#GUID-CD422E4F-C5C6-4E22-B95F-CA9CABBCB543).

Which lead to a curious discovery that our Ruby application is likely using an incorrect connection encoding at OCI8 level. Per our understanding and tests, the `OCI8` gem assigns connection's encoding in ["encoding-init.rb"](https://github.com/kubo/ruby-oci8/blob/master/lib/oci8/encoding-init.rb) bootup code. If we modified/used an initializer of our own that forced `OCI8.encoding = Encoding::CESU_8`, we could adequately read the original "CESU-8" encoded form of characters and new persistance from Ruby side would also follow the "CESU-8" conventions.

In summary, I would like to know, if this line

https://github.com/kubo/ruby-oci8/blob/959482365e18d7f25796de39ebb3908cd674b5e5/lib/oci8/encoding.yml#L388

shouldn't rather read as

```diff
--- UTF8:              UTF-8
+++ UTF8:              [CESU-8, UTF-8]
```

Ruby 2.7 [added](https://bugs.ruby-lang.org/issues/15931) a "CESU-8" encoding, albeit with a slightest [caveat](https://bugs.ruby-lang.org/issues/15995#note-2) concerning handling of `NULL` characters.

```
$ ruby -e "p RUBY_VERSION; p Encoding.find('CESU-8')"
"2.7.6"
#<Encoding:CESU-8>
```

Information on Ruby, "CESU-8" and Oracle DB is very sparse, so any discussion or suggestion would be helpful!



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

"UTF8" to "CESU-8" mapping #258

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Parameter	Value
NLS_NCHAR_CHARACTERSET	AL16UTF16
NLS_CHARACTERSET	UTF8

"UTF8" to "CESU-8" mapping #258

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions