-
Notifications
You must be signed in to change notification settings - Fork 76
Description
In one our project multiple applications are sharing a single large Oracle 19c DB - this includes both Ruby and non-Ruby apps.
Recently, we were faced with a curious bug where emoji character persisted in DB from non-Ruby application would be unprocessable (with "source sequence is illegal/malformed utf-8"
error) when read back in Ruby (on Rails) application.
After some debugging it was found that an emoji (🙈) that was saved from non-Ruby application is stored in DB as raw bytes EDA0BDEDB988
. The same emoji when saved from Ruby application was written to DB instead as F09F9988
.
Per helpful table it was then understood that Ruby application writes data in "UTF-8" encoding, but non-Ruby apps are using a "CESU-8" encoding.
We took a look (SELECT * FROM nls_database_parameters
) afterwards at a DB configuration of "CHARACTERSET" values:
Parameter | Value |
---|---|
NLS_NCHAR_CHARACTERSET | AL16UTF16 |
NLS_CHARACTERSET | UTF8 |
Obviously, in Oracle's infinite knowledge the "UTF8" here doesn't mean "UTF-8", but rather a weird "CESU-8" encoding form.
Which lead to a curious discovery that our Ruby application is likely using an incorrect connection encoding at OCI8 level. Per our understanding and tests, the OCI8
gem assigns connection's encoding in "encoding-init.rb" bootup code. If we modified/used an initializer of our own that forced OCI8.encoding = Encoding::CESU_8
, we could adequately read the original "CESU-8" encoded form of characters and new persistance from Ruby side would also follow the "CESU-8" conventions.
In summary, I would like to know, if this line
ruby-oci8/lib/oci8/encoding.yml
Line 388 in 9594823
UTF8: UTF-8 |
shouldn't rather read as
--- UTF8: UTF-8
+++ UTF8: [CESU-8, UTF-8]
Ruby 2.7 added a "CESU-8" encoding, albeit with a slightest caveat concerning handling of NULL
characters.
$ ruby -e "p RUBY_VERSION; p Encoding.find('CESU-8')"
"2.7.6"
#<Encoding:CESU-8>
Information on Ruby, "CESU-8" and Oracle DB is very sparse, so any discussion or suggestion would be helpful!