-
Notifications
You must be signed in to change notification settings - Fork 1.5k
feat: mapping sql Char/Text/String default to Utf8View #16290
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -1202,7 +1202,7 @@ pub fn string_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataT | |||
fn numeric_string_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> { | |||
use arrow::datatypes::DataType::*; | |||
match (lhs_type, rhs_type) { | |||
(Utf8 | LargeUtf8, other_type) | (other_type, Utf8 | LargeUtf8) | |||
(Utf8 | LargeUtf8 | Utf8View, other_type) | (other_type, Utf8 | LargeUtf8 | Utf8View) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change fix the bug, Utf8View with other numeric type can't coercion.
@@ -61,22 +61,12 @@ LOCATION '../core/tests/data/partitioned_table_arrow/' | |||
PARTITIONED BY (part); | |||
|
|||
# select wildcard | |||
query ITBI | |||
query error DataFusion error: Arrow error: External error: Arrow error: Invalid argument error: column types must match schema types, expected Utf8View but found Utf8 at column index 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @xudong963 @alamb , this is the only remaining issue, do you know why Arrow format do not support Utf8 with Utf8View? Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks to me like there is a mismatch between what is declared in the plan and what the actual types are. Are you able to figure out what the stack trace is that throws this error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @alamb , after investigate i believe i found the root cause:
We using fixed arrow file to test for sqllogictests, and this arrow field is writing with arrow-ipc utf8, so when we decode to read it's also loading utf8. But we default the field for sql to mapping to utf8view for this PR, so when we create the record batch we add checking default, it will failed.
Hi @alamb @xudong963 , i have fixed all the corner cases in this PR, it's ready for review! Thanks! |
datafusion/common/src/config.rs
Outdated
pub map_varchar_to_utf8view: bool, default = true | ||
|
||
/// If true, `CHAR` and `Text` and `String` is mapped to `Utf8View` during SQL planning. | ||
/// If false, `CHAR` and `Text` and `String` is mapped to `Utf8` during SQL planning. | ||
/// Default is true. | ||
pub map_char_to_utf8view: bool, default = true | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need the two configs? I mean, if we can directly use one.
Imagine that one is true, and another is false, does it make sense? Before we didn't have utf8view
, varchar,char, text, string all will be utf8, they're consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @xudong963 for review, this is true, we can use one config, maybe the name should be:
map_sql_string_to_utf8view for unifed all sql string to utf8view
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about:
/// If true, string types (VARCHAR, CHAR, Text, and String) are mapped to `Utf8View` during SQL planning.
/// If false, they are mapped to `Utf8`.
/// Default is true.
pub map_string_types_to_utf8view: bool, default = true
Also cc @alamb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about:
/// If true, string types (VARCHAR, CHAR, Text, and String) are mapped to `Utf8View` during SQL planning. /// If false, they are mapped to `Utf8`. /// Default is true. pub map_string_types_to_utf8view: bool, default = true
Also cc @alamb
Thanks @xudong963 looks good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in latest PR, thanks!
Some tests are failing |
Fixed in latest PR, thank you @xudong963 ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @zhuqi-lucas and @xudong963
/// If true, string types (VARCHAR, CHAR, Text, and String) are mapped to `Utf8View` during SQL planning. | ||
/// If false, they are mapped to `Utf8`. | ||
/// Default is true. | ||
pub map_string_types_to_utf8view: bool, default = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is an API change, so perhaps we can add a note about this in the upgrade guide (that we changed the name of the config setting)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @alamb for review, i agree we need to add this to the upgrade guide for next release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's file a ticket to record it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created the ticket:
@@ -6082,7 +6082,7 @@ physical_plan | |||
04)------AggregateExec: mode=Partial, gby=[], aggr=[count(Int64(1))] | |||
05)--------ProjectionExec: expr=[] | |||
06)----------CoalesceBatchesExec: target_batch_size=8192 | |||
07)------------FilterExec: substr(md5(CAST(value@0 AS Utf8)), 1, 32) IN ([Literal { value: Utf8View("7f4b18de3cfeb9b4ac78c381ee2ad278"), field: Field { name: "7f4b18de3cfeb9b4ac78c381ee2ad278", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("a"), field: Field { name: "a", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("b"), field: Field { name: "b", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("c"), field: Field { name: "c", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }]) | |||
07)------------FilterExec: substr(CAST(md5(CAST(value@0 AS Utf8View)) AS Utf8View), 1, 32) IN ([Literal { value: Utf8View("7f4b18de3cfeb9b4ac78c381ee2ad278"), field: Field { name: "7f4b18de3cfeb9b4ac78c381ee2ad278", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("a"), field: Field { name: "a", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("b"), field: Field { name: "b", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("c"), field: Field { name: "c", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 this looks like it may have added extra casts I wonder if it because md5
doesn't support StringView
natively 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, thank you @alamb , i support it now in latest PR.
Which issue does this PR close?
Rationale for this change
Similar to
#15096
But most work we have done, so it's easier for us to finish this ticket.
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?