Skip to content

feat: mapping sql Char/Text/String default to Utf8View #16290

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Jun 17, 2025

Conversation

zhuqi-lucas
Copy link
Contributor

@zhuqi-lucas zhuqi-lucas commented Jun 6, 2025

Which issue does this PR close?

Rationale for this change

Similar to

#15096

But most work we have done, so it's easier for us to finish this ticket.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@zhuqi-lucas zhuqi-lucas marked this pull request as draft June 6, 2025 09:39
@github-actions github-actions bot added sql SQL Planner sqllogictest SQL Logic Tests (.slt) labels Jun 6, 2025
@github-actions github-actions bot added core Core DataFusion crate physical-plan Changes to the physical-plan crate labels Jun 6, 2025
@zhuqi-lucas zhuqi-lucas marked this pull request as ready for review June 6, 2025 09:53
@zhuqi-lucas zhuqi-lucas marked this pull request as draft June 6, 2025 09:58
@github-actions github-actions bot added the logical-expr Logical plan and expressions label Jun 8, 2025
@@ -1202,7 +1202,7 @@ pub fn string_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataT
fn numeric_string_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> {
use arrow::datatypes::DataType::*;
match (lhs_type, rhs_type) {
(Utf8 | LargeUtf8, other_type) | (other_type, Utf8 | LargeUtf8)
(Utf8 | LargeUtf8 | Utf8View, other_type) | (other_type, Utf8 | LargeUtf8 | Utf8View)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change fix the bug, Utf8View with other numeric type can't coercion.

@zhuqi-lucas zhuqi-lucas marked this pull request as ready for review June 8, 2025 11:54
@@ -61,22 +61,12 @@ LOCATION '../core/tests/data/partitioned_table_arrow/'
PARTITIONED BY (part);

# select wildcard
query ITBI
query error DataFusion error: Arrow error: External error: Arrow error: Invalid argument error: column types must match schema types, expected Utf8View but found Utf8 at column index 1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @xudong963 @alamb , this is the only remaining issue, do you know why Arrow format do not support Utf8 with Utf8View? Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks to me like there is a mismatch between what is declared in the plan and what the actual types are. Are you able to figure out what the stack trace is that throws this error?

Copy link
Contributor Author

@zhuqi-lucas zhuqi-lucas Jun 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @alamb , after investigate i believe i found the root cause:

We using fixed arrow file to test for sqllogictests, and this arrow field is writing with arrow-ipc utf8, so when we decode to read it's also loading utf8. But we default the field for sql to mapping to utf8view for this PR, so when we create the record batch we add checking default, it will failed.

@github-actions github-actions bot added documentation Improvements or additions to documentation common Related to common crate labels Jun 11, 2025
@zhuqi-lucas
Copy link
Contributor Author

Hi @alamb @xudong963 , i have fixed all the corner cases in this PR, it's ready for review! Thanks!

Comment on lines 265 to 271
pub map_varchar_to_utf8view: bool, default = true

/// If true, `CHAR` and `Text` and `String` is mapped to `Utf8View` during SQL planning.
/// If false, `CHAR` and `Text` and `String` is mapped to `Utf8` during SQL planning.
/// Default is true.
pub map_char_to_utf8view: bool, default = true

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the two configs? I mean, if we can directly use one.

Imagine that one is true, and another is false, does it make sense? Before we didn't have utf8view, varchar,char, text, string all will be utf8, they're consistent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @xudong963 for review, this is true, we can use one config, maybe the name should be:

map_sql_string_to_utf8view for unifed all sql string to utf8view

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about:

/// If true, string types (VARCHAR, CHAR, Text, and String) are mapped to `Utf8View` during SQL planning.
/// If false, they are mapped to `Utf8`.
/// Default is true.
pub map_string_types_to_utf8view: bool, default = true

Also cc @alamb

Copy link
Contributor Author

@zhuqi-lucas zhuqi-lucas Jun 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about:

/// If true, string types (VARCHAR, CHAR, Text, and String) are mapped to `Utf8View` during SQL planning.
/// If false, they are mapped to `Utf8`.
/// Default is true.
pub map_string_types_to_utf8view: bool, default = true

Also cc @alamb

Thanks @xudong963 looks good to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in latest PR, thanks!

@xudong963
Copy link
Member

Some tests are failing

@zhuqi-lucas
Copy link
Contributor Author

Some tests are failing

Fixed in latest PR, thank you @xudong963 !

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @zhuqi-lucas and @xudong963

/// If true, string types (VARCHAR, CHAR, Text, and String) are mapped to `Utf8View` during SQL planning.
/// If false, they are mapped to `Utf8`.
/// Default is true.
pub map_string_types_to_utf8view: bool, default = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is an API change, so perhaps we can add a note about this in the upgrade guide (that we changed the name of the config setting)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @alamb for review, i agree we need to add this to the upgrade guide for next release.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's file a ticket to record it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created the ticket:

#16428

@@ -6082,7 +6082,7 @@ physical_plan
04)------AggregateExec: mode=Partial, gby=[], aggr=[count(Int64(1))]
05)--------ProjectionExec: expr=[]
06)----------CoalesceBatchesExec: target_batch_size=8192
07)------------FilterExec: substr(md5(CAST(value@0 AS Utf8)), 1, 32) IN ([Literal { value: Utf8View("7f4b18de3cfeb9b4ac78c381ee2ad278"), field: Field { name: "7f4b18de3cfeb9b4ac78c381ee2ad278", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("a"), field: Field { name: "a", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("b"), field: Field { name: "b", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("c"), field: Field { name: "c", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }])
07)------------FilterExec: substr(CAST(md5(CAST(value@0 AS Utf8View)) AS Utf8View), 1, 32) IN ([Literal { value: Utf8View("7f4b18de3cfeb9b4ac78c381ee2ad278"), field: Field { name: "7f4b18de3cfeb9b4ac78c381ee2ad278", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("a"), field: Field { name: "a", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("b"), field: Field { name: "b", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("c"), field: Field { name: "c", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 this looks like it may have added extra casts I wonder if it because md5 doesn't support StringView natively 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thank you @alamb , i support it now in latest PR.

@github-actions github-actions bot added the functions Changes to functions implementation label Jun 17, 2025
@xudong963 xudong963 merged commit 06631c2 into apache:main Jun 17, 2025
29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate core Core DataFusion crate documentation Improvements or additions to documentation functions Changes to functions implementation logical-expr Logical plan and expressions physical-plan Changes to the physical-plan crate sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Mapping Char/Text/String default to Utf8View
3 participants