[SPARK-52706][SQL] Fix inconsistencies and refactor primitive types in parser #51335
Conversation
@MaxGekk @cloud-fan Could you take a look at this PR? We should aim to keep our parser in the most readable and most efficient state.
primitiveType
    : primitiveTypeWithParameters
    | primitiveTypeWithoutParameters
    | unsupportedType=identifier (LEFT_PAREN INTEGER_VALUE (COMMA INTEGER_VALUE)* RIGHT_PAREN)?
What if we get an unsupportedType with a suffix, like TIME WITH TIME ZONE?
Yeah, this could potentially be a problem; I mean, we would probably return a bad error message. Let me think about whether we can scope issues like this as well.
Actually, the error message will stay the same; even previously we would return a syntax error. The only question here is whether we want to improve the error messages a bit further.
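To make the behavior under discussion concrete, here is a hedged sketch (illustrative names and messages, not Spark's actual parser code): a single unknown identifier matches the `unsupportedType=identifier` alternative and can get a targeted error, while a multi-word form like TIME WITH TIME ZONE never matches the rule and falls back to a generic syntax error.

```scala
// Hedged sketch of the discussed behavior, not Spark's actual code;
// the object name, messages, and token model are illustrative only.
object UnsupportedTypeSketch {
  private val supported = Set("INT", "STRING", "BINARY")

  def parsePrimitive(tokens: List[String]): Either[String, String] =
    tokens match {
      case name :: Nil if supported.contains(name) => Right(name)
      // A single unknown identifier matches `unsupportedType=identifier`,
      // so the builder can raise a targeted "unsupported type" error.
      case name :: Nil => Left(s"unsupported data type: $name")
      // A multi-word form such as TIME WITH TIME ZONE does not match the
      // rule at all, so the parser falls back to a plain syntax error.
      case _ => Left("syntax error")
    }
}
```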
@@ -1340,7 +1340,20 @@
collateClause
    : COLLATE collationName=multipartIdentifier
    ;

primitiveTypeWithParameters
    : STRING collateClause?
Should this be primitiveTypeWithoutParameters?
It could go there, but in this case I would say collation is a parameter as well: it can take a value not known at parsing time. If we follow that logic, then INTERVAL should probably go into primitiveTypeWithoutParameters, as it actually has no parameters.
By "not known at parsing time", I mean an identifier/arbitrary value.
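For context, a hedged sketch of what grouping length parameters and collation under one rule could look like (the rule names here are illustrative, not the exact ones in this PR); such a shape would admit forms like CHAR(5) COLLATE UTF8_BINARY, which the PR description mentions as impossible today:

```antlr
// Illustrative ANTLR fragment, not the exact rules from this PR.
primitiveTypeWithParameters
    : (CHARACTER | CHAR | VARCHAR)
        (LEFT_PAREN length=INTEGER_VALUE RIGHT_PAREN)? collateClause?
    | STRING collateClause?
    | (DECIMAL | DEC | NUMERIC)
        (LEFT_PAREN precision=INTEGER_VALUE (COMMA scale=INTEGER_VALUE)? RIGHT_PAREN)?
    ;
```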
    | BINARY
    | DECIMAL | DEC | NUMERIC
    | VOID
    | INTERVAL
Shall we move this one into nonTrivialPrimitiveType?
Moved both INTERVAL and TIMESTAMP, so that it is clear in the code that trivial types require no post-processing and only return a specific type.
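A minimal sketch of the split being described (illustrative names and types, not the actual DataTypeAstBuilder code): trivial primitive types map directly to a concrete type via a pure lookup, while non-trivial ones such as TIMESTAMP need post-processing that depends on extra state, e.g. a session configuration.

```scala
// Illustrative sketch only; Spark's real builder works on ANTLR contexts
// and real DataType instances. All names here are hypothetical.
object TypeDispatchSketch {
  sealed trait SketchType
  case object IntType extends SketchType
  case object BinaryType extends SketchType
  case class TimestampType(withTimeZone: Boolean) extends SketchType

  // Trivial types: a pure lookup, no post-processing required.
  val trivial: Map[String, SketchType] =
    Map("INT" -> IntType, "BINARY" -> BinaryType)

  // Non-trivial types: resolution depends on extra state (here, a flag
  // standing in for a config such as the default timestamp type).
  def resolve(name: String, ntzByDefault: Boolean): Option[SketchType] =
    trivial.get(name).orElse {
      if (name == "TIMESTAMP") Some(TimestampType(withTimeZone = !ntzByDefault))
      else None
    }
}
```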
sql/api/src/main/scala/org/apache/spark/sql/catalyst/parser/DataTypeAstBuilder.scala (resolved review threads, marked outdated)
@@ -165,6 +183,9 @@ class DataTypeAstBuilder extends SqlBaseParserBaseVisitor[AnyRef] {
   * Create a complex DataType. Arrays, Maps and Structures are supported.
   */
  override def visitComplexDataType(ctx: ComplexDataTypeContext): DataType = withOrigin(ctx) {
    if (ctx.LT() == null && ctx.NEQ() == null) {
      throw QueryParsingErrors.nestedTypeMissingElementTypeError(ctx.getText, ctx)
is this a new error?
No, this is refactoring that I did. Previously, if someone wrote only STRUCT/ARRAY/MAP without parameters, it would go down the primitive-type path. That is not good practice, so I made a change that isolates complex types in a separate context. The only change here is the error message when someone writes STRUCT(2): it would now return "unsupported primitive type" instead of "complex type missing element type". We could argue that this is a change, but if you ask me, we need to distinguish between primitive and complex types first, as is general practice in type theory. We have primitive types which can be used as leaf arguments in complex types, and we do not want to go into some recursive link between the two.
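A hedged sketch of the guard being described (simplified; the real code checks `ctx.LT()`/`ctx.NEQ()` on the ANTLR parse context, and the object and error text below are hypothetical): a complex-type keyword without an element list fails early with a dedicated error instead of being routed through the primitive-type path.

```scala
// Simplified model of the guard; the real implementation inspects
// ANTLR tokens and throws via QueryParsingErrors. Names are illustrative.
object ComplexTypeGuardSketch {
  def visitComplexType(keyword: String, hasElementList: Boolean): Either[String, String] =
    if (!hasElementList)
      // STRUCT / ARRAY / MAP with no <...> element list: fail here,
      // rather than falling through to the primitive-type rule.
      Left(s"missing element type for complex type $keyword")
    else
      Right(s"$keyword parsed with element types")
}
```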
thanks, merging to master!
…n parser

### What changes were proposed in this pull request?
This PR proposes a change in how our parser treats datatypes. We introduce types with/without parameters and group them accordingly.

### Why are the changes needed?
Changes are needed for many reasons:
1. The context of primitiveDataType is constantly getting bigger. This is not good practice, as we have many null fields which only take up memory.
2. We have inconsistencies in where we use each type. We get TIMESTAMP_NTZ in a separate rule, but we also mention it in primitive types.
3. Primitive types should stay related to primitive types; adding ARRAY, STRUCT, MAP to the rule just because it is convenient is not good practice.
4. The current structure does not allow extending types with different features. For example, we introduced STRING collations, but what if we were to introduce CHAR/VARCHAR with collations? The current structure gives us no way to make a type CHAR(5) COLLATE UTF8_BINARY (we can only do CHAR COLLATE UTF8_BINARY (5)).

### Does this PR introduce _any_ user-facing change?
No. This is internal refactoring.

### How was this patch tested?
All existing tests should pass; this is just code refactoring.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#51335 from mihailom-db/restructure-primitive.

Authored-by: Mihailo Milosevic <mihailo.milosevic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>