[SPARK-52706][SQL] Fix inconsistencies and refactor primitive types in parser #51335
Conversation
@MaxGekk @cloud-fan Could you take a look at this PR? We should aim to keep our parser in the most readable and most efficient state.
primitiveType
    : primitiveTypeWithParameters
    | primitiveTypeWithoutParameters
    | unsupportedType=identifier (LEFT_PAREN INTEGER_VALUE (COMMA INTEGER_VALUE)* RIGHT_PAREN)?
What if we get an unsupportedType with a suffix, like TIME WITH TIME ZONE?
Yeah, this could potentially be a problem; I mean, we would probably return a bad error message. Let me think about whether we can scope issues like this as well.
Actually, the error message will stay the same; even previously we would return a syntax error. The only question here is whether we want to improve the error messages a bit further.
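To make the behavior under discussion concrete, here is a hedged sketch (illustrative names and messages, not Spark's actual parser code): a single unknown identifier matches the `unsupportedType=identifier` alternative and can get a targeted error, while a multi-word form like TIME WITH TIME ZONE never matches the rule and falls back to a generic syntax error.

```scala
// Hedged sketch of the discussed behavior, not Spark's actual code;
// the object name, messages, and token model are illustrative only.
object UnsupportedTypeSketch {
  private val supported = Set("INT", "STRING", "BINARY")

  def parsePrimitive(tokens: List[String]): Either[String, String] =
    tokens match {
      case name :: Nil if supported.contains(name) => Right(name)
      // A single unknown identifier matches `unsupportedType=identifier`,
      // so the builder can raise a targeted "unsupported type" error.
      case name :: Nil => Left(s"unsupported data type: $name")
      // A multi-word form such as TIME WITH TIME ZONE does not match the
      // rule at all, so the parser falls back to a plain syntax error.
      case _ => Left("syntax error")
    }
}
```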
@@ -1340,7 +1340,20 @@
collateClause
    : COLLATE collationName=multipartIdentifier
    ;

primitiveTypeWithParameters
    : STRING collateClause?
Should this be primitiveTypeWithoutParameters?
It could go there, but in this case I would say collation is a parameter as well: it can take a value not known at parsing time. If we follow that logic, then INTERVAL should probably go into primitiveTypeWithoutParameters, as it actually has no parameters.
By "not known at parsing time", I mean an identifier/arbitrary value.
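For context, a hedged sketch of what grouping length parameters and collation under one rule could look like (the rule names here are illustrative, not the exact ones in this PR); such a shape would admit forms like CHAR(5) COLLATE UTF8_BINARY, which the PR description mentions as impossible today:

```antlr
// Illustrative ANTLR fragment, not the exact rules from this PR.
primitiveTypeWithParameters
    : (CHARACTER | CHAR | VARCHAR)
        (LEFT_PAREN length=INTEGER_VALUE RIGHT_PAREN)? collateClause?
    | STRING collateClause?
    | (DECIMAL | DEC | NUMERIC)
        (LEFT_PAREN precision=INTEGER_VALUE (COMMA scale=INTEGER_VALUE)? RIGHT_PAREN)?
    ;
```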
    | BINARY
    | DECIMAL | DEC | NUMERIC
    | VOID
    | INTERVAL
Shall we move this one into nonTrivialPrimitiveType?
Moved both INTERVAL and TIMESTAMP, so that it is clear in the code that trivial types require no post-processing and only return a specific type.
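A minimal sketch of the split being described (illustrative names and types, not the actual DataTypeAstBuilder code): trivial primitive types map directly to a concrete type via a pure lookup, while non-trivial ones such as TIMESTAMP need post-processing that depends on extra state, e.g. a session configuration.

```scala
// Illustrative sketch only; Spark's real builder works on ANTLR contexts
// and real DataType instances. All names here are hypothetical.
object TypeDispatchSketch {
  sealed trait SketchType
  case object IntType extends SketchType
  case object BinaryType extends SketchType
  case class TimestampType(withTimeZone: Boolean) extends SketchType

  // Trivial types: a pure lookup, no post-processing required.
  val trivial: Map[String, SketchType] =
    Map("INT" -> IntType, "BINARY" -> BinaryType)

  // Non-trivial types: resolution depends on extra state (here, a flag
  // standing in for a config such as the default timestamp type).
  def resolve(name: String, ntzByDefault: Boolean): Option[SketchType] =
    trivial.get(name).orElse {
      if (name == "TIMESTAMP") Some(TimestampType(withTimeZone = !ntzByDefault))
      else None
    }
}
```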
sql/api/src/main/scala/org/apache/spark/sql/catalyst/parser/DataTypeAstBuilder.scala (resolved review threads, marked outdated)
@@ -165,6 +183,9 @@ class DataTypeAstBuilder extends SqlBaseParserBaseVisitor[AnyRef] {
   * Create a complex DataType. Arrays, Maps and Structures are supported.
   */
  override def visitComplexDataType(ctx: ComplexDataTypeContext): DataType = withOrigin(ctx) {
    if (ctx.LT() == null && ctx.NEQ() == null) {
      throw QueryParsingErrors.nestedTypeMissingElementTypeError(ctx.getText, ctx)
is this a new error?
No, this is refactoring that I did. Previously, if someone wrote only STRUCT/ARRAY/MAP without parameters, it would go down the primitive-type path. That is not good practice, so I made a change that isolates complex types in a separate context. The only change here is the error message when someone writes STRUCT(2): it would now return "unsupported primitive type" instead of "complex type missing element type". We could argue that this is a change, but if you ask me, we need to distinguish between primitive and complex types first, as is general practice in type theory. We have primitive types which can be used as leaf arguments in complex types, and we do not want to go into some recursive link between the two.
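A hedged sketch of the guard being described (simplified; the real code checks `ctx.LT()`/`ctx.NEQ()` on the ANTLR parse context, and the object and error text below are hypothetical): a complex-type keyword without an element list fails early with a dedicated error instead of being routed through the primitive-type path.

```scala
// Simplified model of the guard; the real implementation inspects
// ANTLR tokens and throws via QueryParsingErrors. Names are illustrative.
object ComplexTypeGuardSketch {
  def visitComplexType(keyword: String, hasElementList: Boolean): Either[String, String] =
    if (!hasElementList)
      // STRUCT / ARRAY / MAP with no <...> element list: fail here,
      // rather than falling through to the primitive-type rule.
      Left(s"missing element type for complex type $keyword")
    else
      Right(s"$keyword parsed with element types")
}
```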
thanks, merging to master!
…n parser

### What changes were proposed in this pull request?
This PR proposes a change in how our parser treats datatypes. We introduce types with/without parameters and group them accordingly.

### Why are the changes needed?
Changes are needed for many reasons:
1. The context of primitiveDataType is constantly getting bigger. This is not good practice, as we have many null fields which only take up memory.
2. We have inconsistencies in where we use each type. We get TIMESTAMP_NTZ in a separate rule, but we also mention it in primitive types.
3. Primitive types should stay related to primitive types; adding ARRAY, STRUCT, MAP to the rule just because it is convenient is not good practice.
4. The current structure does not allow extending types with different features. For example, we introduced STRING collations, but what if we were to introduce CHAR/VARCHAR with collations? The current structure gives us no way to make a type CHAR(5) COLLATE UTF8_BINARY (we can only do CHAR COLLATE UTF8_BINARY (5)).

### Does this PR introduce _any_ user-facing change?
No. This is internal refactoring.

### How was this patch tested?
All existing tests should pass; this is just code refactoring.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#51335 from mihailom-db/restructure-primitive.

Authored-by: Mihailo Milosevic <mihailo.milosevic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>