-
Notifications
You must be signed in to change notification settings - Fork 79
fix(isthmus): concat of different string types #338
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@Test | ||
public void concatStringLiteralAndVarchar() throws IOException, SqlParseException { | ||
assertFullRoundTrip("select 'part_'||P_NAME from PART"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I look at this test I'm actually somewhat suprised that this doesn't throw an exception of some kind in the Substrait layer.
The POJO this produces has the following scalar function invocation
ScalarFunctionInvocation{
declaration=concat:vchar,
arguments=[
FixedCharLiteral{nullable=false, value=part_},
FieldReference{
segments=[StructField{offset=1}],
type=VarChar{nullable=true, length=55}}
]
, options=[],
outputType=VarChar{nullable=true, length=60}
}
As I understand it, this is an invalid invocation of concat:vchar
from functions_string.yaml
because all arguments must be varchars. Substrait, intentionally, doesn't have any form of implicit casting to allow calls to this function that mix string types.
While SQL may allow us to concat
different string types together, if we want to encode this into valid Substrait we'll need to upcast the arguments to the same type when translating into Substrait.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the feedback. The code changed is used by function matching code for cases where the types don't exactly match. Since the change didn't break any tests I (wrongly!) assumed it was successfully fixing the problem. I will dig more into the matching to find out if any type coercion should be happening and, depending on the result, either fix it or implement it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The type system change you added definitely helped.
ScalarFunctionInvocation{
declaration=concat:vchar,
arguments=[
Cast{type=VarChar{nullable=true, length=55}, input=FixedCharLiteral{nullable=false, value=part_}, failureBehavior=THROW_EXCEPTION},
FieldReference{segments=[StructField{offset=1}], type=VarChar{nullable=true, length=55}}
],
options=[],
outputType=VarChar{nullable=true, length=60}
}
Both inputs are now varchars 🎉
The output schema doesn't quite match Substrait still, which would expect VarChar<55>
instead of VarChar<60>
.
That's happening because the Calcite definition of CONCAT
uses ReturnTypes.DYADIC_STRING_SUM_PRECISION_NULLABLE for its return type inference which ends up doing the following
Type-inference strategy for concatenating two string arguments. The result type of a call is:
- the same type as the input types but with the combined length of the two first types
- if types are of char type the type with the highest coercibility will be used
- result is varying if either input is; otherwise fixed
Pre-requisites:
- input types must be of the same string type
- types must be comparable without casting
I think we can tackle that as part of #379
Marking as draft until I can produce a better solution to the problems highlighted. I will mark for review again when ready. |
@vbarua Do you have any guidance on the best testing approach? The existing and added tests did not catch the not-quite-correct change I made previously, and it would be good for those tests to give better confidence that the conversion is working correctly. |
@vbarua I need some help untangling casting of arguments to the types required by a Substrait function. Given there is no exact match for the function signature, I can find a function to which I think I can coerce the supplied call. To do the coercion for operands that don't match the argument type required by the function, I need to create a cast. This requires the target How do I get the |
I don't have a specific approach in mind. It will likely requiere a combination of things. One thing that would be good to investigate would be adding further validation and/or validators for plans generally to catch issues like that.
Unfortunately I'm not too familiar with that machinary. Someone in the Slack might know more. Additionally, I wanted to add that adding cast is one way to solve this problem. There might be others, but it's what came to my mind first at least. |
@vbarua In the current (draft) iteration of the code I did some horrible, horrible ParameterizedType to Type mapping using visitors. This allows casts to be constructed for different string types, which at first glance look plausible. For example, this SQL: select 'store' || s_store_name from store generates Substrait including this (redacted) mapping: {
"extensionUris": [{
"extensionUriAnchor": 1,
"uri": "/functions_string.yaml"
}],
"extensions": [{
"extensionFunction": {
"extensionUriReference": 1,
"name": "concat:vchar"
}
}],
"relations": [{
"root": {
"input": {
"project": {
[...]
"expressions": [{
"scalarFunction": {
"outputType": {
"varchar": {
"length": 55,
"nullability": "NULLABILITY_REQUIRED"
}
},
"arguments": [{
"value": {
"cast": {
"type": {
"varchar": {
"length": 5,
"nullability": "NULLABILITY_REQUIRED"
}
},
"input": {
"literal": {
"fixedChar": "store"
}
},
"failureBehavior": "FAILURE_BEHAVIOR_THROW_EXCEPTION"
}
}
}, {
"value": {
"selection": {
"directReference": {
"structField": {
"field": 5
}
},
"rootReference": {
}
}
}
}]
}
}]
}
},
"names": ["EXPR$0"]
}
}]
} Unfortunately not all cases round-trip, and it breaks a few other test cases. If this looks to you like it is heading the right direction, I can try to get the failing test cases working again. What do you think? |
That does seem reasonable to me. One other thing to note is that
As I understand the spec currently, if the output type of your
Another way to approach this might be to upcast the arguments in Calcite before mapping into Substrait, though I don't actually know if there is anything for this within Calcite. |
d5f059f
to
519cd6f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just noticed that this was ready for re-review. Took a preliminary look at everything except the FunctionConverter, for which I need fresh eyes.
@Test | ||
public void concatStringLiteralAndChar() throws Exception { | ||
assertProtoPlanRoundrip("select 'brand_'||P_BRAND from PART"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: Could you move these tests to the (new) FunctionConversionTest.java class.
I'm trying to group tests related to specific functions there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@Test | ||
public void concatStringLiteralAndVarchar() throws IOException, SqlParseException { | ||
assertFullRoundTrip("select 'part_'||P_NAME from PART"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The type system change you added definitely helped.
ScalarFunctionInvocation{
declaration=concat:vchar,
arguments=[
Cast{type=VarChar{nullable=true, length=55}, input=FixedCharLiteral{nullable=false, value=part_}, failureBehavior=THROW_EXCEPTION},
FieldReference{segments=[StructField{offset=1}], type=VarChar{nullable=true, length=55}}
],
options=[],
outputType=VarChar{nullable=true, length=60}
}
Both inputs are now varchars 🎉
The output schema doesn't quite match Substrait still, which would expect VarChar<55>
instead of VarChar<60>
.
That's happening because the Calcite definition of CONCAT
uses ReturnTypes.DYADIC_STRING_SUM_PRECISION_NULLABLE for its return type inference which ends up doing the following
Type-inference strategy for concatenating two string arguments. The result type of a call is:
- the same type as the input types but with the combined length of the two first types
- if types are of char type the type with the highest coercibility will be used
- result is varying if either input is; otherwise fixed
Pre-requisites:
- input types must be of the same string type
- types must be comparable without casting
I think we can tackle that as part of #379
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this looks reasonable to me.
@bestbeforetoday could you confirm for me that there are no functionality changes in FunctionConverter.java
? From what I reviewed, it looks like there are only cleanups and code simplifications.
Additionally, could you move the tests in ExpressionConvertabilityTest.java as per https://github.com/substrait-io/substrait-java/pull/338/files#r2038494522 and fix the conflicts.
Once those things are done, I'm happy to merge this.
return Optional.of(function); | ||
} | ||
private Optional<F> signatureMatch(List<Type> inputTypes, Type outputType) { | ||
for (F function : functions) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
REVIEW NOTES: This works because operator
was never used, and getSignatureMathcher
was only ever called with the functions
attached the FunctionFinder
class.
outputType)); | ||
} | ||
return Optional.empty(); | ||
var out = singularInputType.orElseThrow().tryMatch(type, outputType); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
REVIEW NOTE: get()
has the same behaviour as orElseThrow()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. No functional change by using orElseThrow()
. The Java API documentation for get()
says that orElseThrow()
is preferred. I guess just for consistency with the other orElse...
methods.
call.getOperands(), | ||
(a, b) -> { | ||
Type type = typeConverter.toSubstrait(b.getType()); | ||
return coerceArgument(a, type); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nice cleanup re-using operandTypes
here
} | ||
|
||
protected abstract T generateBinding( | ||
C call, F function, List<FunctionArg> arguments, Type outputType); | ||
C call, F function, List<? extends FunctionArg> arguments, Type outputType); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Good call making arguments
allow any FunctionArg. Make some of the above code nicer.
return Optional.empty(); | ||
}; | ||
}; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
REVIEW NOTE: this code is unused
return true; | ||
} | ||
return inputType.accept(new IgnoreNullableAndParameters(type)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
REVIEW NOTE: this code is redundant with isMatch(ParameterizedType actualType, ParameterizedType targetType)
below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Marking as waiting for changes.
The type matching used when mapping Calcite to Substrait expressions required parameters and return type of the concat function to be either all varchar or all string. Any mixing of types caused a failure. Signed-off-by: Mark S. Lewis <Mark.S.Lewis@outlook.com>
@vbarua There are no longer any functionality changes to I have rebased on the latest main branch state and moved unit tests to the new |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look good, I will merge this later today.
@bestbeforetoday thanks for working on this. In the future, it would be helpful to call out that changes are mostly refactors and/or split them into their own PRs. The FunctionConverter code is pretty hairy, even for me, so I kept trying to find a time when I was mentally fresh to review them, but it turned out they were easy to reviews and I could have done it anytime really. |
@vbarua Are you able to merge this change so I can safely start additional changes to improve TPC-DS compatibility on top of it? |
@bestbeforetoday thanks for the reminder. Merged. |
The type matching used when mapping Calcite to Substrait expressions required parameters and return type of the concat function to be either all varchar or all string. Any mixing of types caused a failure.
This change relaxes the type matching to allow string types (varchar, char and string literal) to be used interchangeably.
Closes #69
Contributes to #105