Skip to content

fix(isthmus): concat of different string types #338

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 23, 2025

Conversation

bestbeforetoday
Copy link
Contributor

@bestbeforetoday bestbeforetoday commented Mar 5, 2025

The type matching used when mapping Calcite to Substrait expressions required parameters and return type of the concat function to be either all varchar or all string. Any mixing of types caused a failure.

This change relaxes the type matching to allow string types (varchar, char and string literal) to be used interchangeably.

Closes #69
Contributes to #105

@bestbeforetoday bestbeforetoday marked this pull request as ready for review March 5, 2025 15:45
@Test
public void concatStringLiteralAndVarchar() throws IOException, SqlParseException {
assertFullRoundTrip("select 'part_'||P_NAME from PART");
}
Copy link
Member

@vbarua vbarua Mar 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I look at this test I'm actually somewhat suprised that this doesn't throw an exception of some kind in the Substrait layer.

The POJO this produces has the following scalar function invocation

ScalarFunctionInvocation{
    declaration=concat:vchar,
    arguments=[
        FixedCharLiteral{nullable=false, value=part_},
        FieldReference{
            segments=[StructField{offset=1}],
            type=VarChar{nullable=true, length=55}}
    ]
    , options=[],
    outputType=VarChar{nullable=true, length=60}
}

As I understand it, this is an invalid invocation of concat:vchar from functions_string.yaml
because all arguments must be varchars. Substrait, intentionally, doesn't have any form of implicit casting to allow calls to this function that mix string types.

While SQL may allow us to concat different string types together, if we want to encode this into valid Substrait we'll need to upcast the arguments to the same type when translating into Substrait.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the feedback. The code changed is used by function matching code for cases where the types don't exactly match. Since the change didn't break any tests I (wrongly!) assumed it was successfully fixing the problem. I will dig more into the matching to find out if any type coercion should be happening and, depending on the result, either fix it or implement it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type system change you added definitely helped.

ScalarFunctionInvocation{
    declaration=concat:vchar,
    arguments=[
        Cast{type=VarChar{nullable=true, length=55}, input=FixedCharLiteral{nullable=false, value=part_}, failureBehavior=THROW_EXCEPTION},
        FieldReference{segments=[StructField{offset=1}], type=VarChar{nullable=true, length=55}}
    ],
    options=[],
    outputType=VarChar{nullable=true, length=60}
}

Both inputs are now varchars 🎉

The output schema doesn't quite match Substrait still, which would expect VarChar<55> instead of VarChar<60>.

That's happening because the Calcite definition of CONCAT
uses ReturnTypes.DYADIC_STRING_SUM_PRECISION_NULLABLE for its return type inference which ends up doing the following

Type-inference strategy for concatenating two string arguments. The result type of a call is:

  • the same type as the input types but with the combined length of the two first types
  • if types are of char type the type with the highest coercibility will be used
  • result is varying if either input is; otherwise fixed

Pre-requisites:

  • input types must be of the same string type
  • types must be comparable without casting

I think we can tackle that as part of #379

@bestbeforetoday
Copy link
Contributor Author

Marking as draft until I can produce a better solution to the problems highlighted. I will mark for review again when ready.

@bestbeforetoday bestbeforetoday marked this pull request as draft March 6, 2025 10:33
@bestbeforetoday
Copy link
Contributor Author

@vbarua Do you have any guidance on the best testing approach? The existing and added tests did not catch the not-quite-correct change I made previously, and it would be good for those tests to give better confidence that the conversion is working correctly.

@bestbeforetoday
Copy link
Contributor Author

@vbarua I need some help untangling casting of arguments to the types required by a Substrait function. Given there is no exact match for the function signature, I can find a function to which I think I can coerce the supplied call. To do the coercion for operands that don't match the argument type required by the function, I need to create a cast. This requires the target Type. The Function can only give me the ParameterizedType of the arguments. Confusingly, ParameterizedType is a superclass of Type, and the argument "types" I get from the Function are not instances of Type.

How do I get the Type values for a given Function arguments?

@vbarua
Copy link
Member

vbarua commented Mar 20, 2025

Do you have any guidance on the best testing approach? The existing and added tests did not catch the not-quite-correct change I made previously, and it would be good for those tests to give better confidence that the conversion is working correctly.

I don't have a specific approach in mind. It will likely requiere a combination of things. One thing that would be good to investigate would be adding further validation and/or validators for plans generally to catch issues like that.

How do I get the Type values for a given Function arguments?

Unfortunately I'm not too familiar with that machinary. Someone in the Slack might know more.

Additionally, I wanted to add that adding cast is one way to solve this problem. There might be others, but it's what came to my mind first at least.

@bestbeforetoday
Copy link
Contributor Author

@vbarua In the current (draft) iteration of the code I did some horrible, horrible ParameterizedType to Type mapping using visitors. This allows casts to be constructed for different string types, which at first glance look plausible. For example, this SQL:

select 'store' || s_store_name from store

generates Substrait including this (redacted) mapping:

{
  "extensionUris": [{
    "extensionUriAnchor": 1,
    "uri": "/functions_string.yaml"
  }],
  "extensions": [{
    "extensionFunction": {
      "extensionUriReference": 1,
      "name": "concat:vchar"
    }
  }],
  "relations": [{
    "root": {
      "input": {
        "project": {
          [...]
          "expressions": [{
            "scalarFunction": {
              "outputType": {
                "varchar": {
                  "length": 55,
                  "nullability": "NULLABILITY_REQUIRED"
                }
              },
              "arguments": [{
                "value": {
                  "cast": {
                    "type": {
                      "varchar": {
                        "length": 5,
                        "nullability": "NULLABILITY_REQUIRED"
                      }
                    },
                    "input": {
                      "literal": {
                        "fixedChar": "store"
                      }
                    },
                    "failureBehavior": "FAILURE_BEHAVIOR_THROW_EXCEPTION"
                  }
                }
              }, {
                "value": {
                  "selection": {
                    "directReference": {
                      "structField": {
                        "field": 5
                      }
                    },
                    "rootReference": {
                    }
                  }
                }
              }]
            }
          }]
        }
      },
      "names": ["EXPR$0"]
    }
  }]
}

Unfortunately not all cases round-trip, and it breaks a few other test cases. If this looks to you like it is heading the right direction, I can try to get the failing test cases working again. What do you think?

@vbarua
Copy link
Member

vbarua commented Mar 20, 2025

That does seem reasonable to me. One other thing to note is that concat:vchar is defined with a length param L1:

    impls:
      - args:
          - value: "varchar<L1>"
            name: "input"
        variadic:
          min: 1
        options:
          null_handling:
            values: [ IGNORE_NULLS, ACCEPT_NULLS ]
        return: "varchar<L1>"

As I understand the spec currently, if the output type of your concat:vchar declaration is varchar<55>, then all inputs must also be varchar<55>.

I did some horrible, horrible ParameterizedType to Type mapping using visitors.

Another way to approach this might be to upcast the arguments in Calcite before mapping into Substrait, though I don't actually know if there is anything for this within Calcite.

@bestbeforetoday bestbeforetoday force-pushed the concat branch 3 times, most recently from d5f059f to 519cd6f Compare March 28, 2025 21:26
@bestbeforetoday bestbeforetoday marked this pull request as ready for review March 28, 2025 22:51
@bestbeforetoday bestbeforetoday requested a review from vbarua April 1, 2025 10:03
Copy link
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed that this was ready for re-review. Took a preliminary look at everything except the FunctionConverter, for which I need fresh eyes.

@Test
public void concatStringLiteralAndChar() throws Exception {
assertProtoPlanRoundrip("select 'brand_'||P_BRAND from PART");
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: Could you move these tests to the (new) FunctionConversionTest.java class.

I'm trying to group tests related to specific functions there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@Test
public void concatStringLiteralAndVarchar() throws IOException, SqlParseException {
assertFullRoundTrip("select 'part_'||P_NAME from PART");
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type system change you added definitely helped.

ScalarFunctionInvocation{
    declaration=concat:vchar,
    arguments=[
        Cast{type=VarChar{nullable=true, length=55}, input=FixedCharLiteral{nullable=false, value=part_}, failureBehavior=THROW_EXCEPTION},
        FieldReference{segments=[StructField{offset=1}], type=VarChar{nullable=true, length=55}}
    ],
    options=[],
    outputType=VarChar{nullable=true, length=60}
}

Both inputs are now varchars 🎉

The output schema doesn't quite match Substrait still, which would expect VarChar<55> instead of VarChar<60>.

That's happening because the Calcite definition of CONCAT
uses ReturnTypes.DYADIC_STRING_SUM_PRECISION_NULLABLE for its return type inference which ends up doing the following

Type-inference strategy for concatenating two string arguments. The result type of a call is:

  • the same type as the input types but with the combined length of the two first types
  • if types are of char type the type with the highest coercibility will be used
  • result is varying if either input is; otherwise fixed

Pre-requisites:

  • input types must be of the same string type
  • types must be comparable without casting

I think we can tackle that as part of #379

Copy link
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks reasonable to me.

@bestbeforetoday could you confirm for me that there are no functionality changes in FunctionConverter.java? From what I reviewed, it looks like there are only cleanups and code simplifications.

Additionally, could you move the tests in ExpressionConvertabilityTest.java as per https://github.com/substrait-io/substrait-java/pull/338/files#r2038494522 and fix the conflicts.

Once those things are done, I'm happy to merge this.

return Optional.of(function);
}
private Optional<F> signatureMatch(List<Type> inputTypes, Type outputType) {
for (F function : functions) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

REVIEW NOTES: This works because operator was never used, and getSignatureMathcher was only ever called with the functions attached the FunctionFinder class.

outputType));
}
return Optional.empty();
var out = singularInputType.orElseThrow().tryMatch(type, outputType);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

REVIEW NOTE: get() has the same behaviour as orElseThrow()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. No functional change by using orElseThrow(). The Java API documentation for get() says that orElseThrow() is preferred. I guess just for consistency with the other orElse... methods.

call.getOperands(),
(a, b) -> {
Type type = typeConverter.toSubstrait(b.getType());
return coerceArgument(a, type);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nice cleanup re-using operandTypes here

}

protected abstract T generateBinding(
C call, F function, List<FunctionArg> arguments, Type outputType);
C call, F function, List<? extends FunctionArg> arguments, Type outputType);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Good call making arguments allow any FunctionArg. Make some of the above code nicer.

return Optional.empty();
};
};
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

REVIEW NOTE: this code is unused

return true;
}
return inputType.accept(new IgnoreNullableAndParameters(type));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

REVIEW NOTE: this code is redundant with isMatch(ParameterizedType actualType, ParameterizedType targetType) below

Copy link
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marking as waiting for changes.

The type matching used when mapping Calcite to Substrait expressions
required parameters and return type of the concat function to be either
all varchar or all string. Any mixing of types caused a failure.

Signed-off-by: Mark S. Lewis <Mark.S.Lewis@outlook.com>
@bestbeforetoday
Copy link
Contributor Author

@vbarua There are no longer any functionality changes to FunctionConverter.java. In previous iterations of this change there were but the type system change you suggested meant I could remove those changes. What is left is code refactor / clean-up that helped me to figure out what was actually happening, and to facilitate the (now removed) functional changes.

I have rebased on the latest main branch state and moved unit tests to the new FunctionConversionTest. I think all the review comments are addressed and this is ready for re-review. Please let me know if I missed anything.

@bestbeforetoday bestbeforetoday requested a review from vbarua April 22, 2025 13:53
Copy link
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good, I will merge this later today.

@vbarua
Copy link
Member

vbarua commented Apr 22, 2025

@bestbeforetoday thanks for working on this.

In the future, it would be helpful to call out that changes are mostly refactors and/or split them into their own PRs. The FunctionConverter code is pretty hairy, even for me, so I kept trying to find a time when I was mentally fresh to review them, but it turned out they were easy to reviews and I could have done it anytime really.

@bestbeforetoday
Copy link
Contributor Author

@vbarua Are you able to merge this change so I can safely start additional changes to improve TPC-DS compatibility on top of it?

@vbarua vbarua merged commit a1fbe53 into substrait-io:main Apr 23, 2025
12 checks passed
@vbarua
Copy link
Member

vbarua commented Apr 23, 2025

@bestbeforetoday thanks for the reminder. Merged.

@bestbeforetoday bestbeforetoday deleted the concat branch April 23, 2025 15:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ISTHMUS] [TPCDS] Fix Concat scalar function support
2 participants