-
Notifications
You must be signed in to change notification settings - Fork 158
Don't convert schema-defined strings to other types during validation #3761
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
JsonNode parsedValue = mapper.readTree(textValue); | ||
// If successful, replace the string with the parsed JSON | ||
parametersNode.set(key, parsedValue); | ||
// Only try to parse as JSON if it looks like a JSON object or array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@b4sjoo , can you help check if this fix is backward compatible or not ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't worry to much about the bwc here since it does not related to node-node data transfer
@@ -161,6 +161,13 @@ public void testProcessRemoteInferenceInputDataSetParametersValueWithParametersP | |||
assertEquals(expectedJson, processedJson); | |||
} | |||
|
|||
@Test | |||
public void testProcessRemoteInferenceInputDataSetParametersValueWithParametersQuotedNumber() throws IOException { | |||
String json = "{\"key1\":\"foo\",\"key2\":123,\"key3\":true,\"parameters\":{\"key1\":\"123\",\"key2\":123,\"key3\":true}}"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per my understanding the processRemoteInferenceInputDataSetParametersValue
method was introduced to convert the string value "123"
to its correct data type 123
. If the input schema defines key1
is int
type, then the validation can pass with correct 123
value.
@b4sjoo , correct me if wrong
If that's the case, I think this fix will break some other use case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per my understanding the
processRemoteInferenceInputDataSetParametersValue
method was introduced to convert the string value"123"
to its correct data type123
.
Understood that this is the design...
But in this case it's converting to an incorrect data type.
If that's the case, I think this fix will break some other use case
Likely so; but we need some way to control this conversion when we have a string in the schema.
I'll flip this PR to draft while this is discussed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to discuss more.
One option in my mind: we can check the data type of the input schema, if the parameter type is string
, but the value is not string, we can just convert it to string value.
For example: Assume we have key1
as string
key1
value is string "123"
. For this case, we have two options:
- check data type in
processRemoteInferenceInputDataSetParametersValue
, since the value is string, we don't need to convert it to123
. - Still convert
"123"
toint
type inprocessRemoteInferenceInputDataSetParametersValue
, keep the logic as is, but we check parameter type inTransportPredictionTaskAction#validateInputSchema
method, as thekey1
isstring
type, we will not throw exception for123
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still convert "123" to int type in processRemoteInferenceInputDataSetParametersValue, keep the logic as is,
This will be a bug... any string starting with numbers gets converted to just the number. So if the input string is "123 foo" it will get converted to just the number 123; even if we later convert 123 to a string we can't get the "foo" back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, that sounds we need to check parameter data type defined in input schema and only convert if the data type is not string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @dbwiddis's idea to solve this issue is making sense to me, since the purpose of the original code is just to fix array and object parsing issue by remote LLM services, see:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, that sounds we need to check parameter data type defined in input schema and only convert if the data type is not string.
Basec on @b4sjoo comment above, I think we may want to keep this fix but also later consider case-by-case conversion of string-to-number based on that type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the original PR #2852
I see such code comment
// Process the parameters field in the input dataset to convert it back to its original datatype, instead of a string
Except "the array and object parsing issue", I think this also covers this case: the input parameter type defined as int
in input schema, but the input value is "123"
, right ? Otherwise, the input schema validation will throw exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I need to understand the original use case we're trying to solve here, and all of the possibilities we need to address.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest add more UT to cover different use cases. This method processRemoteInferenceInputDataSetParametersValue
is only being used by TransportPredictionTaskAction#validateInputSchema
method. Suggest add more UTs for validateInputSchema
method as well.
if ((textValue.startsWith("{") && textValue.endsWith("}")) | ||
|| (textValue.startsWith("[") && textValue.endsWith("]"))) { | ||
// Try to parse the string as JSON | ||
JsonNode parsedValue = mapper.readTree(textValue); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this method is going to impact boarder use cases, not only the integer within the string
// If successful, replace the string with the parsed JSON | ||
parametersNode.set(key, parsedValue); | ||
// Only try to parse as JSON if it looks like a JSON object or array | ||
if ((textValue.startsWith("{") && textValue.endsWith("}")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about the else branch? In some cases, the previous logic is to catch the exception and set the value. But now nothing will happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops. Fixed.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3761 +/- ##
============================================
+ Coverage 77.98% 77.99% +0.01%
- Complexity 7316 7323 +7
============================================
Files 655 655
Lines 33037 33040 +3
Branches 3708 3709 +1
============================================
+ Hits 25764 25770 +6
+ Misses 5688 5683 -5
- Partials 1585 1587 +2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
parametersNode.set(key, parsedValue); | ||
// Only try to parse as JSON if it looks like a JSON object or array | ||
if ((textValue.startsWith("{") && textValue.endsWith("}")) | ||
|| (textValue.startsWith("[") && textValue.endsWith("]"))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will it evaluate correctly to normal string if I construct a string like "[Something] and in the middle is normal string [something]"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would probably throw an exception and revert to the original string.
Sounds like a test case to add though!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will it evaluate correctly to normal string if I construct a string like "[Something] and in the middle is normal string [something]"
Looks like it may parse to an array of [Something]
(which since not in quotes would fail parsing) unless we tell it to fail on things past the value it's parsing (see FasterXML/jackson-databind#4035).
This looks very configurable but also makes a lot of assumptions, and we really do need to get a lot of test cases here to make sure we are getting what we want.
The number preceding text on ingest seems a very common case to encounter when parsing numbered list section headers.
5dd9e10
to
8399760
Compare
Hey all, I'm happy to try to see this PR to completion, but I'm really unclear about the requirements here. The initial bug cited a simple test case:
This appears to be a relatively common occurrence in actual document importing due to numbering of section headers, e.g., "1.3 Summary". The existing Jackson parsing turns any string beginning with a number into that number, so for the test case, "5.11" --> 5.11 and for my "1.3 Summary" example, it would become just 1.3. To complete this PR I really need to know what the expected input for this method is. Is it sufficient to say that if the I'd love a list of test cases and I can update this PR to resolve them. I'm just unclear on the intended use of these util methods. |
Hey @dbwiddis, I guess if we want to fix this issue once and forever we might need to find a way to skip evaluating all elements that is supposed to be a string field by the interface, just evaluate those fields which are supposed to be other type, like integer, object, array, etc.? |
Just want to brainstorm out a way to entirely fix this issue instead of enumerate edge cases in a brute force way... |
8399760
to
1b71230
Compare
I've updated the fix to prevent converting a string explicitly defined by the schema to be a string to anything else. It may not handle all possible errors and there's room to expand the validation here, but this definitely improves the current situation. WDYT? |
This PR looks good to me, have you tested with a test cluster? |
best if someone who reproduced the initial bug can test it |
Sounds good, I can approve after you rebase the code |
Signed-off-by: Daniel Widdis <widdis@gmail.com>
1b71230
to
e3ae3f0
Compare
Description
Prevents converting strings to numbers or any other validation-failing conversion, by limiting the conversion specifically to objects or arrays.
Wrote a test which failed similar to existing code by converting a string to a number, then updated the util so the test passes.
Related Issues
Resolves #3758
Check List
--signoff
.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.