Skip to content

Don't convert schema-defined strings to other types during validation #3761

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

dbwiddis
Copy link
Member

Description

Prevents converting strings to numbers or any other validation-failing conversion, by limiting the conversion specifically to objects or arrays.

Wrote a test which failed similar to existing code by converting a string to a number, then updated the util so the test passes.

Related Issues

Resolves #3758

Check List

  • New functionality includes testing.
  • Commits are signed per the DCO using --signoff.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

JsonNode parsedValue = mapper.readTree(textValue);
// If successful, replace the string with the parsed JSON
parametersNode.set(key, parsedValue);
// Only try to parse as JSON if it looks like a JSON object or array
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@b4sjoo , can you help check if this fix is backward compatible or not ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't worry to much about the bwc here since it does not related to node-node data transfer

@@ -161,6 +161,13 @@ public void testProcessRemoteInferenceInputDataSetParametersValueWithParametersP
assertEquals(expectedJson, processedJson);
}

@Test
public void testProcessRemoteInferenceInputDataSetParametersValueWithParametersQuotedNumber() throws IOException {
String json = "{\"key1\":\"foo\",\"key2\":123,\"key3\":true,\"parameters\":{\"key1\":\"123\",\"key2\":123,\"key3\":true}}";
Copy link
Collaborator

@ylwu-amzn ylwu-amzn Apr 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per my understanding the processRemoteInferenceInputDataSetParametersValue method was introduced to convert the string value "123" to its correct data type 123. If the input schema defines key1 is int type, then the validation can pass with correct 123 value.

@b4sjoo , correct me if wrong

If that's the case, I think this fix will break some other use case

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per my understanding the processRemoteInferenceInputDataSetParametersValue method was introduced to convert the string value "123" to its correct data type 123.

Understood that this is the design...

But in this case it's converting to an incorrect data type.

If that's the case, I think this fix will break some other use case

Likely so; but we need some way to control this conversion when we have a string in the schema.

I'll flip this PR to draft while this is discussed.

Copy link
Collaborator

@ylwu-amzn ylwu-amzn Apr 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to discuss more.
One option in my mind: we can check the data type of the input schema, if the parameter type is string, but the value is not string, we can just convert it to string value.

For example: Assume we have key1 as string

key1 value is string "123". For this case, we have two options:

  • check data type in processRemoteInferenceInputDataSetParametersValue, since the value is string, we don't need to convert it to 123.
  • Still convert "123" to int type in processRemoteInferenceInputDataSetParametersValue, keep the logic as is, but we check parameter type in TransportPredictionTaskAction#validateInputSchema method, as the key1 is string type, we will not throw exception for 123

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still convert "123" to int type in processRemoteInferenceInputDataSetParametersValue, keep the logic as is,

This will be a bug... any string starting with numbers gets converted to just the number. So if the input string is "123 foo" it will get converted to just the number 123; even if we later convert 123 to a string we can't get the "foo" back.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, that sounds we need to check parameter data type defined in input schema and only convert if the data type is not string.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @dbwiddis's idea to solve this issue is making sense to me, since the purpose of the original code is just to fix array and object parsing issue by remote LLM services, see:

  1. Fix cohere model input interface cannot validate cohere input issue #2847
  2. Add processed function for remote inference input dataset parameters to convert it back to its orignal datatype #2852

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, that sounds we need to check parameter data type defined in input schema and only convert if the data type is not string.

Basec on @b4sjoo comment above, I think we may want to keep this fix but also later consider case-by-case conversion of string-to-number based on that type?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the original PR #2852

I see such code comment

 // Process the parameters field in the input dataset to convert it back to its original datatype, instead of a string

Except "the array and object parsing issue", I think this also covers this case: the input parameter type defined as int in input schema, but the input value is "123", right ? Otherwise, the input schema validation will throw exception?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I need to understand the original use case we're trying to solve here, and all of the possibilities we need to address.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest add more UT to cover different use cases. This method processRemoteInferenceInputDataSetParametersValue is only being used by TransportPredictionTaskAction#validateInputSchema method. Suggest add more UTs for validateInputSchema method as well.

@dbwiddis dbwiddis marked this pull request as draft April 23, 2025 17:00
if ((textValue.startsWith("{") && textValue.endsWith("}"))
|| (textValue.startsWith("[") && textValue.endsWith("]"))) {
// Try to parse the string as JSON
JsonNode parsedValue = mapper.readTree(textValue);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method is going to impact boarder use cases, not only the integer within the string

// If successful, replace the string with the parsed JSON
parametersNode.set(key, parsedValue);
// Only try to parse as JSON if it looks like a JSON object or array
if ((textValue.startsWith("{") && textValue.endsWith("}"))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the else branch? In some cases, the previous logic is to catch the exception and set the value. But now nothing will happen.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. Fixed.

Copy link

codecov bot commented Apr 23, 2025

Codecov Report

Attention: Patch coverage is 85.71429% with 1 line in your changes missing coverage. Please review.

Project coverage is 77.99%. Comparing base (c3b1fd6) to head (e3ae3f0).

Files with missing lines Patch % Lines
...main/java/org/opensearch/ml/utils/MLNodeUtils.java 83.33% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3761      +/-   ##
============================================
+ Coverage     77.98%   77.99%   +0.01%     
- Complexity     7316     7323       +7     
============================================
  Files           655      655              
  Lines         33037    33040       +3     
  Branches       3708     3709       +1     
============================================
+ Hits          25764    25770       +6     
+ Misses         5688     5683       -5     
- Partials       1585     1587       +2     
Flag Coverage Δ
ml-commons 77.99% <85.71%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@dbwiddis dbwiddis requested a deployment to ml-commons-cicd-env-require-approval April 23, 2025 17:48 — with GitHub Actions Waiting
@dbwiddis dbwiddis requested a deployment to ml-commons-cicd-env-require-approval April 23, 2025 17:48 — with GitHub Actions Waiting
@dbwiddis dbwiddis temporarily deployed to ml-commons-cicd-env-require-approval April 23, 2025 18:34 — with GitHub Actions Inactive
@dbwiddis dbwiddis temporarily deployed to ml-commons-cicd-env-require-approval April 23, 2025 18:34 — with GitHub Actions Inactive
@dbwiddis dbwiddis had a problem deploying to ml-commons-cicd-env-require-approval April 23, 2025 18:34 — with GitHub Actions Error
@dbwiddis dbwiddis had a problem deploying to ml-commons-cicd-env-require-approval April 23, 2025 18:34 — with GitHub Actions Failure
parametersNode.set(key, parsedValue);
// Only try to parse as JSON if it looks like a JSON object or array
if ((textValue.startsWith("{") && textValue.endsWith("}"))
|| (textValue.startsWith("[") && textValue.endsWith("]"))) {
Copy link
Collaborator

@b4sjoo b4sjoo Apr 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will it evaluate correctly to normal string if I construct a string like "[Something] and in the middle is normal string [something]"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would probably throw an exception and revert to the original string.

Sounds like a test case to add though!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will it evaluate correctly to normal string if I construct a string like "[Something] and in the middle is normal string [something]"

Looks like it may parse to an array of [Something] (which since not in quotes would fail parsing) unless we tell it to fail on things past the value it's parsing (see FasterXML/jackson-databind#4035).

This looks very configurable but also makes a lot of assumptions, and we really do need to get a lot of test cases here to make sure we are getting what we want.

The number preceding text on ingest seems a very common case to encounter when parsing numbered list section headers.

@dbwiddis dbwiddis force-pushed the no-string-to-number branch from 5dd9e10 to 8399760 Compare May 6, 2025 01:57
@dbwiddis dbwiddis temporarily deployed to ml-commons-cicd-env-require-approval May 6, 2025 01:59 — with GitHub Actions Inactive
@dbwiddis dbwiddis temporarily deployed to ml-commons-cicd-env-require-approval May 6, 2025 01:59 — with GitHub Actions Inactive
@dbwiddis dbwiddis had a problem deploying to ml-commons-cicd-env-require-approval May 6, 2025 01:59 — with GitHub Actions Error
@dbwiddis dbwiddis had a problem deploying to ml-commons-cicd-env-require-approval May 6, 2025 01:59 — with GitHub Actions Failure
@dbwiddis dbwiddis requested a deployment to ml-commons-cicd-env-require-approval May 6, 2025 17:02 — with GitHub Actions Waiting
@dbwiddis dbwiddis requested a deployment to ml-commons-cicd-env-require-approval May 6, 2025 17:02 — with GitHub Actions Waiting
@dbwiddis
Copy link
Member Author

dbwiddis commented May 7, 2025

Hey all, I'm happy to try to see this PR to completion, but I'm really unclear about the requirements here.

The initial bug cited a simple test case:

POST /_plugins/_ml/models/A9y_EpYBMfB3_phs7_V5/_predict
{
  "parameters": {
    "inputText" : "5.11"
  }
}

This appears to be a relatively common occurrence in actual document importing due to numbering of section headers, e.g., "1.3 Summary". The existing Jackson parsing turns any string beginning with a number into that number, so for the test case, "5.11" --> 5.11 and for my "1.3 Summary" example, it would become just 1.3.

To complete this PR I really need to know what the expected input for this method is. Is it sufficient to say that if the "inputText" field in "parameters" is defined as a "string" type, it should be left alone? Should it be converted to an object? Should we restrict the Jackson conversions to not change any types (e.g., never convert "5.11" to 5.11?).

I'd love a list of test cases and I can update this PR to resolve them. I'm just unclear on the intended use of these util methods.

CC: @b4sjoo @ylwu-amzn @mingshl

@b4sjoo
Copy link
Collaborator

b4sjoo commented May 9, 2025

Hey @dbwiddis, I guess if we want to fix this issue once and forever we might need to find a way to skip evaluating all elements that is supposed to be a string field by the interface, just evaluate those fields which are supposed to be other type, like integer, object, array, etc.?

@b4sjoo
Copy link
Collaborator

b4sjoo commented May 9, 2025

Just want to brainstorm out a way to entirely fix this issue instead of enumerate edge cases in a brute force way...

@dbwiddis dbwiddis force-pushed the no-string-to-number branch from 8399760 to 1b71230 Compare May 9, 2025 15:28
@dbwiddis dbwiddis changed the title Don't convert strings to numbers in schema validation Don't convert schema-defined strings to other types during validation May 9, 2025
@dbwiddis dbwiddis had a problem deploying to ml-commons-cicd-env-require-approval May 9, 2025 15:29 — with GitHub Actions Failure
@dbwiddis dbwiddis had a problem deploying to ml-commons-cicd-env-require-approval May 9, 2025 15:29 — with GitHub Actions Error
@dbwiddis dbwiddis had a problem deploying to ml-commons-cicd-env-require-approval May 9, 2025 15:29 — with GitHub Actions Failure
@dbwiddis dbwiddis had a problem deploying to ml-commons-cicd-env-require-approval May 9, 2025 15:29 — with GitHub Actions Error
@dbwiddis dbwiddis marked this pull request as ready for review May 9, 2025 15:31
@dbwiddis
Copy link
Member Author

dbwiddis commented May 9, 2025

I guess if we want to fix this issue once and forever we might need to find a way to skip evaluating all elements that is supposed to be a string field by the interface

I've updated the fix to prevent converting a string explicitly defined by the schema to be a string to anything else.

It may not handle all possible errors and there's room to expand the validation here, but this definitely improves the current situation. WDYT?

CC: @ylwu-amzn @mingshl @b4sjoo

@b4sjoo
Copy link
Collaborator

b4sjoo commented May 13, 2025

This PR looks good to me, have you tested with a test cluster?

@dbwiddis
Copy link
Member Author

best if someone who reproduced the initial bug can test it

@b4sjoo
Copy link
Collaborator

b4sjoo commented May 13, 2025

Sounds good, I can approve after you rebase the code

Signed-off-by: Daniel Widdis <widdis@gmail.com>
@dbwiddis dbwiddis force-pushed the no-string-to-number branch from 1b71230 to e3ae3f0 Compare May 13, 2025 21:13
@dbwiddis dbwiddis temporarily deployed to ml-commons-cicd-env-require-approval May 13, 2025 21:15 — with GitHub Actions Inactive
@dbwiddis dbwiddis temporarily deployed to ml-commons-cicd-env-require-approval May 13, 2025 21:15 — with GitHub Actions Inactive
@dbwiddis dbwiddis temporarily deployed to ml-commons-cicd-env-require-approval May 13, 2025 21:15 — with GitHub Actions Inactive
@dbwiddis dbwiddis temporarily deployed to ml-commons-cicd-env-require-approval May 13, 2025 21:15 — with GitHub Actions Inactive
@dbwiddis dbwiddis temporarily deployed to ml-commons-cicd-env-require-approval May 13, 2025 22:33 — with GitHub Actions Inactive
@dbwiddis dbwiddis temporarily deployed to ml-commons-cicd-env-require-approval May 13, 2025 22:33 — with GitHub Actions Inactive
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] model interface validation failed when there is integer within text
5 participants