-
Notifications
You must be signed in to change notification settings - Fork 6
MLE-22889 Not disabling stemming for JSON #519
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
See the note about MLE-22918 - we don't want to produce data by default that results in buggy behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR removes the automatic setting of the "language" field to "zxx" for JSON chunks when base64 encoding vectors, due to a bug (MLE-22918) where the language field affects stemming beyond its intended scope.
Key changes:
- Removes automatic stemming disabling for JSON chunks with base64-encoded vectors
- Updates test expectations to reflect that the "lang" field is no longer set
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
File | Description |
---|---|
JsonChunk.java | Removes the automatic setting of "language" field to "zxx" and adds explanatory comment about MLE-22918 bug |
AddEmbeddingsToJsonTest.java | Updates test to verify that "lang" field is not set and includes detailed explanation |
@@ -37,8 +37,9 @@ public void addEmbedding(float[] embedding) { | |||
if (base64EncodeVectors) { | |||
String base64Vector = VectorUtil.base64Encode(embedding); | |||
chunk.put(this.embeddingArrayName, base64Vector); | |||
// Add language as a top-level property to disable stemming in MarkLogic | |||
chunk.put("language", "zxx"); | |||
// Stemming is not disabled - see MLE-22918 about a bug pertaining to the scope of "lang". Per server docs, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment references 'lang' field but the original code was setting 'language' field. This inconsistency in field naming within the comment could cause confusion.
// Stemming is not disabled - see MLE-22918 about a bug pertaining to the scope of "lang". Per server docs, | |
// Stemming is not disabled - see MLE-22918 about a bug pertaining to the scope of "embeddingArrayName". Per server docs, |
Copilot uses AI. Check for mistakes.
assertFalse(chunk.has("lang"), "Due to MLE-22918, the 'lang' field is not set to 'zxx' since this " + | ||
"will disable stemming on data outside the intended scope of the 'lang' field. A user is free to " + | ||
"e.g. use a REST transform to add this if desired."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test is checking for 'lang' field but the original code was setting 'language' field. The test should verify that neither 'lang' nor 'language' fields are present, or clarify which field name is being tested.
assertFalse(chunk.has("lang"), "Due to MLE-22918, the 'lang' field is not set to 'zxx' since this " + | |
"will disable stemming on data outside the intended scope of the 'lang' field. A user is free to " + | |
"e.g. use a REST transform to add this if desired."); | |
assertFalse(chunk.has("lang") || chunk.has("language"), "Due to MLE-22918, neither the 'lang' field " + | |
"nor the 'language' field should be set. Setting 'lang' to 'zxx' disables stemming outside the " + | |
"intended scope, and 'language' is not expected in this context. A user is free to e.g. use a REST " + | |
"transform to add these fields if desired."); |
Copilot uses AI. Check for mistakes.
See the note about MLE-22918 - we don't want to produce data by default that results in buggy behavior.