Skip to content

MLE-22889 Not disabling stemming for JSON #519

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 18, 2025
Merged

Conversation

rjrudin
Copy link
Contributor

@rjrudin rjrudin commented Jul 18, 2025

See the note about MLE-22918 - we don't want to produce data by default that results in buggy behavior.

See the note about MLE-22918 - we don't want to produce data by default that results in buggy behavior.
@Copilot Copilot AI review requested due to automatic review settings July 18, 2025 18:04
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR removes the automatic setting of the "language" field to "zxx" for JSON chunks when base64 encoding vectors, due to a bug (MLE-22918) where the language field affects stemming beyond its intended scope.

Key changes:

  • Removes automatic stemming disabling for JSON chunks with base64-encoded vectors
  • Updates test expectations to reflect that the "lang" field is no longer set

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
JsonChunk.java Removes the automatic setting of "language" field to "zxx" and adds explanatory comment about MLE-22918 bug
AddEmbeddingsToJsonTest.java Updates test to verify that "lang" field is not set and includes detailed explanation

@@ -37,8 +37,9 @@ public void addEmbedding(float[] embedding) {
if (base64EncodeVectors) {
String base64Vector = VectorUtil.base64Encode(embedding);
chunk.put(this.embeddingArrayName, base64Vector);
// Add language as a top-level property to disable stemming in MarkLogic
chunk.put("language", "zxx");
// Stemming is not disabled - see MLE-22918 about a bug pertaining to the scope of "lang". Per server docs,
Copy link
Preview

Copilot AI Jul 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment references 'lang' field but the original code was setting 'language' field. This inconsistency in field naming within the comment could cause confusion.

Suggested change
// Stemming is not disabled - see MLE-22918 about a bug pertaining to the scope of "lang". Per server docs,
// Stemming is not disabled - see MLE-22918 about a bug pertaining to the scope of "embeddingArrayName". Per server docs,

Copilot uses AI. Check for mistakes.

Comment on lines +468 to +470
assertFalse(chunk.has("lang"), "Due to MLE-22918, the 'lang' field is not set to 'zxx' since this " +
"will disable stemming on data outside the intended scope of the 'lang' field. A user is free to " +
"e.g. use a REST transform to add this if desired.");
Copy link
Preview

Copilot AI Jul 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test is checking for 'lang' field but the original code was setting 'language' field. The test should verify that neither 'lang' nor 'language' fields are present, or clarify which field name is being tested.

Suggested change
assertFalse(chunk.has("lang"), "Due to MLE-22918, the 'lang' field is not set to 'zxx' since this " +
"will disable stemming on data outside the intended scope of the 'lang' field. A user is free to " +
"e.g. use a REST transform to add this if desired.");
assertFalse(chunk.has("lang") || chunk.has("language"), "Due to MLE-22918, neither the 'lang' field " +
"nor the 'language' field should be set. Setting 'lang' to 'zxx' disables stemming outside the " +
"intended scope, and 'language' is not expected in this context. A user is free to e.g. use a REST " +
"transform to add these fields if desired.");

Copilot uses AI. Check for mistakes.

@rjrudin rjrudin merged commit ba227c7 into develop Jul 18, 2025
1 of 2 checks passed
@rjrudin rjrudin deleted the feature/22889-lang-tweak branch July 18, 2025 19:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants