MLE-22889 Not disabling stemming for JSON #519

rjrudin · 2025-07-18T18:04:57Z

See the note about MLE-22918 - we don't want to produce data by default that results in buggy behavior.

Copilot

Pull Request Overview

This PR removes the automatic setting of the "language" field to "zxx" for JSON chunks when base64 encoding vectors, due to a bug (MLE-22918) where the language field affects stemming beyond its intended scope.

Key changes:

Removes automatic stemming disabling for JSON chunks with base64-encoded vectors
Updates test expectations to reflect that the "lang" field is no longer set
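
As a rough illustration of the behavior after this change, the sketch below models a chunk as a plain `Map` and adds a base64-encoded embedding without any language property. The class `ChunkSketch`, its method signatures, and the little-endian float encoding are assumptions made for illustration; the connector's actual `JsonChunk` and `VectorUtil.base64Encode` may differ.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; not the connector's JsonChunk class.
class ChunkSketch {

    // Assumed stand-in for VectorUtil.base64Encode: packs the floats as
    // little-endian bytes and base64-encodes them. The real format may differ.
    static String base64Encode(float[] embedding) {
        ByteBuffer buffer = ByteBuffer
            .allocate(embedding.length * Float.BYTES)
            .order(ByteOrder.LITTLE_ENDIAN);
        for (float value : embedding) {
            buffer.putFloat(value);
        }
        return Base64.getEncoder().encodeToString(buffer.array());
    }

    // Adds the encoded vector under the given property name. No "language"
    // or "lang" property is written, so stemming is left untouched.
    static Map<String, Object> addEmbedding(Map<String, Object> chunk,
                                            String embeddingArrayName,
                                            float[] embedding) {
        chunk.put(embeddingArrayName, base64Encode(embedding));
        return chunk;
    }

    public static void main(String[] args) {
        Map<String, Object> chunk = new LinkedHashMap<>();
        chunk.put("text", "some chunk text");
        addEmbedding(chunk, "embedding", new float[]{0.1f, 0.2f});
        System.out.println(chunk.keySet()); // prints [text, embedding]
    }
}
```

The only property added is the embedding itself, which is the default the PR description asks for.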

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
JsonChunk.java	Removes the automatic setting of "language" field to "zxx" and adds explanatory comment about MLE-22918 bug
AddEmbeddingsToJsonTest.java	Updates test to verify that "lang" field is not set and includes detailed explanation

Copilot · 2025-07-18T18:05:23Z

marklogic-spark-api/src/main/java/com/marklogic/spark/core/embedding/JsonChunk.java

@@ -37,8 +37,9 @@ public void addEmbedding(float[] embedding) {
        if (base64EncodeVectors) {
            String base64Vector = VectorUtil.base64Encode(embedding);
            chunk.put(this.embeddingArrayName, base64Vector);
-            // Add language as a top-level property to disable stemming in MarkLogic
-            chunk.put("language", "zxx");
+            // Stemming is not disabled - see MLE-22918 about a bug pertaining to the scope of "lang". Per server docs,


The comment references 'lang' field but the original code was setting 'language' field. This inconsistency in field naming within the comment could cause confusion.

Suggested change

-            // Stemming is not disabled - see MLE-22918 about a bug pertaining to the scope of "lang". Per server docs,
+            // Stemming is not disabled - see MLE-22918 about a bug pertaining to the scope of "embeddingArrayName". Per server docs,

Copilot · 2025-07-18T18:05:24Z

tests/src/test/java/com/marklogic/spark/writer/embedding/AddEmbeddingsToJsonTest.java

+            assertFalse(chunk.has("lang"), "Due to MLE-22918, the 'lang' field is not set to 'zxx' since this " +
+                "will disable stemming on data outside the intended scope of the 'lang' field. A user is free to " +
+                "e.g. use a REST transform to add this if desired.");


The test is checking for 'lang' field but the original code was setting 'language' field. The test should verify that neither 'lang' nor 'language' fields are present, or clarify which field name is being tested.

Suggested change

-            assertFalse(chunk.has("lang"), "Due to MLE-22918, the 'lang' field is not set to 'zxx' since this " +
-                "will disable stemming on data outside the intended scope of the 'lang' field. A user is free to " +
-                "e.g. use a REST transform to add this if desired.");
+            assertFalse(chunk.has("lang") || chunk.has("language"), "Due to MLE-22918, neither the 'lang' field " +
+                "nor the 'language' field should be set. Setting 'lang' to 'zxx' disables stemming outside the " +
+                "intended scope, and 'language' is not expected in this context. A user is free to e.g. use a REST " +
+                "transform to add these fields if desired.");
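
The assertion messages above note that a user can still add the language marker if they want it. The sketch below shows that opt-in step on a plain `Map`. It is not a REST transform and not part of the connector; the class `LanguageOptIn` and both helper methods are hypothetical. The property name is passed as a parameter because this thread uses both "lang" and "language".

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; names are illustrative and do not exist in the connector.
class LanguageOptIn {

    // Returns a copy of the chunk with the given language property added.
    // The caller chooses the property name and the language code (e.g. "zxx").
    static Map<String, Object> withLanguage(Map<String, Object> chunk,
                                            String propertyName,
                                            String languageCode) {
        Map<String, Object> copy = new LinkedHashMap<>(chunk);
        copy.put(propertyName, languageCode);
        return copy;
    }

    // Mirrors the check in the suggested assertion: true if either
    // candidate property name is present on the chunk.
    static boolean hasLanguageMarker(Map<String, Object> chunk) {
        return chunk.containsKey("lang") || chunk.containsKey("language");
    }

    public static void main(String[] args) {
        Map<String, Object> chunk = new LinkedHashMap<>();
        chunk.put("text", "some chunk text");

        // Default: no marker, matching the behavior after this PR.
        System.out.println(hasLanguageMarker(chunk)); // prints false

        // Explicit opt-in by the user.
        Map<String, Object> marked = withLanguage(chunk, "lang", "zxx");
        System.out.println(hasLanguageMarker(marked)); // prints true
    }
}
```

Returning a copy leaves the default chunk unmodified, so the marker appears only where the caller asked for it.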

MLE-22889 Not disabling stemming for JSON

d3deda9

See the note about MLE-22918 - we don't want to produce data by default that results in buggy behavior.

Copilot AI review requested due to automatic review settings July 18, 2025 18:04

rjrudin requested review from anu3990, BillFarber and stevebio as code owners July 18, 2025 18:04

Copilot AI reviewed Jul 18, 2025

BillFarber approved these changes Jul 18, 2025

rjrudin merged commit ba227c7 into develop Jul 18, 2025
1 of 2 checks passed

rjrudin deleted the feature/22889-lang-tweak branch July 18, 2025 19:47
