Commit 5c041b6

Merge pull request #107 from marklogic/feature/custom-code-partitions

DEVEXP-627 Can now read via user-defined partitions

2 parents: 17e0f83 + d218ec9

15 files changed: +184 −102 lines

.gitignore

Lines changed: 1 addition & 0 deletions

```diff
@@ -15,3 +15,4 @@ gradle-local.properties
 logs
 .ipynb_checkpoints
 venv
+.venv
```

docs/configuration.md

Lines changed: 4 additions & 4 deletions

```diff
@@ -111,13 +111,13 @@ The following options control how the connector reads rows from MarkLogic via cu
 | spark.marklogic.read.vars. | Prefix for user-defined variables to be sent to the custom code. |
 
 If you are using Spark's streaming support with custom code, the following options can also be used to control how
-batch identifiers are defined:
+partitions are defined:
 
 | Option | Description |
 | --- | --- |
-| spark.marklogic.read.batchIds.invoke | The path to a module to invoke; the module must be in your application's modules database. |
-| spark.marklogic.read.batchIds.javascript | JavaScript code to execute. |
-| spark.marklogic.read.batchIds.xquery | XQuery code to execute. |
+| spark.marklogic.read.partitions.invoke | The path to a module to invoke; the module must be in your application's modules database. |
+| spark.marklogic.read.partitions.javascript | JavaScript code to execute. |
+| spark.marklogic.read.partitions.xquery | XQuery code to execute. |
 
 ## Write options
```
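Because this commit renames the streaming read options, any existing job that still passes the old `spark.marklogic.read.batchIds.*` keys needs updating. A minimal sketch of the rename mapping, taken directly from the docs/configuration.md diff above (the `migrate_options` helper is hypothetical, not part of the connector):

```python
# Old option names (pre-commit) mapped to the names introduced by this commit.
OPTION_RENAMES = {
    "spark.marklogic.read.batchIds.invoke": "spark.marklogic.read.partitions.invoke",
    "spark.marklogic.read.batchIds.javascript": "spark.marklogic.read.partitions.javascript",
    "spark.marklogic.read.batchIds.xquery": "spark.marklogic.read.partitions.xquery",
}

def migrate_options(options):
    """Return a copy of a Spark read-options dict with the renamed keys applied."""
    return {OPTION_RENAMES.get(key, key): value for key, value in options.items()}

old_options = {
    "spark.marklogic.read.batchIds.javascript": "xdmp.databaseForests(xdmp.database('spark-example-content'))",
    "spark.marklogic.read.javascript": "cts.uris(null, null, cts.collectionQuery('employee'));",
}
new_options = migrate_options(old_options)
```

Keys outside the mapping, such as `spark.marklogic.read.javascript`, pass through unchanged.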

docs/reading.md

Lines changed: 18 additions & 16 deletions

````diff
@@ -357,36 +357,38 @@ from MarkLogic can be useful when your custom code for reading data may take a l
 nature of your custom code, running the query incrementally to produce smaller batches may be a better fit for your
 use case.
 
+(TODO This needs to be rewritten, will do so in a follow up PR.)
+
 To stream results from your custom code, the connector must know how batches can be constructed based on the results of
 your custom code. Because the connector does not know anything about your code, the connector needs to run an
-additional set of custom code that you implement to provide a sequence of "batch identifiers" to the connector. The
-connector will then run your custom once for each of your batch identifiers, with the batch identifier being passed as
+additional set of custom code that you implement to provide a sequence of partitions to the connector. The
+connector will then run your custom code once for each of your partitions, with the partition being passed as
 an external variable to your custom code.
 
-The code to run for providing a sequence of batch identifiers must be defined via one of the following options:
+The code to run for providing a sequence of partitions must be defined via one of the following options:
 
-- `spark.marklogic.read.batchIds.invoke` - a JavaScript or XQuery module path to invoke.
-- `spark.marklogic.read.batchIds.javascript` - a JavaScript program to evaluate.
-- `spark.marklogic.read.batchIds.xquery` - an XQuery program to evaluate.
+- `spark.marklogic.read.partitions.invoke` - a JavaScript or XQuery module path to invoke.
+- `spark.marklogic.read.partitions.javascript` - a JavaScript program to evaluate.
+- `spark.marklogic.read.partitions.xquery` - an XQuery program to evaluate.
 
 Note that any variables you define via the `spark.marklogic.reads.vars` prefix will also be sent to the above code,
 in addition to the code you define for reading rows.
 
-You are free to return any sequence of batch identifiers. For each one, the connector will invoke your regular custom
-code with an external variable named `BATCH_ID` of type `String`. You are then free to use this value to return
-a set of results associated with the batch.
+You are free to return any sequence of partitions. For each one, the connector will invoke your regular custom
+code with an external variable named `PARTITION` of type `String`. You are then free to use this value to return
+a set of results associated with the partition.
 
 The following examples illustrates how the forest IDs for the `spark-example-content` database can be used as batch
-identifiers. The custom code for returning URIs is then constrained to the value of `BATCH_ID` which will be a forest
-ID. Spark will invoke the custom code once for each batch identifier, with the returned batch of rows being immediately
+identifiers. The custom code for returning URIs is then constrained to the value of `PARTITION` which will be a forest
+ID. Spark will invoke the custom code once for each partition, with the returned batch of rows being immediately
 sent to the writer, which in this example are then printed to the console:
 
 ```
 stream = spark.readStream \
 .format("com.marklogic.spark") \
 .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003") \
-.option("spark.marklogic.read.batchIds.javascript", "xdmp.databaseForests(xdmp.database('spark-example-content'))") \
-.option("spark.marklogic.read.javascript", "cts.uris(null, null, cts.collectionQuery('employee'), null, [BATCH_ID]);") \
+.option("spark.marklogic.read.partitions.javascript", "xdmp.databaseForests(xdmp.database('spark-example-content'))") \
+.option("spark.marklogic.read.javascript", "cts.uris(null, null, cts.collectionQuery('employee'), null, [PARTITION]);") \
 .load() \
 .writeStream \
 .format("console") \
@@ -396,8 +398,8 @@ stream.stop()
 ```
 
 For a streaming use case, you may wish to ensure that every query runs
-[at the same point in time](https://docs.marklogic.com/guide/app-dev/point_in_time). Because you are free to construct
-any batch identifiers you wish, one technique for accomplishing this would be to construct batch identifiers
+[at the same point in time](https://docs.marklogic.com/guide/app-dev/point_in_time). Because you are free to return
+any partitions you wish, one technique for accomplishing this would be to construct partitions
 containing both a forest ID and a server timestamp:
 
 ```
@@ -406,7 +408,7 @@ const timestamp = xdmp.requestTimestamp()
 Sequence.from(forestIds.toArray().map(forestId => forestId + ":" + timestamp))
 ```
 
-In your custom code, you would then parse out the forest ID and server timestamp from each batch identifier and use
+In your custom code, you would then parse out the forest ID and server timestamp from each partition and use
 them accordingly in your queries. The MarkLogic documentation in the link above can provide more details and examples
 on how to perform point-in-time queries with server timestamps.
````
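The forest-ID-plus-timestamp technique described in the docs/reading.md diff above can be illustrated outside MarkLogic. A Python sketch of what the partition-defining code composes and what your custom read code would parse back out of the `PARTITION` variable (the forest IDs and timestamp below are made-up values, not real MarkLogic identifiers):

```python
def build_partitions(forest_ids, timestamp):
    """Mirrors the server-side JavaScript above: one 'forestId:timestamp' string per forest."""
    return [f"{forest_id}:{timestamp}" for forest_id in forest_ids]

def parse_partition(partition):
    """What custom read code would do with the PARTITION external variable."""
    forest_id, timestamp = partition.split(":")
    return forest_id, timestamp

# Hypothetical forest IDs and request timestamp.
partitions = build_partitions(["16581853926883828527", "1404385903121069923"], "10584400282279")
forest_id, timestamp = parse_partition(partitions[0])
```

Every partition carries the same timestamp, so each per-partition query can be run at the same point in time.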

docs/writing.md

Lines changed: 2 additions & 2 deletions

```diff
@@ -329,8 +329,8 @@ import tempfile
 stream = spark.readStream \
 .format("com.marklogic.spark") \
 .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003") \
-.option("spark.marklogic.read.batchIds.javascript", "xdmp.databaseForests(xdmp.database('spark-example-content'))") \
-.option("spark.marklogic.read.javascript", "cts.uris(null, ['limit=10'], cts.collectionQuery('employee'), null, [BATCH_ID]);") \
+.option("spark.marklogic.read.partitions.javascript", "xdmp.databaseForests(xdmp.database('spark-example-content'))") \
+.option("spark.marklogic.read.javascript", "cts.uris(null, ['limit=10'], cts.collectionQuery('employee'), null, [PARTITION]);") \
 .load() \
 .writeStream \
 .format("com.marklogic.spark") \
```

src/main/java/com/marklogic/spark/Options.java

Lines changed: 3 additions & 3 deletions

```diff
@@ -25,9 +25,9 @@ public interface Options {
     String READ_XQUERY = "spark.marklogic.read.xquery";
     String READ_VARS_PREFIX = "spark.marklogic.read.vars.";
 
-    String READ_BATCH_IDS_INVOKE = "spark.marklogic.read.batchIds.invoke";
-    String READ_BATCH_IDS_JAVASCRIPT = "spark.marklogic.read.batchIds.javascript";
-    String READ_BATCH_IDS_XQUERY = "spark.marklogic.read.batchIds.xquery";
+    String READ_PARTITIONS_INVOKE = "spark.marklogic.read.partitions.invoke";
+    String READ_PARTITIONS_JAVASCRIPT = "spark.marklogic.read.partitions.javascript";
+    String READ_PARTITIONS_XQUERY = "spark.marklogic.read.partitions.xquery";
 
     String READ_OPTIC_QUERY = "spark.marklogic.read.opticQuery";
     String READ_NUM_PARTITIONS = "spark.marklogic.read.numPartitions";
```

src/main/java/com/marklogic/spark/reader/CustomCodeBatch.java

Lines changed: 15 additions & 5 deletions

```diff
@@ -5,20 +5,30 @@
 import org.apache.spark.sql.connector.read.InputPartition;
 import org.apache.spark.sql.connector.read.PartitionReaderFactory;
 
+import java.util.List;
+
 class CustomCodeBatch implements Batch {
 
     private CustomCodeContext customCodeContext;
+    private List<String> partitions;
 
-    public CustomCodeBatch(CustomCodeContext customCodeContext) {
+    public CustomCodeBatch(CustomCodeContext customCodeContext, List<String> partitions) {
         this.customCodeContext = customCodeContext;
+        this.partitions = partitions;
     }
 
     @Override
     public InputPartition[] planInputPartitions() {
-        // We don't yet support partitioning a user's custom code. In the future, we may support this by passing along
-        // e.g. host and/or forest names, though the burden would then be on the user to utilize those correctly in
-        // their custom code.
-        return new InputPartition[]{new CustomCodePartition()};
+        InputPartition[] inputPartitions;
+        if (partitions != null && partitions.size() > 1) {
+            inputPartitions = new InputPartition[partitions.size()];
+            for (int i = 0; i < partitions.size(); i++) {
+                inputPartitions[i] = new CustomCodePartition(partitions.get(i));
+            }
+        } else {
+            inputPartitions = new InputPartition[]{new CustomCodePartition()};
+        }
+        return inputPartitions;
     }
 
     @Override
```
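The branching in the new `planInputPartitions` is simple to state on its own: with two or more user-defined partitions, one `InputPartition` is planned per entry; otherwise a single partition carrying no value is used. A Python sketch of that decision (an illustration, not connector code; `None` stands in for the no-argument `CustomCodePartition`):

```python
def plan_input_partitions(partitions):
    """Mirrors CustomCodeBatch.planInputPartitions from the diff above:
    returns one per-partition value when there are two or more user-defined
    partitions, otherwise a single valueless partition."""
    if partitions is not None and len(partitions) > 1:
        return list(partitions)
    return [None]
```

Note that a single-element list also falls into the `else` branch, so a lone partition value is not passed to the reader; this matches the `partitions.size() > 1` check in the Java code.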

src/main/java/com/marklogic/spark/reader/CustomCodeMicroBatchStream.java

Lines changed: 10 additions & 29 deletions

```diff
@@ -1,8 +1,6 @@
 package com.marklogic.spark.reader;
 
-import com.marklogic.client.DatabaseClient;
 import com.marklogic.spark.CustomCodeContext;
-import com.marklogic.spark.Options;
 import org.apache.spark.sql.connector.read.InputPartition;
 import org.apache.spark.sql.connector.read.PartitionReaderFactory;
 import org.apache.spark.sql.connector.read.streaming.MicroBatchStream;
@@ -11,63 +9,46 @@
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
-import java.util.ArrayList;
 import java.util.List;
 
 class CustomCodeMicroBatchStream implements MicroBatchStream {
 
     private final static Logger logger = LoggerFactory.getLogger(CustomCodeMicroBatchStream.class);
 
     private final CustomCodeContext customCodeContext;
-    private long batchIndex = 0;
-    private final List<String> batchIds = new ArrayList<>();
+    private final List<String> partitions;
+    private long partitionIndex = 0;
 
-    /**
-     * Invokes the user-defined option for retrieving batch IDs. The list of batch IDs is stored so that it can be
-     * iterated over via the methods in MicroBatchStream.
-     *
-     * @param customCodeContext
-     */
-    CustomCodeMicroBatchStream(CustomCodeContext customCodeContext) {
+    CustomCodeMicroBatchStream(CustomCodeContext customCodeContext, List<String> partitions) {
         this.customCodeContext = customCodeContext;
-        DatabaseClient client = this.customCodeContext.connectToMarkLogic();
-        try {
-            this.customCodeContext
-                .buildCall(client, new CustomCodeContext.CallOptions(
-                    Options.READ_BATCH_IDS_INVOKE, Options.READ_BATCH_IDS_JAVASCRIPT, Options.READ_BATCH_IDS_XQUERY
-                ))
-                .eval()
-                .forEach(result -> batchIds.add(result.getString()));
-        } finally {
-            client.release();
-        }
+        this.partitions = partitions;
     }
 
     /**
      * Invoked by Spark to get the next offset for which it should construct a reader; an offset for this class is
-     * equivalent to a batch ID.
+     * equivalent to a user-defined partition.
      *
      * @return
     */
     @Override
     public Offset latestOffset() {
-        Offset result = batchIndex >= batchIds.size() ? null : new LongOffset(batchIndex);
+        Offset result = partitionIndex >= partitions.size() ? null : new LongOffset(partitionIndex);
         if (logger.isTraceEnabled()) {
-            logger.trace("Returning latest offset: {}", batchIndex);
+            logger.trace("Returning latest offset: {}", partitionIndex);
         }
-        batchIndex++;
+        partitionIndex++;
         return result;
     }
 
     /**
      * @param start
      * @param end
-     * @return a partition associated with the latest batch ID, which is captured by the "end" offset.
+     * @return a partition associated with the latest partition, which is captured by the "end" offset.
      */
     @Override
     public InputPartition[] planInputPartitions(Offset start, Offset end) {
         long index = ((LongOffset) end).offset();
-        return new InputPartition[]{new CustomCodePartition(batchIds.get((int) index))};
+        return new InputPartition[]{new CustomCodePartition(partitions.get((int) index))};
     }
 
     @Override
```
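The micro-batch flow in the class above can be sketched end to end: Spark repeatedly calls `latestOffset`, and the stream hands back one offset per user-defined partition until the list is exhausted, at which point a null offset signals that no more micro-batches remain; `planInputPartitions` then turns the "end" offset back into the single partition value for that micro-batch. A Python illustration of that bookkeeping (not connector code; plain ints and `None` stand in for `LongOffset` and the null `Offset`):

```python
class PartitionOffsets:
    """Mirrors CustomCodeMicroBatchStream's offset bookkeeping from the diff above."""

    def __init__(self, partitions):
        self.partitions = partitions
        self.partition_index = 0

    def latest_offset(self):
        # None mirrors the null Offset returned once every partition has been handed out.
        result = None if self.partition_index >= len(self.partitions) else self.partition_index
        self.partition_index += 1
        return result

    def plan_input_partitions(self, end):
        # The "end" offset selects exactly one partition value for the micro-batch.
        return [self.partitions[end]]

stream = PartitionOffsets(["forest-a", "forest-b"])
offsets = [stream.latest_offset() for _ in range(3)]  # third call exhausts the list
```

Each non-null offset drives one micro-batch, so a two-partition stream yields exactly two reads before Spark sees the terminating null.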

src/main/java/com/marklogic/spark/reader/CustomCodePartition.java

Lines changed: 5 additions & 14 deletions

```diff
@@ -8,25 +8,16 @@ class CustomCodePartition implements InputPartition, Serializable {
 
     final static long serialVersionUID = 1;
 
-    private String batchId;
+    private String partition;
 
-    /**
-     * Constructor for normal reading, where all rows will be returned in a single call to MarkLogic by a single reader.
-     */
     public CustomCodePartition() {
     }
 
-    /**
-     * Constructor used for streaming reads, when a call is made to the reader (and thus to MarkLogic) for the given
-     * batch ID.
-     *
-     * @param batchId
-     */
-    public CustomCodePartition(String batchId) {
-        this.batchId = batchId;
+    public CustomCodePartition(String partition) {
+        this.partition = partition;
     }
 
-    public String getBatchId() {
-        return batchId;
+    public String getPartition() {
+        return partition;
     }
 }
```

src/main/java/com/marklogic/spark/reader/CustomCodePartitionReader.java

Lines changed: 3 additions & 4 deletions

```diff
@@ -19,16 +19,15 @@ class CustomCodePartitionReader implements PartitionReader {
     private final JsonRowDeserializer jsonRowDeserializer;
     private final DatabaseClient databaseClient;
 
-    public CustomCodePartitionReader(CustomCodeContext customCodeContext, String batchId) {
+    public CustomCodePartitionReader(CustomCodeContext customCodeContext, String partition) {
         this.databaseClient = customCodeContext.connectToMarkLogic();
         this.serverEvaluationCall = customCodeContext.buildCall(
             this.databaseClient,
             new CustomCodeContext.CallOptions(Options.READ_INVOKE, Options.READ_JAVASCRIPT, Options.READ_XQUERY)
         );
 
-        // For streaming support.
-        if (batchId != null && batchId.trim().length() > 0) {
-            this.serverEvaluationCall.addVariable("BATCH_ID", batchId);
+        if (partition != null) {
+            this.serverEvaluationCall.addVariable("PARTITION", partition);
         }
 
         this.isCustomSchema = customCodeContext.isCustomSchema();
```
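One subtle behavioral change in this file: the old code skipped the external variable for blank batch IDs, while the new code checks only for null, so an empty-string partition value would now be sent to MarkLogic as `PARTITION`. A Python sketch of the two guards, side by side (an illustration of the diff's conditions, not connector code):

```python
def old_guard(batch_id):
    """Pre-commit check: BATCH_ID was only added for non-null, non-blank values."""
    return batch_id is not None and len(batch_id.strip()) > 0

def new_guard(partition):
    """Post-commit check: any non-null partition is sent as the PARTITION variable."""
    return partition is not None
```

The guards differ only for blank strings, which the user-defined partition code would have to produce deliberately.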

src/main/java/com/marklogic/spark/reader/CustomCodePartitionReaderFactory.java

Lines changed: 1 addition & 1 deletion

```diff
@@ -16,6 +16,6 @@ public CustomCodePartitionReaderFactory(CustomCodeContext customCodeContext) {
 
     @Override
     public PartitionReader<InternalRow> createReader(InputPartition partition) {
-        return new CustomCodePartitionReader(customCodeContext, ((CustomCodePartition) partition).getBatchId());
+        return new CustomCodePartitionReader(customCodeContext, ((CustomCodePartition) partition).getPartition());
     }
 }
```
