
Commit e45ed4d

Merge pull request #75 from marklogic/feature/488-batch-size
DEVEXP-488 Setting batch size to zero when pushing down aggregate
2 parents: 0665855 + a1c9c4e; commit: e45ed4d

12 files changed: 169 additions & 65 deletions

docs/configuration.md

Lines changed: 2 additions & 2 deletions

@@ -85,8 +85,8 @@ information on how data is read from MarkLogic.
 | Option | Description |
 | --- |---------------------------------------------------------------------------------------------------|
 | spark.marklogic.read.opticQuery | Required; the Optic DSL query to run for retrieving rows; must use `op.fromView` as the accessor. |
-| spark.marklogic.read.numPartitions | The number of Spark partitions to create; defaults to `spark.default.parallelism` . |
-| spark.marklogic.read.batchSize | Approximate number of rows to retrieve in each call to MarkLogic; defaults to 10000. |
+| spark.marklogic.read.numPartitions | The number of Spark partitions to create; defaults to `spark.default.parallelism`. |
+| spark.marklogic.read.batchSize | Approximate number of rows to retrieve in each call to MarkLogic; defaults to 100000. |
 | spark.marklogic.read.pushDownAggregates | Whether to push down aggregate operations to MarkLogic; defaults to `true`. Set to `false` to prevent aggregates from being pushed down to MarkLogic. |
 ## Write options

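For orientation, a brief PySpark sketch (not part of this commit) of how the read options from the table above could be set together; the option names come from the table, while the numeric values, query, and connection string are illustrative assumptions borrowed from the examples in docs/reading.md.

```
# Illustrative sketch only (not part of this commit): option names are from the
# table above; the values, query, and connection string are assumptions.
df = spark.read.format("com.marklogic.spark") \
    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8020") \
    .option("spark.marklogic.read.opticQuery", "op.fromView('example', 'employee')") \
    .option("spark.marklogic.read.numPartitions", 4) \
    .option("spark.marklogic.read.batchSize", 100000) \
    .option("spark.marklogic.read.pushDownAggregates", "true") \
    .load()
```
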
docs/reading.md

Lines changed: 46 additions & 24 deletions

@@ -8,7 +8,23 @@ The MarkLogic Spark connector allows for data to be retrieved from MarkLogic as
 [Optic DSL query](https://docs.marklogic.com/guide/app-dev/OpticAPI#id_46710). The
 sections below provide more detail on configuring how data is retrieved and converted into a Spark DataFrame.

-## Query requirements
+## Basic read operation
+
+As shown in the [Getting Started with PySpark guide](getting-started/pyspark.md), a basic read operation will define
+how the connector should connect to MarkLogic, the MarkLogic Optic query to run, and zero or more other options:
+
+```
+df = spark.read.format("com.marklogic.spark") \
+    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8020") \
+    .option("spark.marklogic.read.opticQuery", "op.fromView('example', 'employee')") \
+    .load()
+```
+
+As shown above, `format`, `spark.marklogic.client.uri` (or the other `spark.marklogic.client` options
+that can be used to define the connection details), and `spark.marklogic.read.opticQuery` are always required. The
+following sections provide more details about these and other options that can be set.
+
+## Optic query requirements

 As of the 2.0 release of the connector, the Optic query must use the
 [op.fromView](https://docs.marklogic.com/op.fromView) accessor function. The query must also adhere to the

@@ -87,7 +103,7 @@ stream.stop()
 Micro-batches are constructed based on the number of partitions and user-defined batch size; more information on each
 setting can be found in section below on tuning performance. Each request to MarkLogic that is made in "batch read"
 mode - i.e. when using Spark's `read` function instead of `readStream` - corresponds to a micro-batch when reading
-data via a stream. In the example above, which uses the connector's default batch size of 10,000 rows and 2
+data via a stream. In the example above, which uses the connector's default batch size of 100,000 rows and 2
 partitions, 2 calls are made to MarkLogic, resulting in two micro-batches.

 The number of micro-batches can be determined by enabling info-level logging and looking for a message similar to:

@@ -169,40 +185,46 @@ correct result, please [file an issue with this project](https://github.com/mark

 ## Tuning performance

-The primary factor affecting how quickly the connector can retrieve rows is MarkLogic's ability to process your Optic
-query. The
-[MarkLogic Optic performance documentation](https://docs.marklogic.com/guide/app-dev/OpticAPI#id_91398) can help with
-optimizing your query to maximize performance.
+The primary factor affecting connector performance when reading rows is how many requests are made to MarkLogic. In
+general, performance will be best when minimizing the number of requests to MarkLogic while ensuring that no single
+request attempts to return or process too much data.

-Two [configuration options](configuration.md) in the connector will also impact performance. First, the
+Two [configuration options](configuration.md) control how many requests are made. First, the
 `spark.marklogic.read.numPartitions` option controls how many partitions are created. For each partition, Spark
 will use a separate task to send requests to MarkLogic to retrieve rows matching your Optic DSL query. Second, the
 `spark.marklogic.read.batchSize` option controls approximately how many rows will be retrieved in each call to
 MarkLogic.

-These two options impact each other in terms of how many tasks are used to make requests to MarkLogic. For example,
-consider an Optic query that matches 1 million rows in MarkLogic, a partition count of 10, and a batch size of
-10,000 rows (the default value). This configuration will result in the connector creating 10 Spark partition readers,
-each of which will retrieve approximately 100,000 unique rows. And with a batch size of 10,000, each partition
+To understand how these options control the number of requests to MarkLogic,
+consider an Optic query that matches 10 million rows in MarkLogic, a partition count of 10, and a batch size of
+100,000 rows (the default value). This configuration will result in the connector creating 10 Spark partition readers,
+each of which will retrieve approximately 1,000,000 unique rows. And with a batch size of 100,000, each partition
 reader will make approximately 10 calls to MarkLogic to retrieve these rows, for a total of 100 calls across all
-partitions.
+partitions.

-Performance can thus be tested by varying the number of partitions and the batch size. In general, increasing the
-number of partitions should help performance as the number of matching rows increases. A single partition may suffice
-for a query that returns thousands of rows or fewer, while a query that returns hundreds of millions of rows will
-benefit from dozens of partitions or more. The ideal settings will depend on your Spark and MarkLogic environments
-along with the complexity of your Optic query. Testing should be performed with different queries, partition counts,
-and batch sizes to determine the optimal settings.
+Performance should be tested by varying the number of partitions and the batch size. In general, increasing the
+number of partitions should help performance as the number of rows to return increases. Determining the optimal batch
+size depends both on the number of columns in each returned row and what kind of Spark operations are being invoked.
+The next section describes both how the connector tries to optimize performance when an aggregation is performed
+and when the same kind of optimization should be made when not many rows need to be returned.

 ### Optimizing for smaller result sets

 If your Optic query matches a set of rows whose count is a small percentage of the total number of rows in
-the view that the query runs against, you may find improved performance by setting `spark.marklogic.read.batchSize`
-to zero. Doing so ensures that for each partition, a single request is sent to MarkLogic.
-
-If the result set matching your query is particularly small - such as thousands of rows or less, or possibly tens of
-thousands of rows or less - you may find optimal performance by also setting `spark.marklogic.read.numPartitions` to
-one. This will result in the connector sending a single request to MarkLogic.
+the view that the query runs against, you should find improved performance by setting `spark.marklogic.read.batchSize`
+to zero. This setting ensures that for each partition, a single request is sent to MarkLogic.
+
+If your Spark program includes an aggregation that the connector can push down to MarkLogic, then the connector will
+automatically use a batch size of zero unless you specify a different value for `spark.marklogic.read.batchSize`. This
+optimization should typically be desirable when calculating an aggregation, as MarkLogic will return far fewer rows
+per request depending on the type of aggregation.
+
+If the result set matching your query is particularly small - such as tens of thousands of rows or less, or possibly
+hundreds of thousands of rows or less - you may find optimal performance by setting
+`spark.marklogic.read.numPartitions` to one. This will result in the connector sending a single request to MarkLogic.
+The effectiveness of this approach can be evaluated by executing the Optic query via
+[MarkLogic's qconsole application](https://docs.marklogic.com/guide/qconsole/intro), which will execute the query in
+a single request as well.

 ### More detail on partitions

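To make the tuning arithmetic above concrete, here is a small Python sketch (not part of this commit) that reproduces the worked example from the new text and then shows the batch-size-zero configuration suggested for smaller result sets; the query and connection string are reused from the doc's own examples and are illustrative only.

```
import math

# Worked example from the tuning section above: 10 million matching rows,
# 10 partitions, and the new default batch size of 100,000 rows.
matching_rows = 10_000_000
num_partitions = 10
batch_size = 100_000

rows_per_partition = matching_rows // num_partitions               # 1,000,000
calls_per_partition = math.ceil(rows_per_partition / batch_size)   # 10
print(num_partitions * calls_per_partition)                        # ~100 calls in total

# For a small result set, the doc suggests a batch size of zero (one request per
# partition) and a single partition (one request overall). Illustrative values only.
df = spark.read.format("com.marklogic.spark") \
    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8020") \
    .option("spark.marklogic.read.opticQuery", "op.fromView('example', 'employee')") \
    .option("spark.marklogic.read.numPartitions", 1) \
    .option("spark.marklogic.read.batchSize", 0) \
    .load()
```
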
docs/writing.md

Lines changed: 22 additions & 2 deletions

@@ -4,8 +4,28 @@ title: Writing Data
 nav_order: 4
 ---

-The MarkLogic Spark connector allows for writing rows in a Spark DataFrame to MarkLogic as documents. The sections below
-provide more detail about how this process works and how it can be controlled.
+The MarkLogic Spark connector allows for writing rows in a Spark DataFrame to MarkLogic as documents.
+The sections below provide more detail about how this process works and how it can be controlled.
+
+## Basic write operation
+
+As shown in the [Getting Started with PySpark guide](getting-started/pyspark.md), a basic write operation will define
+how the connector should connect to MarkLogic, the Spark mode to use, and zero or more other options:
+
+```
+df.write.format("com.marklogic.spark") \
+    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8020") \
+    .option("spark.marklogic.write.collections", "write-test") \
+    .option("spark.marklogic.write.permissions", "rest-reader,read,rest-writer,update") \
+    .option("spark.marklogic.write.uriPrefix", "/write/") \
+    .mode("append") \
+    .save()
+```
+
+In the above example, only `format`, `spark.marklogic.client.uri` (or the other `spark.marklogic.client` options
+that can be used to define the connection details), and `mode` (which must equal "append") are required;
+the collections, permissions, and URI prefix are optional, though it is uncommon to write documents without any
+permissions.

 ## Controlling document content


src/main/java/com/marklogic/spark/reader/MarkLogicMicroBatchStream.java

Lines changed: 1 addition & 1 deletion

@@ -71,7 +71,7 @@ public InputPartition[] planInputPartitions(Offset start, Offset end) {
         int index = (int) ((LongOffset) end).offset();
         return index >= allBuckets.size() ?
             null :
-            new InputPartition[]{new PlanAnalysis.Partition(index, allBuckets.get(index))};
+            new InputPartition[]{new PlanAnalysis.Partition(index + "", allBuckets.get(index))};
     }

     @Override

src/main/java/com/marklogic/spark/reader/PlanAnalysis.java

Lines changed: 15 additions & 19 deletions

@@ -42,18 +42,6 @@ class PlanAnalysis implements Serializable {
         this.partitions = partitions;
     }

-    /**
-     * Copy constructor for creating a new plan analysis with the given plan and a single bucket. Used for pushing down
-     * aggregate operations that can be efficiently calculated by MarkLogic in a single request.
-     *
-     * @param boundedPlan
-     */
-    PlanAnalysis(JsonNode boundedPlan) {
-        this.boundedPlan = boundedPlan;
-        final String maxUnsignedLong = "18446744073709551615";
-        this.partitions = Arrays.asList(new Partition(0, new Bucket("0", maxUnsignedLong)));
-    }
-
     List<Bucket> getAllBuckets() {
         List<PlanAnalysis.Bucket> allBuckets = new ArrayList<>();
         partitions.forEach(partition -> allBuckets.addAll(partition.buckets));

@@ -95,16 +83,24 @@ static class Partition implements InputPartition, Serializable {
             }
         }

+        Partition(String identifier, Bucket bucket) {
+            this.identifier = identifier;
+            this.buckets = bucket != null ? Arrays.asList(bucket) : new ArrayList<>();
+        }
+
         /**
-         * For micro-batch reading, where each Spark task is intended to process a single bucket, and thus each
-         * partition should contain a single bucket.
+         * Similar to a copy constructor; used to construct a new Partition with a single bucket based on the
+         * buckets in the given Partition.
          *
-         * @param bucketIndex
-         * @param bucket
+         * @return
          */
-        Partition(int bucketIndex, Bucket bucket) {
-            this.identifier = bucketIndex + "";
-            this.buckets = Arrays.asList(bucket);
+        Partition mergeBuckets() {
+            if (buckets == null || buckets.isEmpty()) {
+                return new Partition(identifier, null);
+            }
+            String lowerBound = buckets.get(0).lowerBound;
+            String upperBound = buckets.get(buckets.size() - 1).upperBound;
+            return new Partition(identifier, new Bucket(lowerBound, upperBound));
         }

         @Override

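The new `mergeBuckets()` method collapses a partition's ordered buckets into one bucket spanning the first lower bound and the last upper bound. A minimal Python sketch of that idea (not the connector's Java implementation; names are illustrative):

```
# Minimal sketch of the bucket-merging idea; not the connector's Java code.
def merge_buckets(buckets):
    """buckets: ordered list of (lower_bound, upper_bound) string tuples."""
    if not buckets:
        return None
    # Keep the first bucket's lower bound and the last bucket's upper bound.
    return (buckets[0][0], buckets[-1][1])

# Mirrors the threeBuckets case in MergeBucketsTest later in this commit.
print(merge_buckets([("0", "500"), ("501", "1001"), ("1002", "1500")]))  # ('0', '1500')
```
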
src/main/java/com/marklogic/spark/reader/ReadContext.java

Lines changed: 15 additions & 1 deletion

@@ -63,7 +63,12 @@ public class ReadContext extends ContextSupport {
     final static long serialVersionUID = 1;

     private final static Logger logger = LoggerFactory.getLogger(ReadContext.class);
-    private final static long DEFAULT_BATCH_SIZE = 10000;
+
+    // The ideal batch size depends highly on what a user chooses to do after a load() - and of course the user may
+    // choose to perform multiple operations on the dataset, each of which may benefit from a fairly different batch
+    // size. 100k has been chosen as the default batch size to strike a reasonable balance for operations that do need
+    // to collect all the rows, such as writing the dataset to another data source.
+    private final static long DEFAULT_BATCH_SIZE = 100000;

     private PlanAnalysis planAnalysis;
     private StructType schema;

@@ -200,6 +205,15 @@ void pushDownAggregation(Aggregation aggregation) {
             }
         }

+        if (!getProperties().containsKey(Options.READ_BATCH_SIZE)) {
+            logger.info("Batch size was not overridden, so modifying each partition to make a single request to improve " +
+                "performance of pushed down aggregation.");
+            List<PlanAnalysis.Partition> mergedPartitions = planAnalysis.partitions.stream()
+                .map(p -> p.mergeBuckets())
+                .collect(Collectors.toList());
+            this.planAnalysis = new PlanAnalysis(planAnalysis.boundedPlan, mergedPartitions);
+        }
+
        this.schema = newSchema;
    }

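From the PySpark side, the new branch above only takes effect when `spark.marklogic.read.batchSize` is not set and an aggregation is pushed down. A hedged sketch of a program that could exercise it; the `Department` column and the grouped count are assumptions for illustration, not defined by this commit.

```
# Illustrative only: "Department" is a hypothetical column, and whether this exact
# aggregation is pushed down depends on the connector and query; see docs/reading.md.
df = spark.read.format("com.marklogic.spark") \
    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8020") \
    .option("spark.marklogic.read.opticQuery", "op.fromView('example', 'employee')") \
    .load()

# Because spark.marklogic.read.batchSize is not set, a pushed-down aggregation now
# causes each partition's buckets to be merged so that each partition makes a
# single request to MarkLogic (the info-level log message above should appear).
df.groupBy("Department").count().show()
```
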
src/test/java/com/marklogic/spark/reader/AbstractPushDownTest.java

Lines changed: 3 additions & 3 deletions

@@ -42,9 +42,9 @@ void setup() {
     protected DataFrameReader newDefaultReader(SparkSession session) {
         return super.newDefaultReader(session)
             // Default to a single call to MarkLogic for push down tests to ensure that assertions on row counts are
-            // accurate. Any tests that care about having more than one partition are expected to override this.
-            .option(Options.READ_NUM_PARTITIONS, 1)
-            .option(Options.READ_BATCH_SIZE, 0);
+            // accurate (and via DEVEXP-488, the batch size is expected to be set to zero when an aggregate is pushed
+            // down). Any tests that care about having more than one partition are expected to override this.
+            .option(Options.READ_NUM_PARTITIONS, 1);
     }

     private synchronized void addToRowCount(long totalRowCount) {
src/test/java/com/marklogic/spark/reader/MergeBucketsTest.java

Lines changed: 42 additions & 0 deletions

@@ -0,0 +1,42 @@
+package com.marklogic.spark.reader;
+
+import org.junit.jupiter.api.Test;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+
+public class MergeBucketsTest {
+
+    @Test
+    void threeBuckets() {
+        PlanAnalysis.Partition p = new PlanAnalysis.Partition(1, 0, 1500, 3, 1500);
+
+        assertEquals(3, p.buckets.size());
+        assertEquals("0", p.buckets.get(0).lowerBound);
+        assertEquals("500", p.buckets.get(0).upperBound);
+        assertEquals("501", p.buckets.get(1).lowerBound);
+        assertEquals("1001", p.buckets.get(1).upperBound);
+        assertEquals("1002", p.buckets.get(2).lowerBound);
+        assertEquals("1500", p.buckets.get(2).upperBound);
+
+        PlanAnalysis.Partition p2 = p.mergeBuckets();
+
+        assertEquals(1, p2.buckets.size());
+        assertEquals("0", p2.buckets.get(0).lowerBound);
+        assertEquals("1500", p2.buckets.get(0).upperBound);
+    }
+
+    @Test
+    void oneBucket() {
+        PlanAnalysis.Partition p = new PlanAnalysis.Partition(1, 0, 1000, 1, 1000);
+
+        assertEquals(1, p.buckets.size());
+        assertEquals("0", p.buckets.get(0).lowerBound);
+        assertEquals("1000", p.buckets.get(0).upperBound);
+
+        PlanAnalysis.Partition p2 = p.mergeBuckets();
+
+        assertEquals(1, p2.buckets.size());
+        assertEquals("0", p2.buckets.get(0).lowerBound);
+        assertEquals("1000", p2.buckets.get(0).upperBound);
+    }
+}

src/test/java/com/marklogic/spark/reader/PushDownFilterTest.java

Lines changed: 0 additions & 4 deletions

@@ -250,10 +250,6 @@ void stringEndsWithNoMatch() {
     private Dataset<Row> newDataset() {
         return newDefaultReader()
             .option(Options.READ_OPTIC_QUERY, QUERY_WITH_NO_QUALIFIER)
-            // Use a single call to MarkLogic so it's easier to verify from the logging
-            // that only N rows were returned.
-            .option(Options.READ_NUM_PARTITIONS, 1)
-            .option(Options.READ_BATCH_SIZE, 0)
             .load();
     }

src/test/java/com/marklogic/spark/reader/PushDownFilterValueTypesTest.java

Lines changed: 0 additions & 2 deletions

@@ -88,8 +88,6 @@ private void verifyOneRowReturned(String filter) {
     private Dataset<Row> newDataset() {
         return newDefaultReader()
             .option(Options.READ_OPTIC_QUERY, "op.fromView('sparkTest', 'allTypes', '')")
-            .option(Options.READ_NUM_PARTITIONS, 1)
-            .option(Options.READ_BATCH_SIZE, 0)
             .load();
     }
