Commit 033a9b5

Removed queryFormat and improved docs on performance
Also changed default number of partitions per forest to 4 based on the default setup of 3 forests per host and 32 app server threads.
1 parent: d6fe65a

6 files changed: +36 -29 lines changed

docs/configuration.md

Lines changed: 1 addition & 0 deletions

@@ -153,6 +153,7 @@ The following options control how the connector reads document rows from MarkLogic
 | spark.marklogic.read.documents.transform | Name of a [MarkLogic REST transform](https://docs.marklogic.com/guide/rest-dev/transforms) to apply to each matching document. |
 | spark.marklogic.read.documents.transformParams | Comma-delimited sequence of transform parameter names and values - e.g. `param1,value1,param2,value`. |
 | spark.marklogic.read.documents.transformParamsDelimiter | Delimiter for transform parameters; defaults to a comma. |
+| spark.marklogic.read.documents.partitionsPerForest | Number of Spark partition readers to create per forest; defaults to 4. |
 
 ## Write options

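For context, a minimal PySpark sketch of setting the new option during a read, assuming the connector's usual option names (`spark.marklogic.client.uri`, `spark.marklogic.read.documents.collections`); the connection details and collection are placeholders:

```python
# Hypothetical connection details; replace with your own MarkLogic app server.
df = spark.read.format("marklogic") \
    .option("spark.marklogic.client.uri", "user:password@localhost:8000") \
    .option("spark.marklogic.read.documents.collections", "example") \
    .option("spark.marklogic.read.documents.partitionsPerForest", 8) \
    .load()
df.show()
```
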
docs/reading-data/documents.md

Lines changed: 24 additions & 11 deletions

@@ -31,7 +31,9 @@ df = spark.read.format("marklogic") \
 df.show()
 ```
 
-You can also submit structured queries, serialized CTS queries, and combined queries via
+You can also submit a [structured query](https://docs.marklogic.com/guide/search-dev/structured-query#), a
+[serialized CTS query](https://docs.marklogic.com/guide/rest-dev/search#id_30577), or a
+[combined query](https://docs.marklogic.com/guide/rest-dev/search#id_69918) via
 `spark.marklogic.read.documents.query`, which can be combined with a string query as well:
 
 ```
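
With queryFormat removed, a hedged sketch of submitting a JSON structured query; the query string is borrowed from this commit's test suite, and the connection URI is a placeholder:

```python
# The connector now infers JSON vs. XML from the payload itself,
# so no queryFormat option is needed.
query = '{ "query": { "queries": [{ "term-query": { "text": [ "Moria" ] } }] } }'
df = spark.read.format("marklogic") \
    .option("spark.marklogic.client.uri", "user:password@localhost:8000") \
    .option("spark.marklogic.read.documents.query", query) \
    .load()
```
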
@@ -185,15 +187,26 @@ doc['Department']
 ## Understanding performance
 
 The connector mimics the behavior of the [MarkLogic Data Movement SDK](https://docs.marklogic.com/guide/java/data-movement)
-by creating a Spark partition per forest in the database associated with your REST API app server. Each partition reader
-will return all matching documents from its associated forest. The option `spark.marklogic.read.batchSize` controls how
-many documents will be returned in each call to MarkLogic; its value defaults to 500. For smaller documents,
-particularly those with 10 elements or fewer, you may find a batch size of 1,000 or even 10,000 to provide better
-performance.
-
-The `spark.marklogic.read.numPartitions` option does not impact performance when reading document rows, as 1 partition
-is always created for each forest. It is not possible for 2 or more partition readers to read from the same forest.
-
-You can adjust the level of parallelism by controlling how many threads Spark uses for executing partition reads.
+by creating Spark partition readers that are assigned to a specific forest. By default, the connector will create
+4 readers per forest. You can use the `spark.marklogic.read.documents.partitionsPerForest` option to control
+the number of readers. You should adjust this based on your cluster configuration. For example, a default REST API app
+server will have 32 server threads and 3 forests per host. 4 partition readers per forest will thus consume 12 of the 32
+server threads. If the app server is not servicing any other requests, performance will typically be improved by
+configuring 8 partitions per forest. Note that the `spark.marklogic.read.numPartitions` option does not have any impact;
+it is only used when reading via an Optic query.
+
+Each partition reader will make one or more calls to MarkLogic to retrieve documents. The
+`spark.marklogic.read.batchSize` option controls how many documents will be retrieved in each call. The value defaults
+to 500. For smaller documents, particularly those with 10 elements or fewer, you may find a batch size of 1,000 or
+even 10,000 to provide better performance.
+
+As an example, consider a query that matches 120,000 documents in a cluster with 3 hosts and 2 forests on each host.
+The connector will default to creating 24 partitions - 4 for each of the 6 forests. Each partition reader will read
+approximately 5,000 documents. With a default batch size of 500, each partition reader will make approximately 10
+calls to MarkLogic (these numbers are all approximate, as a forest may have slightly more or fewer than 20,000 documents).
+Depending on the size of the documents and whether the cluster is servicing other requests, performance may improve
+with more partition readers and a higher batch size.
+
+You can also adjust the level of parallelism by controlling how many threads Spark uses for executing partition reads.
 Please see your Spark distribution's documentation for further information.

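The arithmetic in the new example above can be sanity-checked in a few lines of Python; the figures mirror the documentation and are approximate, since forests rarely hold identical document counts:

```python
# Worked example from the docs: 120,000 matching documents,
# 3 hosts with 2 forests each, default connector settings.
forests = 3 * 2
partitions_per_forest = 4    # new default for document reads
batch_size = 500             # default for spark.marklogic.read.batchSize
matching_docs = 120_000

partitions = forests * partitions_per_forest            # 24 partition readers
docs_per_partition = matching_docs // partitions        # ~5,000 documents each
calls_per_partition = docs_per_partition // batch_size  # ~10 calls to MarkLogic
print(partitions, docs_per_partition, calls_per_partition)  # 24 5000 10
```
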
src/main/java/com/marklogic/spark/Options.java

Lines changed: 0 additions & 1 deletion

@@ -48,7 +48,6 @@ public abstract class Options {
     public static final String READ_DOCUMENTS_STRING_QUERY = "spark.marklogic.read.documents.stringQuery";
     // Corresponds to the complex query submitted via the request body at https://docs.marklogic.com/REST/POST/v1/search .
     public static final String READ_DOCUMENTS_QUERY = "spark.marklogic.read.documents.query";
-    public static final String READ_DOCUMENTS_QUERY_FORMAT = "spark.marklogic.read.documents.queryFormat";
     public static final String READ_DOCUMENTS_OPTIONS = "spark.marklogic.read.documents.options";
     public static final String READ_DOCUMENTS_DIRECTORY = "spark.marklogic.read.documents.directory";
     public static final String READ_DOCUMENTS_TRANSFORM = "spark.marklogic.read.documents.transform";

src/main/java/com/marklogic/spark/reader/document/DocumentContext.java

Lines changed: 1 addition & 2 deletions

@@ -51,7 +51,6 @@ SearchQueryDefinition buildSearchQuery(DatabaseClient client) {
         return new SearchQueryBuilder()
             .withStringQuery(props.get(Options.READ_DOCUMENTS_STRING_QUERY))
             .withQuery(props.get(Options.READ_DOCUMENTS_QUERY))
-            .withQueryFormat(props.get(Options.READ_DOCUMENTS_QUERY_FORMAT))
             .withCollections(props.get(Options.READ_DOCUMENTS_COLLECTIONS))
             .withDirectory(props.get(Options.READ_DOCUMENTS_DIRECTORY))
             .withOptionsName(props.get(Options.READ_DOCUMENTS_OPTIONS))
@@ -70,7 +69,7 @@ int getBatchSize() {
     }
 
     int getPartitionsPerForest() {
-        int defaultPartitionsPerForest = 2;
+        int defaultPartitionsPerForest = 4;
         return (int) getNumericOption(Options.READ_DOCUMENTS_PARTITIONS_PER_FOREST, defaultPartitionsPerForest, 1);
     }
 }

src/main/java/com/marklogic/spark/reader/document/SearchQueryBuilder.java

Lines changed: 10 additions & 12 deletions

@@ -14,7 +14,6 @@ public class SearchQueryBuilder {
 
     private String stringQuery;
     private String query;
-    private Format queryFormat;
     private String[] collections;
     private String directory;
     private String optionsName;
@@ -51,13 +50,6 @@ public SearchQueryBuilder withQuery(String query) {
         return this;
     }
 
-    public SearchQueryBuilder withQueryFormat(String format) {
-        if (format != null) {
-            this.queryFormat = Format.valueOf(format.toUpperCase());
-        }
-        return this;
-    }
-
     public SearchQueryBuilder withCollections(String value) {
         if (value != null) {
             this.collections = value.split(",");
@@ -92,13 +84,15 @@ public SearchQueryBuilder withTransformParamsDelimiter(String delimiter) {
 
     private QueryDefinition buildQueryDefinition(DatabaseClient client) {
         final QueryManager queryManager = client.newQueryManager();
-        // The Java Client misleadingly suggests a distinction amongst the 3 complex queries - structured,
-        // serialized CTS, and combined - but the REST API does not.
         if (query != null) {
             StringHandle queryHandle = new StringHandle(query);
-            if (queryFormat != null) {
-                queryHandle.withFormat(queryFormat);
+            // v1/search assumes XML by default, so only need to set to JSON if the query is JSON.
+            if (queryIsJSON()) {
+                queryHandle.withFormat(Format.JSON);
             }
+            // The Java Client misleadingly suggests a distinction amongst the 3 complex queries - structured,
+            // serialized CTS, and combined - but the REST API does not. Thus, a RawStructuredQueryDefinition will work
+            // for any of the 3 query types.
             RawStructuredQueryDefinition queryDefinition = queryManager.newRawStructuredQueryDefinition(queryHandle);
             if (stringQuery != null && stringQuery.length() > 0) {
                 queryDefinition.withCriteria(stringQuery);
@@ -112,6 +106,10 @@ private QueryDefinition buildQueryDefinition(DatabaseClient client) {
         return queryDefinition;
     }
 
+    private boolean queryIsJSON() {
+        return query != null && query.trim().startsWith("{");
+    }
+
     private void applyCommonQueryConfig(QueryDefinition queryDefinition) {
         if (optionsName != null && optionsName.trim().length() > 0) {
             queryDefinition.setOptionsName(optionsName);
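
The format-inference heuristic that replaces queryFormat is small enough to restate as a Python sketch (mirroring `queryIsJSON` above); the XML example string is illustrative:

```python
# A query payload is treated as JSON when it starts with "{";
# otherwise v1/search assumes XML.
def query_is_json(query):
    return query is not None and query.strip().startswith("{")

assert query_is_json('{ "query": { "queries": [] } }')
assert not query_is_json('<cts:word-query xmlns:cts="http://marklogic.com/cts"/>')
```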

src/test/java/com/marklogic/spark/reader/document/ReadDocumentRowsTest.java

Lines changed: 0 additions & 3 deletions

@@ -136,7 +136,6 @@ void structuredQueryJSON() {
         String query = "{ \"query\": { \"queries\": [{ \"term-query\": { \"text\": [ \"Moria\" ] } }] } }";
         List<Row> rows = startRead()
             .option(Options.READ_DOCUMENTS_QUERY, query)
-            .option(Options.READ_DOCUMENTS_QUERY_FORMAT, "jsON")
             .load()
             .collectAsList();
 
@@ -164,7 +163,6 @@ void serializedCTSQueryJSON() {
 
         List<Row> rows = startRead()
             .option(Options.READ_DOCUMENTS_QUERY, query)
-            .option(Options.READ_DOCUMENTS_QUERY_FORMAT, "JSON")
             .load()
             .collectAsList();
 
@@ -197,7 +195,6 @@ void combinedQueryJSON() {
 
         List<Row> rows = startRead()
             .option(Options.READ_DOCUMENTS_QUERY, combinedQuery.toString())
-            .option(Options.READ_DOCUMENTS_QUERY_FORMAT, "json")
             .load()
             .collectAsList();
 