Commit affa98f

Merge pull request #374 from marklogic/feature/optic-query-2
Can now execute non-fromView queries
2 parents 75006f2 + 389e78d commit affa98f

File tree: 10 files changed (+258, -107 lines)

docs/reading-data/optic.md

Lines changed: 27 additions & 17 deletions
@@ -59,27 +59,32 @@ query expansion via [a thesaurus](https://docs.marklogic.com/guide/search-dev/th

 ## Optic query requirements

-As of the 2.0.0 release of the connector, the Optic query must use the
-[op.fromView](https://docs.marklogic.com/op.fromView) accessor function. Future releases of both the connector and
-MarkLogic will strive to relax this requirement.
-
-In addition, calls to `groupBy`, `orderBy`, `limit`, and `offset` should be performed via Spark instead of within
-the initial Optic query. A key benefit of Spark and the MarkLogic connector is the ability to execute the query in
-parallel via multiple Spark partitions. The aforementioned calls, if made in the Optic query, may not produce the
-expected results if more than one Spark partition is used or if more than one request is made to MarkLogic. The
-equivalent Spark operations should be called instead, or the connector should be configured to make a single request
-to MarkLogic. See the "Pushing down operations" and "Tuning performance" sections below for more information.
-
-Finally, the query must adhere to the handful of limitations imposed by the
+**Starting with the 2.5.0 release**, an Optic query can use any
+[data access function](https://docs.marklogic.com/guide/app-dev/OpticAPI#id_66011) with one caveat - only Optic
+queries that use `op.fromView` can be partitioned into multiple calls to MarkLogic. Optic queries that use any other
+data access function have the following constraints:
+
+1. The connector will execute the query in a single call to MarkLogic. You will therefore need to ensure that the
+call can complete without timing out.
+2. The connector requires that the MarkLogic user have the necessary privileges to invoke the
+[MarkLogic eval endpoint](https://docs.marklogic.com/REST/POST/v1/eval) along with the `xdmp-invoke` privilege.
+
+**Prior to the 2.5.0 release**, the Optic query must use the
+[op.fromView](https://docs.marklogic.com/op.fromView) accessor function. In addition, calls to `groupBy`, `orderBy`, `limit`, and `offset` should be
+performed via Spark instead of within the initial Optic query. As the connector will partition `op.fromView` queries
+into multiple calls to MarkLogic, the aforementioned calls will likely not produce the expected results when more
+than one request is made to MarkLogic. See the "Pushing down operations" and "Tuning performance" sections below for
+more information.
+
+Finally, regardless of the Optic data access function you use, the query must adhere to the handful of limitations imposed by the
 [Optic Query DSL](https://docs.marklogic.com/guide/app-dev/OpticAPI#id_46710). A good practice in validating a
 query is to run it in your [MarkLogic server's qconsole tool](https://docs.marklogic.com/guide/qconsole) in a buffer
 with a query type of "Optic DSL".

 ## Schema inference

-The connector will infer a Spark schema automatically based on the view identified by `op.fromView` in
-the Optic query. Each column returned by your Optic query will be mapped to a Spark schema column with the
-same name and an appropriate type.
+The connector will infer a Spark schema automatically based on your Optic query. Each column returned by your Optic query
+will be mapped to a Spark schema column with the same name and an appropriate type.

 You may override this feature and provide your own schema instead. The example below shows how a custom schema can
 be provided within PySpark; this assumes that you have deployed the application in the
@@ -97,8 +102,9 @@ df.show()

 ## Accessing documents

-While the connector requires that an Optic query use `op.fromView` as its accessor function, documents can still be
-retrieved via the [Optic functions for joining documents](https://docs.marklogic.com/guide/app-dev/OpticAPI#id_78437).
+If your Optic query uses the `op.fromView` accessor function, documents can still be
+retrieved via the [Optic functions for joining documents](https://docs.marklogic.com/guide/app-dev/OpticAPI#id_78437). Starting with the 2.5.0 release, you can simply use
+`op.fromSearchDocs` instead, but only if your query can be executed in a single call to MarkLogic without timing out.

 For example, the following query will find all matching rows and then retrieve the documents and URIs associated with
 those rows:
@@ -216,6 +222,10 @@ correct result, please [file an issue with this project](https://github.com/mark

 ## Tuning performance

+If you are using the 2.5.0 connector or later along with an Optic query that does not use the `op.fromView` data
+access function, you can ignore this section. The performance of your query will be strictly based on the Optic query
+itself, which the connector does not impact.
+
 The primary factor affecting connector performance when reading rows is how many requests are made to MarkLogic. In
 general, performance will be best when minimizing the number of requests to MarkLogic while ensuring that no single
 request attempts to return or process too much data.
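
The documentation changes above lend themselves to a few short usage sketches. The examples that follow are PySpark, matching the convention the doc itself uses for examples; the data source name, connection details, and sample queries are illustrative assumptions rather than text from this commit, with only the `spark.marklogic.read.*` option prefix taken from the connector code further down. First, a minimal sketch of a non-fromView read, which as of 2.5.0 runs as a single call to MarkLogic:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("non-fromview-read").getOrCreate()

# A non-fromView data access function such as op.fromSearchDocs is executed in a single
# request, so the query must finish before the request times out and the MarkLogic user
# needs the eval-endpoint privileges described above. The option names below, other than
# the "spark.marklogic.read." prefix, are assumptions; check the connector documentation.
df = (
    spark.read.format("marklogic")  # assumed data source name
    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003")  # assumed connection option
    .option("spark.marklogic.read.opticQuery", "op.fromSearchDocs(cts.wordQuery('Engineering'))")
    .load()
)
df.show()
```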
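
For the schema inference change, the connector still lets you supply your own schema instead of the one it infers from your Optic query. A sketch of that override, using hypothetical column names (the doc's full PySpark example is not shown in this diff):

```python
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# Hypothetical columns for illustration; use the column names your Optic query returns.
custom_schema = StructType([
    StructField("CitationID", IntegerType()),
    StructField("LastName", StringType()),
])

df = (
    spark.read.format("marklogic")  # assumed data source name, as above
    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003")  # assumed
    .option("spark.marklogic.read.opticQuery", "op.fromView('Medical', 'Authors')")  # illustrative view
    .schema(custom_schema)  # overrides the schema the connector would otherwise infer
    .load()
)
```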
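
Finally, the "Tuning performance" addition only matters for partitioned `op.fromView` reads, where the number of requests made to MarkLogic is the main performance lever. A sketch of such a read, again with assumed option names for the partition count and batch size (they correspond to the READ_NUM_PARTITIONS and READ_BATCH_SIZE options used in OpticReadContext below), followed by Spark-side operations the connector can push down:

```python
df = (
    spark.read.format("marklogic")  # assumed data source name, as above
    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003")  # assumed
    .option("spark.marklogic.read.opticQuery", "op.fromView('example', 'employee')")  # illustrative view
    .option("spark.marklogic.read.numPartitions", "4")  # assumed option name
    .option("spark.marklogic.read.batchSize", "100000")  # assumed option name; 100k is the default in the code below
    .load()
)

# Express groupBy/orderBy/limit in Spark rather than in the Optic DSL so the connector can
# push them down across multiple requests (see the pushDown* methods in OpticReadContext below).
df.groupBy("department").count().orderBy("department").limit(10).show()  # "department" is a hypothetical column
```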

marklogic-spark-connector/src/main/java/com/marklogic/spark/ContextSupport.java

Lines changed: 0 additions & 3 deletions
@@ -7,8 +7,6 @@
 import com.marklogic.client.DatabaseClientFactory;
 import com.marklogic.client.document.DocumentManager;
 import com.marklogic.client.extra.okhttpclient.OkHttpClientConfigurator;
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;

 import java.io.Serializable;
 import java.util.HashMap;
@@ -19,7 +17,6 @@

 public class ContextSupport extends Context implements Serializable {

-    protected static final Logger logger = LoggerFactory.getLogger(ContextSupport.class);
     private final boolean configuratorWasAdded;

     // Java Client 6.5.0 has a bug in it (to be fixed in 6.5.1 or 6.6.0) where multiple threads that use a configurator

marklogic-spark-connector/src/main/java/com/marklogic/spark/reader/optic/OpticReadContext.java

Lines changed: 51 additions & 65 deletions
@@ -4,15 +4,11 @@
 package com.marklogic.spark.reader.optic;

 import com.fasterxml.jackson.databind.JsonNode;
-import com.fasterxml.jackson.databind.node.ArrayNode;
-import com.fasterxml.jackson.databind.node.ObjectNode;
 import com.marklogic.client.DatabaseClient;
 import com.marklogic.client.FailedRequestException;
 import com.marklogic.client.expression.PlanBuilder;
 import com.marklogic.client.impl.DatabaseClientImpl;
 import com.marklogic.client.io.JacksonHandle;
-import com.marklogic.client.io.StringHandle;
-import com.marklogic.client.row.RawQueryDSLPlan;
 import com.marklogic.client.row.RowManager;
 import com.marklogic.spark.ConnectorException;
 import com.marklogic.spark.ContextSupport;
@@ -25,8 +21,6 @@
 import org.apache.spark.sql.types.DataTypes;
 import org.apache.spark.sql.types.StructField;
 import org.apache.spark.sql.types.StructType;
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;

 import java.util.*;
 import java.util.stream.Collectors;
@@ -41,8 +35,6 @@ public class OpticReadContext extends ContextSupport {

     static final long serialVersionUID = 1;

-    private static final Logger logger = LoggerFactory.getLogger(OpticReadContext.class);
-
     // The ideal batch size depends highly on what a user chooses to do after a load() - and of course the user may
     // choose to perform multiple operations on the dataset, each of which may benefit from a fairly different batch
     // size. 100k has been chosen as the default batch size to strike a reasonable balance for operations that do need
@@ -51,43 +43,43 @@ public class OpticReadContext extends ContextSupport {

     private PlanAnalysis planAnalysis;
     private StructType schema;
-    private long serverTimestamp;
     private List<OpticFilter> opticFilters;
     private final long batchSize;

     public OpticReadContext(Map<String, String> properties, StructType schema, int defaultMinPartitions) {
         super(properties);
-        this.schema = schema;
-
-        final long partitionCount = getNumericOption(Options.READ_NUM_PARTITIONS, defaultMinPartitions, 1);
-        this.batchSize = getNumericOption(Options.READ_BATCH_SIZE, DEFAULT_BATCH_SIZE, 0);

         final String dslQuery = properties.get(Options.READ_OPTIC_QUERY);
         if (dslQuery == null || dslQuery.trim().length() < 1) {
             throw new ConnectorException(Util.getOptionNameForErrorMessage("spark.marklogic.read.noOpticQuery"));
         }

-        DatabaseClient client = connectToMarkLogic();
-        RawQueryDSLPlan dslPlan = client.newRowManager().newRawQueryDSLPlan(new StringHandle(dslQuery));
-
-        try {
-            this.planAnalysis = new PlanAnalyzer((DatabaseClientImpl) client).analyzePlan(
-                dslPlan.getHandle(), partitionCount, batchSize
-            );
-        } catch (FailedRequestException ex) {
-            handlePlanAnalysisError(dslQuery, ex);
-        }
+        this.schema = schema;
+        this.batchSize = getNumericOption(Options.READ_BATCH_SIZE, DEFAULT_BATCH_SIZE, 0);
+        this.planAnalysis = analyzePlan(dslQuery, getNumericOption(Options.READ_NUM_PARTITIONS, defaultMinPartitions, 1));

         if (this.planAnalysis != null) {
             if (Util.MAIN_LOGGER.isInfoEnabled()) {
                 Util.MAIN_LOGGER.info("Partition count: {}; number of requests that will be made to MarkLogic: {}",
                     this.planAnalysis.getPartitions().size(), this.planAnalysis.getAllBuckets().size());
             }
-            // Calling this to establish a server timestamp.
-            StringHandle columnInfoHandle = client.newRowManager().columnInfo(dslPlan, new StringHandle());
-            this.serverTimestamp = columnInfoHandle.getServerTimestamp();
-            if (logger.isDebugEnabled()) {
-                logger.debug("Will use server timestamp: {}", serverTimestamp);
+            if (Util.MAIN_LOGGER.isDebugEnabled() && planAnalysis.getServerTimestamp() > 0) {
+                Util.MAIN_LOGGER.debug("Will use server timestamp: {}", planAnalysis.getServerTimestamp());
+            }
+        }
+    }
+
+    private PlanAnalysis analyzePlan(final String dslQuery, final long partitionCount) {
+        DatabaseClient client = null;
+        try {
+            client = connectToMarkLogic();
+            return new PlanAnalyzer((DatabaseClientImpl) client).analyzePlan(dslQuery, partitionCount, batchSize);
+        } catch (FailedRequestException ex) {
+            handlePlanAnalysisError(dslQuery, ex);
+            return null;
+        } finally {
+            if (client != null) {
+                client.release();
             }
         }
     }
@@ -102,33 +94,44 @@ private void handlePlanAnalysisError(String query, FailedRequestException ex) {
     }

     Iterator<JsonNode> readRowsInBucket(RowManager rowManager, PlanAnalysis.Partition partition, PlanAnalysis.Bucket bucket) {
-        if (logger.isDebugEnabled()) {
-            logger.debug("Getting rows for partition {} and bucket {} at server timestamp {}", partition, bucket, serverTimestamp);
+        final long serverTimestamp = planAnalysis.getServerTimestamp();
+        if (Util.MAIN_LOGGER.isDebugEnabled()) {
+            if (serverTimestamp > 0) {
+                Util.MAIN_LOGGER.debug("Getting rows for partition {} and bucket {} at server timestamp {}", partition, bucket, serverTimestamp);
+            } else {
+                Util.MAIN_LOGGER.debug("Getting rows for partition {} and bucket {}", partition, bucket);
+            }
         }

         // This should never occur, as a query should only ever occur when rows were initially found, which leads to a
         // server timestamp being captured. But if it were somehow to occur, we should error out as the row-ID-based
         // partitions are not reliable without a consistent server timestamp.
-        if (serverTimestamp < 1) {
+        if (serverTimestamp < 1 && !bucket.isSingleCallToMarkLogic()) {
             throw new ConnectorException(String.format("Unable to read rows; invalid server timestamp: %d", serverTimestamp));
         }

-        PlanBuilder.Plan plan = buildPlanForBucket(rowManager, bucket);
-        JacksonHandle jsonHandle = new JacksonHandle();
-        jsonHandle.setPointInTimeQueryTimestamp(serverTimestamp);
+        final PlanBuilder.Plan plan = buildPlanForBucket(rowManager, bucket);
+        final JacksonHandle jsonHandle = new JacksonHandle();
+        if (!bucket.isSingleCallToMarkLogic()) {
+            jsonHandle.setPointInTimeQueryTimestamp(serverTimestamp);
+        }
+
         // Remarkably, the use of resultDoc has consistently proven to be a few percentage points faster than using
         // resultRows with a StringHandle, even though the latter avoids the need for converting to and from a JsonNode.
         // The overhead with resultRows may be due to the processing of a multipart response; it's not yet clear.
-        JsonNode result = rowManager.resultDoc(plan, jsonHandle).get();
+        final JsonNode result = rowManager.resultDoc(plan, jsonHandle).get();
         return result != null && result.has("rows") ?
             result.get("rows").iterator() :
             new ArrayList<JsonNode>().iterator();
     }

     private PlanBuilder.Plan buildPlanForBucket(RowManager rowManager, PlanAnalysis.Bucket bucket) {
-        PlanBuilder.Plan plan = rowManager.newRawPlanDefinition(new JacksonHandle(planAnalysis.getBoundedPlan()))
-            .bindParam("ML_LOWER_BOUND", bucket.lowerBound)
-            .bindParam("ML_UPPER_BOUND", bucket.upperBound);
+        PlanBuilder.Plan plan = rowManager.newRawPlanDefinition(new JacksonHandle(planAnalysis.getSerializedPlan()));
+
+        if (!bucket.isSingleCallToMarkLogic()) {
+            plan = plan.bindParam("ML_LOWER_BOUND", bucket.lowerBound)
+                .bindParam("ML_UPPER_BOUND", bucket.upperBound);
+        }

         if (opticFilters != null) {
             for (OpticFilter opticFilter : opticFilters) {
@@ -143,15 +146,15 @@ void pushDownFiltersIntoOpticQuery(List<OpticFilter> opticFilters) {
         this.opticFilters = opticFilters;
         // Add each filter in a separate "where" so we don't toss an op.sqlCondition into an op.and,
         // which Optic does not allow.
-        opticFilters.forEach(filter -> addOperatorToPlan(PlanUtil.buildWhere(filter)));
+        opticFilters.forEach(filter -> planAnalysis.pushOperatorIntoPlan(PlanUtil.buildWhere(filter)));
     }

     void pushDownLimit(int limit) {
-        addOperatorToPlan(PlanUtil.buildLimit(limit));
+        planAnalysis.pushOperatorIntoPlan(PlanUtil.buildLimit(limit));
     }

     void pushDownTopN(SortOrder[] orders, int limit) {
-        addOperatorToPlan(PlanUtil.buildOrderBy(orders));
+        planAnalysis.pushOperatorIntoPlan(PlanUtil.buildOrderBy(orders));
         pushDownLimit(limit);
     }

@@ -160,10 +163,10 @@ void pushDownAggregation(Aggregation aggregation) {
             .map(PlanUtil::expressionToColumnName)
             .collect(Collectors.toList());

-        if (logger.isDebugEnabled()) {
-            logger.debug("groupBy column names: {}", groupByColumnNames);
+        if (Util.MAIN_LOGGER.isDebugEnabled()) {
+            Util.MAIN_LOGGER.debug("groupBy column names: {}", groupByColumnNames);
         }
-        addOperatorToPlan(PlanUtil.buildGroupByAggregation(new HashSet<>(groupByColumnNames), aggregation));
+        planAnalysis.pushOperatorIntoPlan(PlanUtil.buildGroupByAggregation(new HashSet<>(groupByColumnNames), aggregation));

         StructType newSchema = buildSchemaWithColumnNames(groupByColumnNames);

@@ -186,10 +189,8 @@ void pushDownAggregation(Aggregation aggregation) {
                 Sum sum = (Sum) func;
                 StructField field = findColumnInSchema(sum.column(), PlanUtil.expressionToColumnName(sum.column()));
                 newSchema = newSchema.add(func.toString(), field.dataType());
-            } else {
-                if (logger.isDebugEnabled()) {
-                    logger.debug("Unsupported aggregate function: {}", func);
-                }
+            } else if (Util.MAIN_LOGGER.isDebugEnabled()) {
+                Util.MAIN_LOGGER.debug("Unsupported aggregate function: {}", func);
             }
         }

@@ -199,7 +200,7 @@ void pushDownAggregation(Aggregation aggregation) {
             List<PlanAnalysis.Partition> mergedPartitions = planAnalysis.getPartitions().stream()
                 .map(p -> p.mergeBuckets())
                 .collect(Collectors.toList());
-            this.planAnalysis = new PlanAnalysis(planAnalysis.getBoundedPlan(), mergedPartitions);
+            this.planAnalysis = new PlanAnalysis(planAnalysis.getSerializedPlan(), mergedPartitions, planAnalysis.getServerTimestamp());
         }

         if (Util.MAIN_LOGGER.isDebugEnabled()) {
@@ -237,7 +238,7 @@ private StructField findColumnInSchema(Expression expression, String columnName)

     void pushDownRequiredSchema(StructType requiredSchema) {
         this.schema = requiredSchema;
-        addOperatorToPlan(PlanUtil.buildSelect(requiredSchema));
+        planAnalysis.pushOperatorIntoPlan(PlanUtil.buildSelect(requiredSchema));
     }

     boolean planAnalysisFoundNoRows() {
@@ -246,21 +247,6 @@ boolean planAnalysisFoundNoRows() {
         return planAnalysis == null;
     }

-    /**
-     * The internal/viewinfo endpoint is known to add an op:prepare operator at the end of the list of operator
-     * args. Each operator added by the connector based on pushdowns needs to be added before this op:prepare
-     * operator; otherwise, MarkLogic will throw an error.
-     *
-     * @param operator
-     */
-    private void addOperatorToPlan(ObjectNode operator) {
-        if (logger.isDebugEnabled()) {
-            logger.debug("Adding operator to plan: {}", operator);
-        }
-        ArrayNode operators = (ArrayNode) planAnalysis.getBoundedPlan().get("$optic").get("args");
-        operators.insert(operators.size() - 1, operator);
-    }
-
     StructType getSchema() {
         return schema;
     }
