
Commit 74ab928

MLE-12420 Docs for 2.2.0
Not doing the file reading/writing stuff yet.
1 parent 3d157e2 commit 74ab928

5 files changed: +95 -24 lines changed


docs/configuration.md

Lines changed: 2 additions & 2 deletions
@@ -145,15 +145,15 @@ The following options control how the connector reads document rows from MarkLog
 | Option | Description |
 | --- | --- |
 | spark.marklogic.read.documents.stringQuery | A [MarkLogic string query](https://docs.marklogic.com/guide/search-dev/string-query) for selecting documents. |
-| spark.marklogic.read.documents.query | A JSON or XML representation of a structured query, serialized CTS query, or combined query. |
+| spark.marklogic.read.documents.query | A JSON or XML representation of a [structured query](https://docs.marklogic.com/guide/search-dev/structured-query#), [serialized CTS query](https://docs.marklogic.com/guide/rest-dev/search#id_30577), or [combined query](https://docs.marklogic.com/guide/rest-dev/search#id_69918). |
 | spark.marklogic.read.documents.categories | Controls which metadata is returned for each document. Defaults to `content`. Allowable values are `content`, `metadata`, `collections`, `permissions`, `quality`, `properties`, and `metadatavalues`. |
 | spark.marklogic.read.documents.collections | Comma-delimited string of zero to many collections to constrain the query. |
 | spark.marklogic.read.documents.directory | Database directory - e.g. "/company/employees/" - to constrain the query. |
 | spark.marklogic.read.documents.options | Name of a set of [MarkLogic search options](https://docs.marklogic.com/guide/search-dev/query-options) to be applied against a string query. |
+| spark.marklogic.read.documents.partitionsPerForest | Number of Spark partition readers to create per forest; defaults to 4. |
 | spark.marklogic.read.documents.transform | Name of a [MarkLogic REST transform](https://docs.marklogic.com/guide/rest-dev/transforms) to apply to each matching document. |
 | spark.marklogic.read.documents.transformParams | Comma-delimited sequence of transform parameter names and values - e.g. `param1,value1,param2,value`. |
 | spark.marklogic.read.documents.transformParamsDelimiter | Delimiter for transform parameters; defaults to a comma. |
-| spark.marklogic.read.documents.partitionsPerForest | Number of Spark partition readers to create per forest; defaults to 4. |

 ## Write options

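For orientation, a minimal PySpark sketch (illustrative, not from the committed docs) combining a few of the read options in the table above; the connection string, query text, and collection name are placeholder values borrowed from examples elsewhere in these docs:

```
# Placeholder connection details; a string query, a collection constraint,
# and an explicit partition count combined in one read.
df = spark.read.format("marklogic") \
    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003") \
    .option("spark.marklogic.read.documents.stringQuery", "Engineering") \
    .option("spark.marklogic.read.documents.collections", "employee") \
    .option("spark.marklogic.read.documents.partitionsPerForest", "4") \
    .load()
df.show()
```
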
docs/getting-started/pyspark.md

Lines changed: 11 additions & 0 deletions
@@ -78,6 +78,17 @@ The `df` variable is an instance of a Spark DataFrame. Try the following command
 The [PySpark docs](https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html) provide more
 information on how a Spark DataFrame works along with more commands that you can try on it.

+As of the connector 2.2.0 release, you can also query for documents, receiving "document" rows that contain columns
+capturing the URI, content, and metadata for each document:
+
+```
+df = spark.read.format("marklogic") \
+    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003") \
+    .option("spark.marklogic.read.documents.collections", "employee") \
+    .load()
+df.show()
+```
+
 The instructions above can be applied to your own MarkLogic application. You can use the same Spark command above,
 simply adjusting the connection details and the Optic query. Please see
 [the guide on reading data](../reading-data/reading.md) for more information on how data can be read from MarkLogic,

docs/reading-data/documents.md

Lines changed: 39 additions & 14 deletions
@@ -18,10 +18,12 @@ when data needs to be retrieved and an [Optic query](optic.md) is not a practica

 ## Usage

-This will be cleaned up before the 2.2.0 release, just getting the basics in place.
+To read documents from MarkLogic, you must specify at least one of 4 supported query types described below - a string
+query; a structured, serialized CTS, or combined query; a collection query; or a directory query. You may specify any
+combination of those 4 query types as well.

-General approach is to specify any combination of a string query, a complex query, collections, and a directory. A
-string query is configured via `spark.marklogic.read.documents.stringQuery`:
+You can specify a [string query](https://docs.marklogic.com/guide/search-dev/string-query) that utilizes
+MarkLogic's search grammar via the `spark.marklogic.read.documents.stringQuery` option:

 ```
 df = spark.read.format("marklogic") \
@@ -31,6 +33,9 @@ df = spark.read.format("marklogic") \
 df.show()
 ```

+The document content is in a column named `content` of type `binary`. See further below for an example of how to use
+common Spark functions to cast this value to a string or parse it into a JSON object.
+
 You can also submit a [structured query](https://docs.marklogic.com/guide/search-dev/structured-query#), a
 [serialized CTS query](https://docs.marklogic.com/guide/rest-dev/search#id_30577), or a
 [combined query](https://docs.marklogic.com/guide/rest-dev/search#id_69918) via
@@ -43,13 +48,15 @@ df = spark.read.format("marklogic") \
     .option("spark.marklogic.read.documents.query", '{"query": {"queries": [{"term-query": {"text": ["Engineering"]} }] } }') \
     .load()
 df.show()
+df.count()

 # Serialized CTS query
 df = spark.read.format("marklogic") \
     .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003") \
     .option("spark.marklogic.read.documents.query", '{"ctsquery": {"wordQuery": {"text": "Engineering"}}}') \
     .load()
 df.show()
+df.count()

 # Combined query
 query = "<search xmlns='http://marklogic.com/appservices/search'>\
@@ -61,6 +68,7 @@ df = spark.read.format("marklogic") \
     .option("spark.marklogic.read.documents.query", query) \
     .load()
 df.show()
+df.count()
 ```

 ## Querying by collections
@@ -73,6 +81,7 @@ df = spark.read.format("marklogic") \
     .option("spark.marklogic.read.documents.collections", "employee") \
     .load()
 df.show()
+df.count()
 ```

 You can also specify collections with any of the above queries:
@@ -84,6 +93,7 @@ df = spark.read.format("marklogic") \
     .option("spark.marklogic.read.documents.stringQuery", "Marketing") \
     .load()
 df.show()
+df.count()
 ```

 ## Querying by directory
@@ -96,20 +106,23 @@ df = spark.read.format("marklogic") \
     .option("spark.marklogic.read.documents.directory", "/employee/") \
     .load()
 df.show()
+df.count()
 ```

 ## Using query options

 If you have a set of [MarkLogic query options](https://docs.marklogic.com/guide/search-dev/query-options) installed in
-your REST API app server, you can reference these via `spark.marklogic.read.documents.options`.
+your REST API app server, you can reference these via `spark.marklogic.read.documents.options`. You will then typically
+use the `spark.marklogic.read.documents.stringQuery` option and reference one or more constraints defined in your
+query options.
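
As an illustration (not from the committed docs), a sketch of pairing those two options; the options name `my-search-options` and its `dept` constraint are hypothetical names you would replace with your own:

```
# Hypothetical names: assumes query options called "my-search-options", defining a
# "dept" constraint, have been installed on the REST API app server.
df = spark.read.format("marklogic") \
    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003") \
    .option("spark.marklogic.read.documents.options", "my-search-options") \
    .option("spark.marklogic.read.documents.stringQuery", "dept:Engineering") \
    .load()
df.show()
```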

 ## Requesting document metadata

-By default, each document row will only have its `URI`, `content`, and `format` columns populated. You can use the
+By default, each row will only have its `URI`, `content`, and `format` columns populated. You can use the
 `spark.marklogic.read.documents.categories` option to request metadata for each document. The value of the option
 must be a comma-delimited list of one or more of the following values:

-- `content` will result in the `content` and `format` columns being populated. If excluded, neither will be populated.
+- `content` will result in the `content` and `format` columns being populated. If excluded from the option value, neither will be populated.
 - `metadata` will result in all metadata columns - collections, permissions, quality, properties, and metadata values -
 being populated.
 - `collections`, `permissions`, `quality`, `properties`, and `metadatavalues` can be used to request each metadata type
@@ -134,7 +147,15 @@ df.show(2)
 +--------------------+--------------------+------+-----------+--------------------+-------+----------+--------------+
 ```

-A value of `collections,permissions`
+Note that the Spark `show()` function allows for results to be displayed in a vertical format instead of in a table.
+You can more easily see values in the metadata columns by requesting a vertical format and dropping the `content` column:
+
+```
+df.drop("content").show(2, 0, True)
+```
+
+A value of `collections,permissions` will result in the `content` and `format` columns being empty and the `collections`
+and `permissions` columns being populated:

 ```
 df = spark.read.format("marklogic") \
@@ -184,16 +205,20 @@ doc = json.loads(df2.head()['content'])
 doc['Department']
 ```

-## Understanding performance
+## Tuning performance

 The connector mimics the behavior of the [MarkLogic Data Movement SDK](https://docs.marklogic.com/guide/java/data-movement)
 by creating Spark partition readers that are assigned to a specific forest. By default, the connector will create
-4 readers per forest. You can use the `spark.marklogic.read.documents.partitionsPerForest` option to control
-the number of readers. You should adjust this based on your cluster configuration. For example,a default REST API app
-server will have 32 server threads and 3 forests per host. 4 partition readers will thus consume 12 of the 32 server
-threads. If the app server is not servicing any other requests, performance will typically be improved by configuring
-8 partitions per forest. Note that the `spark.marklogic.read.numPartitions` option does not have any impact;
-that is only used when reading via an Optic query.
+4 readers per forest. Each reader will read URIs and documents in a specific range of URIs at a specific MarkLogic
+server timestamp, ensuring both that every matching document is retrieved and that the same document is never returned
+more than once for a query.
+
+You can use the `spark.marklogic.read.documents.partitionsPerForest` option to control the number of readers. You
+should adjust this based on your cluster configuration. For example, a default REST API app server will have 32 server
+threads and 3 forests per host. 4 partition readers will thus utilize 12 of the 32 server threads. If the app server
+is not servicing any other requests, performance will typically be improved by configuring 8 partitions per forest.
+Note that the `spark.marklogic.read.numPartitions` option does not have any impact; that is only used when reading
+via an Optic query.

 Each partition reader will make one to many calls to MarkLogic to retrieve documents. The
 `spark.marklogic.read.batchSize` option controls how many documents will be retrieved in a call. The value defaults
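
For reference, a sketch (illustrative, not from the committed docs) of applying the tuning options discussed above; the partition and batch size values are placeholders you would adjust for your own cluster and app server:

```
# Placeholder tuning values - 8 readers per forest, 200 documents per call to MarkLogic.
df = spark.read.format("marklogic") \
    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003") \
    .option("spark.marklogic.read.documents.collections", "employee") \
    .option("spark.marklogic.read.documents.partitionsPerForest", "8") \
    .option("spark.marklogic.read.batchSize", "200") \
    .load()
```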

docs/writing.md

Lines changed: 39 additions & 4 deletions
@@ -37,22 +37,46 @@ that can be used to define the connection details), and `mode` (which must equal
 the collections, permissions , and URI prefix are optional, though it is uncommon to write documents without any
 permissions.

-### Writing file rows as document
+### Writing file rows as documents

 To support the common use case of reading files and ingesting their contents as-is into MarkLogic, the connector has
 special support for rows with a schema matching that of
 [Spark's binaryFile data source](https://spark.apache.org/docs/latest/sql-data-sources-binaryFile.html). If the incoming
 rows adhere to the `binaryFile` schema, the connector will not serialize the row into JSON. Instead, the connector will
 use the `path` column value as an initial URI for the document and the `content` column value as the document contents.
-
-The URI can then be further adjusted as described in the "Controlling document URIs"
-The URI can then be adjusted as described in the "Controlling documents URIs" section below.
+The URI can then be further adjusted as described in the "Controlling document URIs".

 This feature allows for ingesting files of any type. The MarkLogic REST API will
 [determine the document type](https://docs.marklogic.com/guide/rest-dev/intro#id_53367) based on the URI extension, if
 MarkLogic recognizes it. If MarkLogic does not recognize the extension, and you wish to force a document type on each of
 the documents, you can set the `spark.marklogic.write.files.documentType` option to one of `XML`, `JSON`, or `TEXT`.

+### Writing document rows
+
+As of the 2.2.0 release, you can [read documents from MarkLogic](reading-data/documents.md). A common use case is to then write these rows
+to another database, or another MarkLogic cluster, or even the same database the documents were read from, potentially
+transforming them and altering their URIs.
+
+"Document rows" adhere to the following Spark schema, which is important to understand when writing these rows as
+documents to MarkLogic:
+
+1. `URI` is of type `string`.
+2. `content` is of type `binary`.
+3. `format` is of type `string`.
+4. `collections` is an array of `string`s.
+5. `permissions` is a map with keys of type `string` and values that are arrays of `string`s.
+6. `quality` is an `integer`.
+7. `properties` is a map with keys and values of type `string`.
+8. `metadataValues` is a map with keys and values of type `string`.
+
+Writing rows corresponding to the "document row" schema is largely the same as writing rows of any arbitrary schema,
+but bear in mind these differences:
+
+1. All the column values will be honored if populated.
+2. The `collections` and `permissions` will be replaced - not added to - if the `spark.marklogic.write.collections` and
+`spark.marklogic.write.permissions` options are specified.
+3. The `spark.marklogic.write.uriTemplate` option is less useful as only the `URI` and `format` column values are available for use in the template.
+
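To make the round trip concrete, a hedged sketch (illustrative, not from the committed docs) of reading document rows and writing them back under a different collection; the connection details and collection names are placeholders, and the write assumes the connector's usual `append` save mode:

```
# Read document rows, including their metadata columns, from a placeholder collection.
df = spark.read.format("marklogic") \
    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003") \
    .option("spark.marklogic.read.documents.collections", "employee") \
    .option("spark.marklogic.read.documents.categories", "content,metadata") \
    .load()

# Write the rows back; spark.marklogic.write.collections replaces the collections column values.
df.write.format("marklogic") \
    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003") \
    .option("spark.marklogic.write.collections", "employee-copy") \
    .mode("append") \
    .save()
```
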
 ### Controlling document content

 Rows in a Spark DataFrame are written to MarkLogic by default as JSON documents. Each column in a row becomes a
@@ -198,6 +222,17 @@ Optimizing performance will thus involve testing various combinations of partiti
 counts. The [MarkLogic Monitoring tool](https://docs.marklogic.com/guide/monitoring/intro) can help you understand
 resource consumption and throughput from Spark to MarkLogic.

+**You should take care** not to exceed the number of requests that your MarkLogic cluster can reasonably handle at a
+given time. A general rule of thumb is not to use more threads than the number of hosts multiplied by the number of
+threads per app server. A MarkLogic app server defaults to a limit of 32 threads. So for a 3-host cluster, you should
+not exceed 96 total threads. This also assumes that each host is receiving requests - either via a load balancer placed
+in front of the MarkLogic cluster, or by setting the `spark.marklogic.client.connectionType` option to `direct` when
+the connector can directly connect to each host in the cluster.
+
+The rule of thumb above can thus be expressed as:
+
+Number of partitions * Value of spark.marklogic.write.threadCount <= Number of hosts * number of app server threads
+
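For illustration (not from the committed docs), a sketch of a write sized with the rule of thumb above in mind for the 3-host example: 12 partitions with 8 threads each stays at the 96-thread ceiling. The partition count, thread count, and connection details are placeholder values, and the `append` save mode is assumed:

```
# Placeholder sizing for the hypothetical 3-host cluster above: 12 partitions * 8 threads = 96.
df.repartition(12).write.format("marklogic") \
    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003") \
    .option("spark.marklogic.write.threadCount", "8") \
    .mode("append") \
    .save()
```
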
 ### Error handling

 The connector may throw an error during one of two phases of operation - before it begins to write data to MarkLogic,

src/main/java/com/marklogic/spark/Options.java

Lines changed: 4 additions & 4 deletions
@@ -44,16 +44,16 @@ public abstract class Options {
     // "categories" as defined by https://docs.marklogic.com/REST/GET/v1/documents .
     public static final String READ_DOCUMENTS_CATEGORIES = "spark.marklogic.read.documents.categories";
     public static final String READ_DOCUMENTS_COLLECTIONS = "spark.marklogic.read.documents.collections";
+    public static final String READ_DOCUMENTS_DIRECTORY = "spark.marklogic.read.documents.directory";
+    public static final String READ_DOCUMENTS_OPTIONS = "spark.marklogic.read.documents.options";
+    public static final String READ_DOCUMENTS_PARTITIONS_PER_FOREST = "spark.marklogic.read.documents.partitionsPerForest";
     // Corresponds to "q" at https://docs.marklogic.com/REST/POST/v1/search, known as a "string query".
-    public static final String READ_DOCUMENTS_STRING_QUERY = "spark.marklogic.read.documents.stringQuery";
     // Corresponds to the complex query submitted via the request body at https://docs.marklogic.com/REST/POST/v1/search .
     public static final String READ_DOCUMENTS_QUERY = "spark.marklogic.read.documents.query";
-    public static final String READ_DOCUMENTS_OPTIONS = "spark.marklogic.read.documents.options";
-    public static final String READ_DOCUMENTS_DIRECTORY = "spark.marklogic.read.documents.directory";
+    public static final String READ_DOCUMENTS_STRING_QUERY = "spark.marklogic.read.documents.stringQuery";
     public static final String READ_DOCUMENTS_TRANSFORM = "spark.marklogic.read.documents.transform";
     public static final String READ_DOCUMENTS_TRANSFORM_PARAMS = "spark.marklogic.read.documents.transformParams";
     public static final String READ_DOCUMENTS_TRANSFORM_PARAMS_DELIMITER = "spark.marklogic.read.documents.transformParamsDelimiter";
-    public static final String READ_DOCUMENTS_PARTITIONS_PER_FOREST = "spark.marklogic.read.documents.partitionsPerForest";

     public static final String READ_FILES_COMPRESSION = "spark.marklogic.read.files.compression";
