
Commit 12c1d54

Merge pull request #78 from marklogic/feature/doc-tweaks
Doc tweaks
2 parents 361ea39 + 936eb13

5 files changed: +51 -48 lines changed


docs/configuration.md

Lines changed: 17 additions & 16 deletions
@@ -11,8 +11,8 @@ options. Each set of options is defined in a separate table below.
 
 These options define how the connector connects and authenticates with MarkLogic.
 
-| Option | Description |
-|---------------------------------------------| --- |
+| Option | Description |
+| --- | --- |
 | spark.marklogic.client.host | Required; the host name to connect to; this can be the name of a host in your MarkLogic cluster or the host name of a load balancer. |
 | spark.marklogic.client.port | Required; the port of the app server in MarkLogic to connect to. |
 | spark.marklogic.client.basePath | Base path to prefix on each request to MarkLogic. |
@@ -83,29 +83,30 @@ describes the other choices for this option.
 These options control how the connector reads data from MarkLogic. See [the guide on reading](reading.md) for more
 information on how data is read from MarkLogic.
 
-| Option | Description |
-| --- |---------------------------------------------------------------------------------------------------|
-| spark.marklogic.read.opticQuery | Required; the Optic DSL query to run for retrieving rows; must use `op.fromView` as the accessor. |
-| spark.marklogic.read.numPartitions | The number of Spark partitions to create; defaults to `spark.default.parallelism`. |
+| Option | Description |
+| --- | --- |
 | spark.marklogic.read.batchSize | Approximate number of rows to retrieve in each call to MarkLogic; defaults to 100000. |
+| spark.marklogic.read.numPartitions | The number of Spark partitions to create; defaults to `spark.default.parallelism`. |
+| spark.marklogic.read.opticQuery | Required; the Optic DSL query to run for retrieving rows; must use `op.fromView` as the accessor. |
 | spark.marklogic.read.pushDownAggregates | Whether to push down aggregate operations to MarkLogic; defaults to `true`. Set to `false` to prevent aggregates from being pushed down to MarkLogic. |
+
 ## Write options
 
 These options control how the connector writes data to MarkLogic. See [the guide on writing](writing.md) for more
 information on how data is written to MarkLogic.
 
-| Option | Description |
-| --- |-----------------------------------------------------------------------------------|
+| Option | Description |
+| --- | --- |
 | spark.marklogic.write.abortOnFailure | Whether the Spark job should abort if a batch fails to be written; defaults to `true`. |
 | spark.marklogic.write.batchSize | The number of documents written in a call to MarkLogic; defaults to 100. |
-| spark.marklogic.write.collections | Comma-delimited string of collection names to add to each document |
-| spark.marklogic.write.permissions | Comma-delimited string of role names and capabilities to add to each document - e.g. role1,read,role2,update,role3,execute |
-| spark.marklogic.write.temporalCollection | Name of a temporal collection to assign each document to |
+| spark.marklogic.write.collections | Comma-delimited string of collection names to add to each document. |
+| spark.marklogic.write.permissions | Comma-delimited string of role names and capabilities to add to each document - e.g. role1,read,role2,update,role3,execute . |
+| spark.marklogic.write.temporalCollection | Name of a temporal collection to assign each document to. |
 | spark.marklogic.write.threadCount | The number of threads used within each partition to send documents to MarkLogic; defaults to 4. |
-| spark.marklogic.write.transform | Name of a REST transform to apply to each document |
-| spark.marklogic.write.transformParams | Comma-delimited string of transform parameter names and values - e.g. param1,value1,param2,value2 |
-| spark.marklogic.write.transformParamsDelimiter | Delimiter to use instead of a command for the `transformParams` option |
-| spark.marklogic.write.uriPrefix | String to prepend to each document URI, where the URI defaults to a UUID |
-| spark.marklogic.write.uriSuffix | String to append to each document URI, where the URI defaults to a UUID |
+| spark.marklogic.write.transform | Name of a REST transform to apply to each document. |
+| spark.marklogic.write.transformParams | Comma-delimited string of transform parameter names and values - e.g. param1,value1,param2,value2 . |
+| spark.marklogic.write.transformParamsDelimiter | Delimiter to use instead of a command for the `transformParams` option. |
+| spark.marklogic.write.uriPrefix | String to prepend to each document URI, where the URI defaults to a UUID. |
+| spark.marklogic.write.uriSuffix | String to append to each document URI, where the URI defaults to a UUID. |
 | spark.marklogic.write.uriTemplate | String defining a template for constructing each document URI. See [Writing data](writing.md) for more information. |
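
For illustration, a minimal PySpark sketch of how these options are passed to the connector as plain `.option()` key/value pairs. The `spark.marklogic.client.uri` value, the credentials, and the `example`/`employee` view come from the getting-started example and are placeholders for your own connection details and view:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Connection options and read options are supplied the same way; the option
# names match the tables above. The opticQuery uses op.fromView, as required.
df = spark.read.format("com.marklogic.spark") \
    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8020") \
    .option("spark.marklogic.read.opticQuery", "op.fromView('example', 'employee')") \
    .option("spark.marklogic.read.batchSize", 100000) \
    .load()
```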

docs/getting-started/pyspark.md

Lines changed: 8 additions & 8 deletions
@@ -16,9 +16,8 @@ obtaining the connector and deploying an example application to MarkLogic.
 
 The [PySpark installation guide](https://spark.apache.org/docs/latest/api/python/getting_started/install.html) describes
 how to install PySpark. As noted in that guide, you will need to install Python 3 first if you do not already have it
-installed. [pyenv](https://github.com/pyenv/pyenv#installation) is recommended for doing so, as it simplifies
-installing multiple versions of Python and easily switching between them. You are free though to install Python 3 in
-any manner you wish.
+installed. [pyenv](https://github.com/pyenv/pyenv#installation) is a popular choice for doing so, as it simplifies
+installing and switching between multiple versions of Python.
 
 Once you have installed PySpark, run the following from a command line to ensure PySpark is installed correctly:
 
@@ -43,7 +42,7 @@ When PySpark starts, you should see information like this on how to configure lo
 Setting default log level to "WARN".
 To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
 
-Setting the default log level to "INFO" or "DEBUG" will show logging from the MarkLogic Spark connector. This will also
+Setting the default log level to `INFO` or `DEBUG` will show logging from the MarkLogic Spark connector. This will also
 include potentially significant amounts of log messages from PySpark itself.
 
 ### Reading data with the connector
@@ -81,13 +80,13 @@ The [PySpark docs](https://spark.apache.org/docs/latest/api/python/getting_start
 information on how a Spark DataFrame works along with more commands that you can try on it.
 
 The instructions above can be applied to your own MarkLogic application. You can use the same Spark command above,
-simply adjusting the connection details and the Optic DSL query. Please see
+simply adjusting the connection details and the Optic query. Please see
 [the guide on reading data](../reading.md) for more information on how data can be read from MarkLogic.
 
 ### Writing data to the connector
 
 The connector writes the rows in a Spark DataFrame to MarkLogic as new JSON documents, which can also be transformed
-into XML documents if desired. To try this on the DataFrame that was read from MarkLogic in the above section,
+into XML documents. To try this on the DataFrame that was read from MarkLogic in the above section,
 paste the following into PySpark, adjusting the host and password values as needed:
 
 ```
@@ -100,8 +99,9 @@ df.write.format("com.marklogic.spark") \
 .save()
 ```
 
-To examine the results, access your MarkLogic server's qconsole tool again and click on the "Explore" button for the
-`spark-example-content` database. The database should now have 2,000 documents - the 1,000 documents in the
+To examine the results, access your [MarkLogic server's qconsole tool](https://docs.marklogic.com/guide/qconsole/intro)
+and click on the "Explore" button for the `spark-example-content` database. The database should now have
+2,000 documents - the 1,000 documents in the
 `employee` collection that were loaded when the application was deployed, and the 1,000 documents in the
 `write-test` collection that were written by the PySpark command above. Each document in the `write-test` collection
 will have field names based on the column names in the Spark DataFrame.
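
As a sketch of the kind of write command this guide refers to, using the `write-test` collection described in the results above; the connection value is a placeholder and the roles in the permissions string are assumptions, not values confirmed by the commit:

```
# "df" is the DataFrame read from MarkLogic in the previous section.
df.write.format("com.marklogic.spark") \
    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8020") \
    .option("spark.marklogic.write.collections", "write-test") \
    .option("spark.marklogic.write.permissions", "rest-reader,read,rest-writer,update") \
    .mode("append") \
    .save()
```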

docs/getting-started/setup.md

Lines changed: 4 additions & 4 deletions
@@ -19,7 +19,7 @@ environment's documentation on how to achieve this.
 ## Deploy an example application
 
 The connector allows a user to specify an
-[Optic DSL query](https://docs.marklogic.com/guide/app-dev/OpticAPI#id_46710) to select rows to retrieve from
+[Optic query](https://docs.marklogic.com/guide/app-dev/OpticAPI#id_46710) to select rows to retrieve from
 MarkLogic. The query depends on a [MarkLogic view](https://docs.marklogic.com/guide/app-dev/OpticAPI#id_68685) that
 projects data from documents in MarkLogic into rows.
 
@@ -42,14 +42,14 @@ MarkLogic server that includes a
 
 After the deployment finishes, your MarkLogic server will now have the following:
 
-- An app server named `spark-example` listening on port 8020 (or the port you chose if you overrode the `mlPort`
+- An app server named `spark-example` listening on port 8020 (or the port you chose if you modified the `mlPort`
 property).
 - A database named `spark-example-content` that contains 1000 JSON documents in the `employee` collection.
 - A TDE with a schema name of `example` and a view name of `employee`.
-- A user named `spark-example-user` that can be used with the Spark connector and in MarkLogic's qconsole tool.
+- A user named `spark-example-user` that can be used with the Spark connector and [MarkLogic's qconsole tool](https://docs.marklogic.com/guide/qconsole/intro).
 
 To verify that your application was deployed correctly, access your MarkLogic server's qconsole tool - for example,
-if your MarkLogic server is deployed locally, you will go to http://localhost:8000/qconsole . You can authenticate as
+if your MarkLogic server is deployed locally, you will go to <http://localhost:8000/qconsole> . You can authenticate as
 the `spark-example-user` user that was created above, as it's generally preferable to test as a non-admin user.
 After authenticating, perform the following steps:

docs/reading.md

Lines changed: 5 additions & 3 deletions
@@ -5,7 +5,7 @@ nav_order: 3
 ---
 
 The MarkLogic Spark connector allows for data to be retrieved from MarkLogic as rows via an
-[Optic DSL query](https://docs.marklogic.com/guide/app-dev/OpticAPI#id_46710). The
+[Optic query](https://docs.marklogic.com/guide/app-dev/OpticAPI#id_46710). The
 sections below provide more detail on configuring how data is retrieved and converted into a Spark DataFrame.
 
 ## Basic read operation
@@ -21,7 +21,7 @@ df = spark.read.format("com.marklogic.spark") \
 ```
 
 As shown above, `format`, `spark.marklogic.client.uri` (or the other `spark.marklogic.client` options
-that can be used to define the connection details), and `spark.marklogic.read.opticQuery` are always required. The
+that can be used to define the connection details), and `spark.marklogic.read.opticQuery` are required. The
 following sections provide more details about these and other options that can be set.
 
 ## Optic query requirements
@@ -120,7 +120,9 @@ to provide meaningful context when an error occurs to assist with debugging the
 
 In practice, it is expected that most errors will be a result of a misconfiguration. For example, the connection and
 authentication options may be incorrect, or the Optic query may have a syntax error. Any errors that cannot be
-fixed via changes to the options passed to the connector should be reported as new issues to this GitHub repository.
+fixed via changes to the options passed to the connector should be
+[reported as new issues](https://github.com/marklogic/marklogic-spark-connector/issues) in the connector's GitHub
+repository.
 
 ## Pushing down operations
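
A rough sketch of how the documented read options fit together; the connection value, view name, and tuning numbers are placeholders rather than recommendations:

```
df = spark.read.format("com.marklogic.spark") \
    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8020") \
    .option("spark.marklogic.read.opticQuery", "op.fromView('example', 'employee')") \
    .option("spark.marklogic.read.numPartitions", 4) \
    .option("spark.marklogic.read.batchSize", 50000) \
    .load()

# With spark.marklogic.read.pushDownAggregates left at its default of true,
# an aggregate such as this count may be pushed down to MarkLogic rather than
# computed in Spark.
df.count()
```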

docs/writing.md

Lines changed: 17 additions & 17 deletions
@@ -40,7 +40,7 @@ convert the JSON document into an XML document, which then can be further modifi
 
 Parameters can be passed to your REST transform via the `spark.marklogic.write.transformParams` option. The value of
 this option must be a comma-delimited string of the form `param1,value1,param2,value,etc`. For example, if your
-transform accepts parameters named "color" and "size", the following option would pass values to the transform for
+transform accepts parameters named "color" and "size", the following options would pass values to the transform for
 those parameter names:
 
 .option("spark.marklogic.write.transform", "my-transform")
@@ -57,8 +57,8 @@ the parameter values contains a comma:
 ## Configuring document URIs
 
 By default, the connector will construct a URI for each document beginning with a UUID and ending with `.json`. A
-prefix can be specified via `spark.marklogic.write.uriPrefix`, and the default suffix of `.json` can be overridden
-via `spark.marklogic.write.uriSuffix`. For example, the following options would results in URIs of the form
+prefix can be specified via `spark.marklogic.write.uriPrefix`, and the default suffix of `.json` can be modified
+via `spark.marklogic.write.uriSuffix`. For example, the following options would result in URIs of the form
 "/employee/(a random UUID value)/record.json":
 
 .option("spark.marklogic.write.uriPrefix", "/employee/")
@@ -68,7 +68,7 @@ URIs can also be constructed based on column values for a given row. The `spark.
 allows for column names to be referenced via braces when constructing a URI. If this option is used, the
 above options for setting a prefix and suffix will be ignored, as the template can be used to define the entire URI.
 
-For example, consider a Spark DataFrame with, among other columns, columns named `organization` and `employee_id`.
+For example, consider a Spark DataFrame with a set of columns including `organization` and `employee_id`.
 The following template would construct URIs based on those two columns:
 
 .option("spark.marklogic.write.uriTemplate", "/example/{organization}/{employee_id}.json")
@@ -122,7 +122,7 @@ spark.readStream \
 .format("csv") \
 .schema(StructType([StructField("GivenName", StringType()), StructField("Surname", StringType())])) \
 .option("header", True) \
-.load("data/csv-files") \
+.load("examples/getting-started/data/csv-files") \
 .writeStream \
 .format("com.marklogic.spark") \
 .option("checkpointLocation", tempfile.mkdtemp()) \
@@ -143,36 +143,35 @@ sources and writing it directly to MarkLogic.
 ## Error handling
 
 The connector may throw an error during one of two phases of operation - before it begins to write data to MarkLogic,
-and during the writing of data to MarkLogic.
+and during the writing of a batch of documents to MarkLogic.
 
 For the first kind of error, the error will be immediately returned to the user and no data will have been written.
-Such errors are often due to misconfiguration of the connector options and should be fixable.
+Such errors are often due to misconfiguration of the connector options.
 
 For the second kind of error, the connector defaults to logging the error and asking Spark to abort the entire write
 operation. Any batches of documents that were written successfully prior to the error occurring will still exist in the
 database. To configure the connector to only log the error and continue writing batches of documents to MarkLogic, set
 the `spark.marklogic.write.abortOnFailure` option to a value of `false`.
 
 Similar to errors with reading data, the connector will strive to provide meaningful context when an error occurs to
-assist with debugging the cause of the error. Any errors that cannot be fixed via changes to the options passed to the
-connector should be reported as new issues to this GitHub repository.
+assist with debugging the cause of the error.
 
 ## Tuning performance
 
 The MarkLogic Spark connector uses MarkLogic's
 [Data Movement SDK](https://docs.marklogic.com/guide/java/data-movement) for writing documents to a database. The
 following options can be set to adjust how the connector performs when writing data:
 
-- `spark.marklogic.write.batchSize` = the number of documents written in one call to MarkLogic; defaults to 100
+- `spark.marklogic.write.batchSize` = the number of documents written in one call to MarkLogic; defaults to 100.
 - `spark.marklogic.write.threadCount` = the number of threads used by each partition to write documents to MarkLogic;
-defaults to 4
+defaults to 4.
 
 These options are in addition to the number of partitions within the Spark DataFrame that is being written to
 MarkLogic. For each partition in the DataFrame, a separate instance of a MarkLogic batch writer is created, each
 with its own set of threads.
 
 Optimizing performance will thus involve testing various combinations of partition counts, batch sizes, and thread
-counts. The [MarkLogic Monitoring tools](https://docs.marklogic.com/guide/monitoring/intro) can help you understand
+counts. The [MarkLogic Monitoring tool](https://docs.marklogic.com/guide/monitoring/intro) can help you understand
 resource consumption and throughput from Spark to MarkLogic.
 
 ## Supported save modes
@@ -182,8 +181,9 @@ Spark supports
 when writing data. The MarkLogic Spark connector requires the `append` mode to be used. Because Spark defaults to
 the `error` mode, you will need to set this to `append` each time you use the connector to write data.
 
-`append` is the only supported mode because MarkLogic does not have the concept of a "table" that a document
-must belong to, and only belong to one of. The Spark save modes give a user control over how data is written based
-on whether the target table exists. Because no such concept of a table exists in MarkLogic, the differences between
-the various modes do not apply to MarkLogic. Note that while a collection in MarkLogic has some similarities to a
-table, it is fundamentally different in that a document can belong to zero to many collections.
+`append` is the only supported mode due to MarkLogic not having the concept of a single "table" that a document
+must belong to. The Spark save modes give a user control over how data is written based
+on whether the target table exists. Because the concept of a rigid table does not exist in MarkLogic, the differences
+between the various modes do not apply to MarkLogic. Note that while a collection in MarkLogic has some similarities to
+a table, it is fundamentally different in that a document can belong to zero to many collections and collections do not
+impose any schema constraints.
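
Putting several of the write options above together, a hedged sketch: the URI template and transform values mirror the examples in the diff, while the connection value, batch size, thread count, and transform parameter values are assumptions, as is the presence of `organization` and `employee_id` columns in the DataFrame being written:

```
# Assumes df has "organization" and "employee_id" columns for the URI template.
df.write.format("com.marklogic.spark") \
    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8020") \
    .option("spark.marklogic.write.uriTemplate", "/example/{organization}/{employee_id}.json") \
    .option("spark.marklogic.write.transform", "my-transform") \
    .option("spark.marklogic.write.transformParams", "color,blue,size,medium") \
    .option("spark.marklogic.write.batchSize", 200) \
    .option("spark.marklogic.write.threadCount", 8) \
    .option("spark.marklogic.write.abortOnFailure", False) \
    .mode("append") \
    .save()
```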
