
Commit 12c1d54

Merge pull request #78 from marklogic/feature/doc-tweaks
Doc tweaks
2 parents 361ea39 + 936eb13

5 files changed: +51 -48 lines changed


docs/configuration.md

Lines changed: 17 additions & 16 deletions
@@ -11,8 +11,8 @@ options. Each set of options is defined in a separate table below.
 
 These options define how the connector connects and authenticates with MarkLogic.
 
-| Option | Description |
-|---------------------------------------------| --- |
+| Option | Description |
+| --- | --- |
 | spark.marklogic.client.host | Required; the host name to connect to; this can be the name of a host in your MarkLogic cluster or the host name of a load balancer. |
 | spark.marklogic.client.port | Required; the port of the app server in MarkLogic to connect to. |
 | spark.marklogic.client.basePath | Base path to prefix on each request to MarkLogic. |
@@ -83,29 +83,30 @@ describes the other choices for this option.
 These options control how the connector reads data from MarkLogic. See [the guide on reading](reading.md) for more
 information on how data is read from MarkLogic.
 
-| Option | Description |
-| --- |---------------------------------------------------------------------------------------------------|
-| spark.marklogic.read.opticQuery | Required; the Optic DSL query to run for retrieving rows; must use `op.fromView` as the accessor. |
-| spark.marklogic.read.numPartitions | The number of Spark partitions to create; defaults to `spark.default.parallelism`. |
+| Option | Description |
+| --- | --- |
 | spark.marklogic.read.batchSize | Approximate number of rows to retrieve in each call to MarkLogic; defaults to 100000. |
+| spark.marklogic.read.numPartitions | The number of Spark partitions to create; defaults to `spark.default.parallelism`. |
+| spark.marklogic.read.opticQuery | Required; the Optic DSL query to run for retrieving rows; must use `op.fromView` as the accessor. |
 | spark.marklogic.read.pushDownAggregates | Whether to push down aggregate operations to MarkLogic; defaults to `true`. Set to `false` to prevent aggregates from being pushed down to MarkLogic. |
+
 ## Write options
 
 These options control how the connector writes data to MarkLogic. See [the guide on writing](writing.md) for more
 information on how data is written to MarkLogic.
 
-| Option | Description |
-| --- |-----------------------------------------------------------------------------------|
+| Option | Description |
+| --- | --- |
 | spark.marklogic.write.abortOnFailure | Whether the Spark job should abort if a batch fails to be written; defaults to `true`. |
 | spark.marklogic.write.batchSize | The number of documents written in a call to MarkLogic; defaults to 100. |
-| spark.marklogic.write.collections | Comma-delimited string of collection names to add to each document |
-| spark.marklogic.write.permissions | Comma-delimited string of role names and capabilities to add to each document - e.g. role1,read,role2,update,role3,execute |
-| spark.marklogic.write.temporalCollection | Name of a temporal collection to assign each document to |
+| spark.marklogic.write.collections | Comma-delimited string of collection names to add to each document. |
+| spark.marklogic.write.permissions | Comma-delimited string of role names and capabilities to add to each document - e.g. role1,read,role2,update,role3,execute . |
+| spark.marklogic.write.temporalCollection | Name of a temporal collection to assign each document to. |
 | spark.marklogic.write.threadCount | The number of threads used within each partition to send documents to MarkLogic; defaults to 4. |
-| spark.marklogic.write.transform | Name of a REST transform to apply to each document |
-| spark.marklogic.write.transformParams | Comma-delimited string of transform parameter names and values - e.g. param1,value1,param2,value2 |
-| spark.marklogic.write.transformParamsDelimiter | Delimiter to use instead of a command for the `transformParams` option |
-| spark.marklogic.write.uriPrefix | String to prepend to each document URI, where the URI defaults to a UUID |
-| spark.marklogic.write.uriSuffix | String to append to each document URI, where the URI defaults to a UUID |
+| spark.marklogic.write.transform | Name of a REST transform to apply to each document. |
+| spark.marklogic.write.transformParams | Comma-delimited string of transform parameter names and values - e.g. param1,value1,param2,value2 . |
+| spark.marklogic.write.transformParamsDelimiter | Delimiter to use instead of a command for the `transformParams` option. |
+| spark.marklogic.write.uriPrefix | String to prepend to each document URI, where the URI defaults to a UUID. |
+| spark.marklogic.write.uriSuffix | String to append to each document URI, where the URI defaults to a UUID. |
 | spark.marklogic.write.uriTemplate | String defining a template for constructing each document URI. See [Writing data](writing.md) for more information. |
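
For illustration, a minimal PySpark sketch of how these options are passed to the connector as plain `.option()` key/value pairs. The `spark.marklogic.client.uri` value, the credentials, and the `example`/`employee` view come from the getting-started example and are placeholders for your own connection details and view:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Connection options and read options are supplied the same way; the option
# names match the tables above. The opticQuery uses op.fromView, as required.
df = spark.read.format("com.marklogic.spark") \
    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8020") \
    .option("spark.marklogic.read.opticQuery", "op.fromView('example', 'employee')") \
    .option("spark.marklogic.read.batchSize", 100000) \
    .load()
```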

docs/getting-started/pyspark.md

Lines changed: 8 additions & 8 deletions
@@ -16,9 +16,8 @@ obtaining the connector and deploying an example application to MarkLogic.
 
 The [PySpark installation guide](https://spark.apache.org/docs/latest/api/python/getting_started/install.html) describes
 how to install PySpark. As noted in that guide, you will need to install Python 3 first if you do not already have it
-installed. [pyenv](https://github.com/pyenv/pyenv#installation) is recommended for doing so, as it simplifies
-installing multiple versions of Python and easily switching between them. You are free though to install Python 3 in
-any manner you wish.
+installed. [pyenv](https://github.com/pyenv/pyenv#installation) is a popular choice for doing so, as it simplifies
+installing and switching between multiple versions of Python.
 
 Once you have installed PySpark, run the following from a command line to ensure PySpark is installed correctly:
 
@@ -43,7 +42,7 @@ When PySpark starts, you should see information like this on how to configure lo
 Setting default log level to "WARN".
 To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
 
-Setting the default log level to "INFO" or "DEBUG" will show logging from the MarkLogic Spark connector. This will also
+Setting the default log level to `INFO` or `DEBUG` will show logging from the MarkLogic Spark connector. This will also
 include potentially significant amounts of log messages from PySpark itself.
 
 ### Reading data with the connector
@@ -81,13 +80,13 @@ The [PySpark docs](https://spark.apache.org/docs/latest/api/python/getting_start
 information on how a Spark DataFrame works along with more commands that you can try on it.
 
 The instructions above can be applied to your own MarkLogic application. You can use the same Spark command above,
-simply adjusting the connection details and the Optic DSL query. Please see
+simply adjusting the connection details and the Optic query. Please see
 [the guide on reading data](../reading.md) for more information on how data can be read from MarkLogic.
 
 ### Writing data to the connector
 
 The connector writes the rows in a Spark DataFrame to MarkLogic as new JSON documents, which can also be transformed
-into XML documents if desired. To try this on the DataFrame that was read from MarkLogic in the above section,
+into XML documents. To try this on the DataFrame that was read from MarkLogic in the above section,
 paste the following into PySpark, adjusting the host and password values as needed:
 
 ```
@@ -100,8 +99,9 @@ df.write.format("com.marklogic.spark") \
 .save()
 ```
 
-To examine the results, access your MarkLogic server's qconsole tool again and click on the "Explore" button for the
-`spark-example-content` database. The database should now have 2,000 documents - the 1,000 documents in the
+To examine the results, access your [MarkLogic server's qconsole tool](https://docs.marklogic.com/guide/qconsole/intro)
+and click on the "Explore" button for the `spark-example-content` database. The database should now have
+2,000 documents - the 1,000 documents in the
 `employee` collection that were loaded when the application was deployed, and the 1,000 documents in the
 `write-test` collection that were written by the PySpark command above. Each document in the `write-test` collection
 will have field names based on the column names in the Spark DataFrame.
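
As a sketch of the kind of write command this guide refers to, using the `write-test` collection described in the results above; the connection value is a placeholder and the roles in the permissions string are assumptions, not values confirmed by the commit:

```
# "df" is the DataFrame read from MarkLogic in the previous section.
df.write.format("com.marklogic.spark") \
    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8020") \
    .option("spark.marklogic.write.collections", "write-test") \
    .option("spark.marklogic.write.permissions", "rest-reader,read,rest-writer,update") \
    .mode("append") \
    .save()
```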

docs/getting-started/setup.md

Lines changed: 4 additions & 4 deletions
@@ -19,7 +19,7 @@ environment's documentation on how to achieve this.
 ## Deploy an example application
 
 The connector allows a user to specify an
-[Optic DSL query](https://docs.marklogic.com/guide/app-dev/OpticAPI#id_46710) to select rows to retrieve from
+[Optic query](https://docs.marklogic.com/guide/app-dev/OpticAPI#id_46710) to select rows to retrieve from
 MarkLogic. The query depends on a [MarkLogic view](https://docs.marklogic.com/guide/app-dev/OpticAPI#id_68685) that
 projects data from documents in MarkLogic into rows.
 
@@ -42,14 +42,14 @@ MarkLogic server that includes a
 
 After the deployment finishes, your MarkLogic server will now have the following:
 
-- An app server named `spark-example` listening on port 8020 (or the port you chose if you overrode the `mlPort`
+- An app server named `spark-example` listening on port 8020 (or the port you chose if you modified the `mlPort`
 property).
 - A database named `spark-example-content` that contains 1000 JSON documents in the `employee` collection.
 - A TDE with a schema name of `example` and a view name of `employee`.
-- A user named `spark-example-user` that can be used with the Spark connector and in MarkLogic's qconsole tool.
+- A user named `spark-example-user` that can be used with the Spark connector and [MarkLogic's qconsole tool](https://docs.marklogic.com/guide/qconsole/intro).
 
 To verify that your application was deployed correctly, access your MarkLogic server's qconsole tool - for example,
-if your MarkLogic server is deployed locally, you will go to http://localhost:8000/qconsole . You can authenticate as
+if your MarkLogic server is deployed locally, you will go to <http://localhost:8000/qconsole> . You can authenticate as
 the `spark-example-user` user that was created above, as it's generally preferable to test as a non-admin user.
 After authenticating, perform the following steps:

docs/reading.md

Lines changed: 5 additions & 3 deletions
@@ -5,7 +5,7 @@ nav_order: 3
 ---
 
 The MarkLogic Spark connector allows for data to be retrieved from MarkLogic as rows via an
-[Optic DSL query](https://docs.marklogic.com/guide/app-dev/OpticAPI#id_46710). The
+[Optic query](https://docs.marklogic.com/guide/app-dev/OpticAPI#id_46710). The
 sections below provide more detail on configuring how data is retrieved and converted into a Spark DataFrame.
 
 ## Basic read operation
@@ -21,7 +21,7 @@ df = spark.read.format("com.marklogic.spark") \
 ```
 
 As shown above, `format`, `spark.marklogic.client.uri` (or the other `spark.marklogic.client` options
-that can be used to define the connection details), and `spark.marklogic.read.opticQuery` are always required. The
+that can be used to define the connection details), and `spark.marklogic.read.opticQuery` are required. The
 following sections provide more details about these and other options that can be set.
 
 ## Optic query requirements
@@ -120,7 +120,9 @@ to provide meaningful context when an error occurs to assist with debugging the
 
 In practice, it is expected that most errors will be a result of a misconfiguration. For example, the connection and
 authentication options may be incorrect, or the Optic query may have a syntax error. Any errors that cannot be
-fixed via changes to the options passed to the connector should be reported as new issues to this GitHub repository.
+fixed via changes to the options passed to the connector should be
+[reported as new issues](https://github.com/marklogic/marklogic-spark-connector/issues) in the connector's GitHub
+repository.
 
 ## Pushing down operations
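
A rough sketch of how the documented read options fit together; the connection value, view name, and tuning numbers are placeholders rather than recommendations:

```
df = spark.read.format("com.marklogic.spark") \
    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8020") \
    .option("spark.marklogic.read.opticQuery", "op.fromView('example', 'employee')") \
    .option("spark.marklogic.read.numPartitions", 4) \
    .option("spark.marklogic.read.batchSize", 50000) \
    .load()

# With spark.marklogic.read.pushDownAggregates left at its default of true,
# an aggregate such as this count may be pushed down to MarkLogic rather than
# computed in Spark.
df.count()
```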

docs/writing.md

Lines changed: 17 additions & 17 deletions
@@ -40,7 +40,7 @@ convert the JSON document into an XML document, which then can be further modifi
 
 Parameters can be passed to your REST transform via the `spark.marklogic.write.transformParams` option. The value of
 this option must be a comma-delimited string of the form `param1,value1,param2,value,etc`. For example, if your
-transform accepts parameters named "color" and "size", the following option would pass values to the transform for
+transform accepts parameters named "color" and "size", the following options would pass values to the transform for
 those parameter names:
 
 .option("spark.marklogic.write.transform", "my-transform")
@@ -57,8 +57,8 @@ the parameter values contains a comma:
 ## Configuring document URIs
 
 By default, the connector will construct a URI for each document beginning with a UUID and ending with `.json`. A
-prefix can be specified via `spark.marklogic.write.uriPrefix`, and the default suffix of `.json` can be overridden
-via `spark.marklogic.write.uriSuffix`. For example, the following options would results in URIs of the form
+prefix can be specified via `spark.marklogic.write.uriPrefix`, and the default suffix of `.json` can be modified
+via `spark.marklogic.write.uriSuffix`. For example, the following options would result in URIs of the form
 "/employee/(a random UUID value)/record.json":
 
 .option("spark.marklogic.write.uriPrefix", "/employee/")
@@ -68,7 +68,7 @@ URIs can also be constructed based on column values for a given row. The `spark.
 allows for column names to be referenced via braces when constructing a URI. If this option is used, the
 above options for setting a prefix and suffix will be ignored, as the template can be used to define the entire URI.
 
-For example, consider a Spark DataFrame with, among other columns, columns named `organization` and `employee_id`.
+For example, consider a Spark DataFrame with a set of columns including `organization` and `employee_id`.
 The following template would construct URIs based on those two columns:
 
 .option("spark.marklogic.write.uriTemplate", "/example/{organization}/{employee_id}.json")
@@ -122,7 +122,7 @@ spark.readStream \
 .format("csv") \
 .schema(StructType([StructField("GivenName", StringType()), StructField("Surname", StringType())])) \
 .option("header", True) \
-.load("data/csv-files") \
+.load("examples/getting-started/data/csv-files") \
 .writeStream \
 .format("com.marklogic.spark") \
 .option("checkpointLocation", tempfile.mkdtemp()) \
@@ -143,36 +143,35 @@ sources and writing it directly to MarkLogic.
 ## Error handling
 
 The connector may throw an error during one of two phases of operation - before it begins to write data to MarkLogic,
-and during the writing of data to MarkLogic.
+and during the writing of a batch of documents to MarkLogic.
 
 For the first kind of error, the error will be immediately returned to the user and no data will have been written.
-Such errors are often due to misconfiguration of the connector options and should be fixable.
+Such errors are often due to misconfiguration of the connector options.
 
 For the second kind of error, the connector defaults to logging the error and asking Spark to abort the entire write
 operation. Any batches of documents that were written successfully prior to the error occurring will still exist in the
 database. To configure the connector to only log the error and continue writing batches of documents to MarkLogic, set
 the `spark.marklogic.write.abortOnFailure` option to a value of `false`.
 
 Similar to errors with reading data, the connector will strive to provide meaningful context when an error occurs to
-assist with debugging the cause of the error. Any errors that cannot be fixed via changes to the options passed to the
-connector should be reported as new issues to this GitHub repository.
+assist with debugging the cause of the error.
 
 ## Tuning performance
 
 The MarkLogic Spark connector uses MarkLogic's
 [Data Movement SDK](https://docs.marklogic.com/guide/java/data-movement) for writing documents to a database. The
 following options can be set to adjust how the connector performs when writing data:
 
-- `spark.marklogic.write.batchSize` = the number of documents written in one call to MarkLogic; defaults to 100
+- `spark.marklogic.write.batchSize` = the number of documents written in one call to MarkLogic; defaults to 100.
 - `spark.marklogic.write.threadCount` = the number of threads used by each partition to write documents to MarkLogic;
-defaults to 4
+defaults to 4.
 
 These options are in addition to the number of partitions within the Spark DataFrame that is being written to
 MarkLogic. For each partition in the DataFrame, a separate instance of a MarkLogic batch writer is created, each
 with its own set of threads.
 
 Optimizing performance will thus involve testing various combinations of partition counts, batch sizes, and thread
-counts. The [MarkLogic Monitoring tools](https://docs.marklogic.com/guide/monitoring/intro) can help you understand
+counts. The [MarkLogic Monitoring tool](https://docs.marklogic.com/guide/monitoring/intro) can help you understand
 resource consumption and throughput from Spark to MarkLogic.
 
 ## Supported save modes
@@ -182,8 +181,9 @@ Spark supports
 when writing data. The MarkLogic Spark connector requires the `append` mode to be used. Because Spark defaults to
 the `error` mode, you will need to set this to `append` each time you use the connector to write data.
 
-`append` is the only supported mode because MarkLogic does not have the concept of a "table" that a document
-must belong to, and only belong to one of. The Spark save modes give a user control over how data is written based
-on whether the target table exists. Because no such concept of a table exists in MarkLogic, the differences between
-the various modes do not apply to MarkLogic. Note that while a collection in MarkLogic has some similarities to a
-table, it is fundamentally different in that a document can belong to zero to many collections.
+`append` is the only supported mode due to MarkLogic not having the concept of a single "table" that a document
+must belong to. The Spark save modes give a user control over how data is written based
+on whether the target table exists. Because the concept of a rigid table does not exist in MarkLogic, the differences
+between the various modes do not apply to MarkLogic. Note that while a collection in MarkLogic has some similarities to
+a table, it is fundamentally different in that a document can belong to zero to many collections and collections do not
+impose any schema constraints.
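
Putting several of the write options above together, a hedged sketch: the URI template and transform values mirror the examples in the diff, while the connection value, batch size, thread count, and transform parameter values are assumptions, as is the presence of `organization` and `employee_id` columns in the DataFrame being written:

```
# Assumes df has "organization" and "employee_id" columns for the URI template.
df.write.format("com.marklogic.spark") \
    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8020") \
    .option("spark.marklogic.write.uriTemplate", "/example/{organization}/{employee_id}.json") \
    .option("spark.marklogic.write.transform", "my-transform") \
    .option("spark.marklogic.write.transformParams", "color,blue,size,medium") \
    .option("spark.marklogic.write.batchSize", 200) \
    .option("spark.marklogic.write.threadCount", 8) \
    .option("spark.marklogic.write.abortOnFailure", False) \
    .mode("append") \
    .save()
```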
