
Commit 74ab928

MLE-12420 Docs for 2.2.0
Not doing the file reading/writing stuff yet.
1 parent 3d157e2 commit 74ab928

5 files changed: +95 -24 lines changed


docs/configuration.md

Lines changed: 2 additions & 2 deletions
@@ -145,15 +145,15 @@ The following options control how the connector reads document rows from MarkLog
 | Option | Description |
 | --- | --- |
 | spark.marklogic.read.documents.stringQuery | A [MarkLogic string query](https://docs.marklogic.com/guide/search-dev/string-query) for selecting documents. |
-| spark.marklogic.read.documents.query | A JSON or XML representation of a structured query, serialized CTS query, or combined query. |
+| spark.marklogic.read.documents.query | A JSON or XML representation of a [structured query](https://docs.marklogic.com/guide/search-dev/structured-query#), [serialized CTS query](https://docs.marklogic.com/guide/rest-dev/search#id_30577), or [combined query](https://docs.marklogic.com/guide/rest-dev/search#id_69918). |
 | spark.marklogic.read.documents.categories | Controls which metadata is returned for each document. Defaults to `content`. Allowable values are `content`, `metadata`, `collections`, `permissions`, `quality`, `properties`, and `metadatavalues`. |
 | spark.marklogic.read.documents.collections | Comma-delimited string of zero to many collections to constrain the query. |
 | spark.marklogic.read.documents.directory | Database directory - e.g. "/company/employees/" - to constrain the query. |
 | spark.marklogic.read.documents.options | Name of a set of [MarkLogic search options](https://docs.marklogic.com/guide/search-dev/query-options) to be applied against a string query. |
+| spark.marklogic.read.documents.partitionsPerForest | Number of Spark partition readers to create per forest; defaults to 4. |
 | spark.marklogic.read.documents.transform | Name of a [MarkLogic REST transform](https://docs.marklogic.com/guide/rest-dev/transforms) to apply to each matching document. |
 | spark.marklogic.read.documents.transformParams | Comma-delimited sequence of transform parameter names and values - e.g. `param1,value1,param2,value`. |
 | spark.marklogic.read.documents.transformParamsDelimiter | Delimiter for transform parameters; defaults to a comma. |
-| spark.marklogic.read.documents.partitionsPerForest | Number of Spark partition readers to create per forest; defaults to 4. |

 ## Write options

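For orientation, a minimal PySpark sketch (illustrative, not from the committed docs) combining a few of the read options in the table above; the connection string, query text, and collection name are placeholder values borrowed from examples elsewhere in these docs:

```
# Placeholder connection details; a string query, a collection constraint,
# and an explicit partition count combined in one read.
df = spark.read.format("marklogic") \
    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003") \
    .option("spark.marklogic.read.documents.stringQuery", "Engineering") \
    .option("spark.marklogic.read.documents.collections", "employee") \
    .option("spark.marklogic.read.documents.partitionsPerForest", "4") \
    .load()
df.show()
```
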
docs/getting-started/pyspark.md

Lines changed: 11 additions & 0 deletions
@@ -78,6 +78,17 @@ The `df` variable is an instance of a Spark DataFrame. Try the following command
 The [PySpark docs](https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html) provide more
 information on how a Spark DataFrame works along with more commands that you can try on it.

+As of the connector 2.2.0 release, you can also query for documents, receiving "document" rows that contain columns
+capturing the URI, content, and metadata for each document:
+
+```
+df = spark.read.format("marklogic") \
+    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003") \
+    .option("spark.marklogic.read.documents.collections", "employee") \
+    .load()
+df.show()
+```
+
 The instructions above can be applied to your own MarkLogic application. You can use the same Spark command above,
 simply adjusting the connection details and the Optic query. Please see
 [the guide on reading data](../reading-data/reading.md) for more information on how data can be read from MarkLogic,

docs/reading-data/documents.md

Lines changed: 39 additions & 14 deletions
@@ -18,10 +18,12 @@ when data needs to be retrieved and an [Optic query](optic.md) is not a practica

 ## Usage

-This will be cleaned up before the 2.2.0 release, just getting the basics in place.
+To read documents from MarkLogic, you must specify at least one of 4 supported query types described below - a string
+query; a structured, serialized CTS, or combined query; a collection query; or a directory query. You may specify any
+combination of those 4 query types as well.

-General approach is to specify any combination of a string query, a complex query, collections, and a directory. A
-string query is configured via `spark.marklogic.read.documents.stringQuery`:
+You can specify a [string query](https://docs.marklogic.com/guide/search-dev/string-query) that utilizes
+MarkLogic's search grammar via the `spark.marklogic.read.documents.stringQuery` option:

 ```
 df = spark.read.format("marklogic") \
@@ -31,6 +33,9 @@ df = spark.read.format("marklogic") \
 df.show()
 ```

+The document content is in a column named `content` of type `binary`. See further below for an example of how to use
+common Spark functions to cast this value to a string or parse it into a JSON object.
+
 You can also submit a [structured query](https://docs.marklogic.com/guide/search-dev/structured-query#), a
 [serialized CTS query](https://docs.marklogic.com/guide/rest-dev/search#id_30577), or a
 [combined query](https://docs.marklogic.com/guide/rest-dev/search#id_69918) via
@@ -43,13 +48,15 @@ df = spark.read.format("marklogic") \
     .option("spark.marklogic.read.documents.query", '{"query": {"queries": [{"term-query": {"text": ["Engineering"]} }] } }') \
     .load()
 df.show()
+df.count()

 # Serialized CTS query
 df = spark.read.format("marklogic") \
     .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003") \
     .option("spark.marklogic.read.documents.query", '{"ctsquery": {"wordQuery": {"text": "Engineering"}}}') \
     .load()
 df.show()
+df.count()

 # Combined query
 query = "<search xmlns='http://marklogic.com/appservices/search'>\
@@ -61,6 +68,7 @@ df = spark.read.format("marklogic") \
     .option("spark.marklogic.read.documents.query", query) \
     .load()
 df.show()
+df.count()
 ```

 ## Querying by collections
@@ -73,6 +81,7 @@ df = spark.read.format("marklogic") \
     .option("spark.marklogic.read.documents.collections", "employee") \
     .load()
 df.show()
+df.count()
 ```

 You can also specify collections with any of the above queries:
@@ -84,6 +93,7 @@ df = spark.read.format("marklogic") \
     .option("spark.marklogic.read.documents.stringQuery", "Marketing") \
     .load()
 df.show()
+df.count()
 ```

 ## Querying by directory
@@ -96,20 +106,23 @@ df = spark.read.format("marklogic") \
     .option("spark.marklogic.read.documents.directory", "/employee/") \
     .load()
 df.show()
+df.count()
 ```

 ## Using query options

 If you have a set of [MarkLogic query options](https://docs.marklogic.com/guide/search-dev/query-options) installed in
-your REST API app server, you can reference these via `spark.marklogic.read.documents.options`.
+your REST API app server, you can reference these via `spark.marklogic.read.documents.options`. You will then typically
+use the `spark.marklogic.read.documents.stringQuery` option and reference one or more constraints defined in your
+query options.
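
As an illustration (not from the committed docs), a sketch of pairing those two options; the options name `my-search-options` and its `dept` constraint are hypothetical names you would replace with your own:

```
# Hypothetical names: assumes query options called "my-search-options", defining a
# "dept" constraint, have been installed on the REST API app server.
df = spark.read.format("marklogic") \
    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003") \
    .option("spark.marklogic.read.documents.options", "my-search-options") \
    .option("spark.marklogic.read.documents.stringQuery", "dept:Engineering") \
    .load()
df.show()
```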

 ## Requesting document metadata

-By default, each document row will only have its `URI`, `content`, and `format` columns populated. You can use the
+By default, each row will only have its `URI`, `content`, and `format` columns populated. You can use the
 `spark.marklogic.read.documents.categories` option to request metadata for each document. The value of the option
 must be a comma-delimited list of one or more of the following values:

-- `content` will result in the `content` and `format` columns being populated. If excluded, neither will be populated.
+- `content` will result in the `content` and `format` columns being populated. If excluded from the option value, neither will be populated.
 - `metadata` will result in all metadata columns - collections, permissions, quality, properties, and metadata values -
 being populated.
 - `collections`, `permissions`, `quality`, `properties`, and `metadatavalues` can be used to request each metadata type
@@ -134,7 +147,15 @@ df.show(2)
 +--------------------+--------------------+------+-----------+--------------------+-------+----------+--------------+
 ```

-A value of `collections,permissions`
+Note that the Spark `show()` function allows for results to be displayed in a vertical format instead of in a table.
+You can more easily see values in the metadata columns by requesting a vertical format and dropping the `content` column:
+
+```
+df.drop("content").show(2, 0, True)
+```
+
+A value of `collections,permissions` will result in the `content` and `format` columns being empty and the `collections`
+and `permissions` columns being populated:

 ```
 df = spark.read.format("marklogic") \
@@ -184,16 +205,20 @@ doc = json.loads(df2.head()['content'])
 doc['Department']
 ```

-## Understanding performance
+## Tuning performance

 The connector mimics the behavior of the [MarkLogic Data Movement SDK](https://docs.marklogic.com/guide/java/data-movement)
 by creating Spark partition readers that are assigned to a specific forest. By default, the connector will create
-4 readers per forest. You can use the `spark.marklogic.read.documents.partitionsPerForest` option to control
-the number of readers. You should adjust this based on your cluster configuration. For example,a default REST API app
-server will have 32 server threads and 3 forests per host. 4 partition readers will thus consume 12 of the 32 server
-threads. If the app server is not servicing any other requests, performance will typically be improved by configuring
-8 partitions per forest. Note that the `spark.marklogic.read.numPartitions` option does not have any impact;
-that is only used when reading via an Optic query.
+4 readers per forest. Each reader will read URIs and documents in a specific range of URIs at a specific MarkLogic
+server timestamp, ensuring both that every matching document is retrieved and that the same document is never returned
+more than once for a query.
+
+You can use the `spark.marklogic.read.documents.partitionsPerForest` option to control the number of readers. You
+should adjust this based on your cluster configuration. For example, a default REST API app server will have 32 server
+threads and 3 forests per host. 4 partition readers will thus utilize 12 of the 32 server threads. If the app server
+is not servicing any other requests, performance will typically be improved by configuring 8 partitions per forest.
+Note that the `spark.marklogic.read.numPartitions` option does not have any impact; that is only used when reading
+via an Optic query.

 Each partition reader will make one to many calls to MarkLogic to retrieve documents. The
 `spark.marklogic.read.batchSize` option controls how many documents will be retrieved in a call. The value defaults
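
For reference, a sketch (illustrative, not from the committed docs) of applying the tuning options discussed above; the partition and batch size values are placeholders you would adjust for your own cluster and app server:

```
# Placeholder tuning values - 8 readers per forest, 200 documents per call to MarkLogic.
df = spark.read.format("marklogic") \
    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003") \
    .option("spark.marklogic.read.documents.collections", "employee") \
    .option("spark.marklogic.read.documents.partitionsPerForest", "8") \
    .option("spark.marklogic.read.batchSize", "200") \
    .load()
```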

docs/writing.md

Lines changed: 39 additions & 4 deletions
@@ -37,22 +37,46 @@ that can be used to define the connection details), and `mode` (which must equal
 the collections, permissions , and URI prefix are optional, though it is uncommon to write documents without any
 permissions.

-### Writing file rows as document
+### Writing file rows as documents

 To support the common use case of reading files and ingesting their contents as-is into MarkLogic, the connector has
 special support for rows with a schema matching that of
 [Spark's binaryFile data source](https://spark.apache.org/docs/latest/sql-data-sources-binaryFile.html). If the incoming
 rows adhere to the `binaryFile` schema, the connector will not serialize the row into JSON. Instead, the connector will
 use the `path` column value as an initial URI for the document and the `content` column value as the document contents.
-
-The URI can then be further adjusted as described in the "Controlling document URIs"
-The URI can then be adjusted as described in the "Controlling documents URIs" section below.
+The URI can then be further adjusted as described in the "Controlling document URIs".

 This feature allows for ingesting files of any type. The MarkLogic REST API will
 [determine the document type](https://docs.marklogic.com/guide/rest-dev/intro#id_53367) based on the URI extension, if
 MarkLogic recognizes it. If MarkLogic does not recognize the extension, and you wish to force a document type on each of
 the documents, you can set the `spark.marklogic.write.files.documentType` option to one of `XML`, `JSON`, or `TEXT`.

+### Writing document rows
+
+As of the 2.2.0 release, you can [read documents from MarkLogic](reading-data/documents.md). A common use case is to then write these rows
+to another database, or another MarkLogic cluster, or even the same database the documents were read from, potentially
+transforming them and altering their URIs.
+
+"Document rows" adhere to the following Spark schema, which is important to understand when writing these rows as
+documents to MarkLogic:
+
+1. `URI` is of type `string`.
+2. `content` is of type `binary`.
+3. `format` is of type `string`.
+4. `collections` is an array of `string`s.
+5. `permissions` is a map with keys of type `string` and values that are arrays of `string`s.
+6. `quality` is an `integer`.
+7. `properties` is a map with keys and values of type `string`.
+8. `metadataValues` is a map with keys and values of type `string`.
+
+Writing rows corresponding to the "document row" schema is largely the same as writing rows of any arbitrary schema,
+but bear in mind these differences:
+
+1. All the column values will be honored if populated.
+2. The `collections` and `permissions` will be replaced - not added to - if the `spark.marklogic.write.collections` and
+`spark.marklogic.write.permissions` options are specified.
+3. The `spark.marklogic.write.uriTemplate` option is less useful as only the `URI` and `format` column values are available for use in the template.
+
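To make the round trip concrete, a hedged sketch (illustrative, not from the committed docs) of reading document rows and writing them back under a different collection; the connection details and collection names are placeholders, and the write assumes the connector's usual `append` save mode:

```
# Read document rows, including their metadata columns, from a placeholder collection.
df = spark.read.format("marklogic") \
    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003") \
    .option("spark.marklogic.read.documents.collections", "employee") \
    .option("spark.marklogic.read.documents.categories", "content,metadata") \
    .load()

# Write the rows back; spark.marklogic.write.collections replaces the collections column values.
df.write.format("marklogic") \
    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003") \
    .option("spark.marklogic.write.collections", "employee-copy") \
    .mode("append") \
    .save()
```
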
 ### Controlling document content

 Rows in a Spark DataFrame are written to MarkLogic by default as JSON documents. Each column in a row becomes a
@@ -198,6 +222,17 @@ Optimizing performance will thus involve testing various combinations of partiti
 counts. The [MarkLogic Monitoring tool](https://docs.marklogic.com/guide/monitoring/intro) can help you understand
 resource consumption and throughput from Spark to MarkLogic.

+**You should take care** not to exceed the number of requests that your MarkLogic cluster can reasonably handle at a
+given time. A general rule of thumb is not to use more threads than the number of hosts multiplied by the number of
+threads per app server. A MarkLogic app server defaults to a limit of 32 threads. So for a 3-host cluster, you should
+not exceed 96 total threads. This also assumes that each host is receiving requests - either via a load balancer placed
+in front of the MarkLogic cluster, or by setting the `spark.marklogic.client.connectionType` option to `direct` when
+the connector can directly connect to each host in the cluster.
+
+The rule of thumb above can thus be expressed as:
+
+Number of partitions * Value of spark.marklogic.write.threadCount <= Number of hosts * number of app server threads
+
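For illustration (not from the committed docs), a sketch of a write sized with the rule of thumb above in mind for the 3-host example: 12 partitions with 8 threads each stays at the 96-thread ceiling. The partition count, thread count, and connection details are placeholder values, and the `append` save mode is assumed:

```
# Placeholder sizing for the hypothetical 3-host cluster above: 12 partitions * 8 threads = 96.
df.repartition(12).write.format("marklogic") \
    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003") \
    .option("spark.marklogic.write.threadCount", "8") \
    .mode("append") \
    .save()
```
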
 ### Error handling

 The connector may throw an error during one of two phases of operation - before it begins to write data to MarkLogic,

src/main/java/com/marklogic/spark/Options.java

Lines changed: 4 additions & 4 deletions
@@ -44,16 +44,16 @@ public abstract class Options {
     // "categories" as defined by https://docs.marklogic.com/REST/GET/v1/documents .
     public static final String READ_DOCUMENTS_CATEGORIES = "spark.marklogic.read.documents.categories";
     public static final String READ_DOCUMENTS_COLLECTIONS = "spark.marklogic.read.documents.collections";
+    public static final String READ_DOCUMENTS_DIRECTORY = "spark.marklogic.read.documents.directory";
+    public static final String READ_DOCUMENTS_OPTIONS = "spark.marklogic.read.documents.options";
+    public static final String READ_DOCUMENTS_PARTITIONS_PER_FOREST = "spark.marklogic.read.documents.partitionsPerForest";
     // Corresponds to "q" at https://docs.marklogic.com/REST/POST/v1/search, known as a "string query".
-    public static final String READ_DOCUMENTS_STRING_QUERY = "spark.marklogic.read.documents.stringQuery";
     // Corresponds to the complex query submitted via the request body at https://docs.marklogic.com/REST/POST/v1/search .
     public static final String READ_DOCUMENTS_QUERY = "spark.marklogic.read.documents.query";
-    public static final String READ_DOCUMENTS_OPTIONS = "spark.marklogic.read.documents.options";
-    public static final String READ_DOCUMENTS_DIRECTORY = "spark.marklogic.read.documents.directory";
+    public static final String READ_DOCUMENTS_STRING_QUERY = "spark.marklogic.read.documents.stringQuery";
     public static final String READ_DOCUMENTS_TRANSFORM = "spark.marklogic.read.documents.transform";
     public static final String READ_DOCUMENTS_TRANSFORM_PARAMS = "spark.marklogic.read.documents.transformParams";
     public static final String READ_DOCUMENTS_TRANSFORM_PARAMS_DELIMITER = "spark.marklogic.read.documents.transformParamsDelimiter";
-    public static final String READ_DOCUMENTS_PARTITIONS_PER_FOREST = "spark.marklogic.read.documents.partitionsPerForest";

     public static final String READ_FILES_COMPRESSION = "spark.marklogic.read.files.compression";
