@@ -8,7 +8,23 @@ The MarkLogic Spark connector allows for data to be retrieved from MarkLogic as
[Optic DSL query](https://docs.marklogic.com/guide/app-dev/OpticAPI#id_46710). The
sections below provide more detail on configuring how data is retrieved and converted into a Spark DataFrame.

- ## Query requirements
+ ## Basic read operation
+
+ As shown in the [Getting Started with PySpark guide](getting-started/pyspark.md), a basic read operation will define
+ how the connector should connect to MarkLogic, the MarkLogic Optic query to run, and zero or more other options:
+
+ ```
+ df = spark.read.format("com.marklogic.spark") \
+     .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8020") \
+     .option("spark.marklogic.read.opticQuery", "op.fromView('example', 'employee')") \
+     .load()
+ ```
+
+ As shown above, `format`, `spark.marklogic.client.uri` (or the other `spark.marklogic.client` options
+ that can be used to define the connection details), and `spark.marklogic.read.opticQuery` are always required. The
+ following sections provide more details about these and other options that can be set.
+
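+ If embedding credentials in a single URI is not desirable, the connection details can instead be supplied via the
+ individual `spark.marklogic.client` options. The sketch below assumes option names of `host`, `port`, `username`,
+ and `password`; see the [configuration options](configuration.md) for the authoritative list:
+
+ ```
+ # A sketch only - verify these option names against the configuration reference.
+ df = spark.read.format("com.marklogic.spark") \
+     .option("spark.marklogic.client.host", "localhost") \
+     .option("spark.marklogic.client.port", "8020") \
+     .option("spark.marklogic.client.username", "spark-example-user") \
+     .option("spark.marklogic.client.password", "password") \
+     .option("spark.marklogic.read.opticQuery", "op.fromView('example', 'employee')") \
+     .load()
+ ```
+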
+ ## Optic query requirements
As of the 2.0 release of the connector, the Optic query must use the
[op.fromView](https://docs.marklogic.com/op.fromView) accessor function. The query must also adhere to the
@@ -87,7 +103,7 @@ stream.stop()
Micro-batches are constructed based on the number of partitions and user-defined batch size; more information on each
setting can be found in the section below on tuning performance. Each request to MarkLogic that is made in "batch read"
mode - i.e. when using Spark's `read` function instead of `readStream` - corresponds to a micro-batch when reading
- data via a stream. In the example above, which uses the connector's default batch size of 10,000 rows and 2
+ data via a stream. In the example above, which uses the connector's default batch size of 100,000 rows and 2
partitions, 2 calls are made to MarkLogic, resulting in two micro-batches.
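
As a rough sketch of this arithmetic - an approximation, since matching rows may not be distributed perfectly
evenly across partitions - the number of calls to MarkLogic, and thus micro-batches, can be estimated as follows:

```
import math

def estimated_micro_batches(total_rows, num_partitions, batch_size):
    # Each partition reader retrieves roughly an equal share of the matching rows,
    # and each call to MarkLogic returns at most batch_size rows.
    rows_per_partition = math.ceil(total_rows / num_partitions)
    return num_partitions * math.ceil(rows_per_partition / batch_size)

# For example, 150,000 matching rows read with 2 partitions and the default
# batch size of 100,000 yields 2 calls to MarkLogic, i.e. two micro-batches.
print(estimated_micro_batches(150000, 2, 100000))  # 2
```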
The number of micro-batches can be determined by enabling info-level logging and looking for a message similar to:
@@ -169,40 +185,46 @@ correct result, please [file an issue with this project](https://github.com/mark
## Tuning performance
- The primary factor affecting how quickly the connector can retrieve rows is MarkLogic's ability to process your Optic
- query. The
- [MarkLogic Optic performance documentation](https://docs.marklogic.com/guide/app-dev/OpticAPI#id_91398) can help with
- optimizing your query to maximize performance.
+ The primary factor affecting connector performance when reading rows is how many requests are made to MarkLogic. In
+ general, performance will be best when minimizing the number of requests to MarkLogic while ensuring that no single
+ request attempts to return or process too much data.

- Two [configuration options](configuration.md) in the connector will also impact performance. First, the
+ Two [configuration options](configuration.md) control how many requests are made. First, the
`spark.marklogic.read.numPartitions` option controls how many partitions are created. For each partition, Spark
will use a separate task to send requests to MarkLogic to retrieve rows matching your Optic DSL query. Second, the
`spark.marklogic.read.batchSize` option controls approximately how many rows will be retrieved in each call to
MarkLogic.
- These two options impact each other in terms of how many tasks are used to make requests to MarkLogic. For example,
- consider an Optic query that matches 1 million rows in MarkLogic, a partition count of 10, and a batch size of
- 10,000 rows (the default value). This configuration will result in the connector creating 10 Spark partition readers,
- each of which will retrieve approximately 100,000 unique rows. And with a batch size of 10,000, each partition
+ To understand how these options control the number of requests to MarkLogic,
+ consider an Optic query that matches 10 million rows in MarkLogic, a partition count of 10, and a batch size of
+ 100,000 rows (the default value). This configuration will result in the connector creating 10 Spark partition readers,
+ each of which will retrieve approximately 1,000,000 unique rows. And with a batch size of 100,000, each partition
reader will make approximately 10 calls to MarkLogic to retrieve these rows, for a total of 100 calls across all
- partitions.
+ partitions.
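+
+ As a concrete sketch of that configuration - the option values below simply mirror the example above:
+
+ ```
+ # 10 partition readers, each making roughly 10 calls of up to 100,000 rows apiece.
+ df = spark.read.format("com.marklogic.spark") \
+     .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8020") \
+     .option("spark.marklogic.read.opticQuery", "op.fromView('example', 'employee')") \
+     .option("spark.marklogic.read.numPartitions", "10") \
+     .option("spark.marklogic.read.batchSize", "100000") \
+     .load()
+ ```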
- Performance can thus be tested by varying the number of partitions and the batch size. In general, increasing the
- number of partitions should help performance as the number of matching rows increases. A single partition may suffice
- for a query that returns thousands of rows or fewer, while a query that returns hundreds of millions of rows will
- benefit from dozens of partitions or more. The ideal settings will depend on your Spark and MarkLogic environments
- along with the complexity of your Optic query. Testing should be performed with different queries, partition counts,
- and batch sizes to determine the optimal settings.
+ Performance should be tested by varying the number of partitions and the batch size. In general, increasing the
+ number of partitions should help performance as the number of rows to return increases. Determining the optimal batch
+ size depends both on the number of columns in each returned row and on what kind of Spark operations are being invoked.
+ The next section describes how the connector optimizes performance when an aggregation is performed, and when the
+ same kind of optimization should be applied manually when only a small number of rows need to be returned.
### Optimizing for smaller result sets
If your Optic query matches a set of rows whose count is a small percentage of the total number of rows in
- the view that the query runs against, you may find improved performance by setting `spark.marklogic.read.batchSize`
- to zero. Doing so ensures that for each partition, a single request is sent to MarkLogic.
-
- If the result set matching your query is particularly small - such as thousands of rows or less, or possibly tens of
- thousands of rows or less - you may find optimal performance by also setting `spark.marklogic.read.numPartitions` to
- one. This will result in the connector sending a single request to MarkLogic.
+ the view that the query runs against, you should find improved performance by setting `spark.marklogic.read.batchSize`
+ to zero. This setting ensures that for each partition, a single request is sent to MarkLogic.
+
+ If your Spark program includes an aggregation that the connector can push down to MarkLogic, then the connector will
+ automatically use a batch size of zero unless you specify a different value for `spark.marklogic.read.batchSize`. This
+ optimization is typically desirable when calculating an aggregation, as MarkLogic will return far fewer rows
+ per request depending on the type of aggregation.
+
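+ As a sketch of what such a program might look like - the `groupBy` and `count` calls below are standard Spark
+ operations, while the 'department' column is a hypothetical column assumed to exist in the example view:
+
+ ```
+ # If the connector can push this aggregation down to MarkLogic, it defaults to a
+ # batch size of zero, so each partition sends a single request.
+ df = spark.read.format("com.marklogic.spark") \
+     .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8020") \
+     .option("spark.marklogic.read.opticQuery", "op.fromView('example', 'employee')") \
+     .load()
+ df.groupBy("department").count().show()
+ ```
+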
+ If the result set matching your query is particularly small - such as tens of thousands of rows or less, or possibly
+ hundreds of thousands of rows or less - you may find optimal performance by setting
+ `spark.marklogic.read.numPartitions` to one. This will result in the connector sending a single request to MarkLogic.
+ The effectiveness of this approach can be evaluated by executing the Optic query via
+ [MarkLogic's qconsole application](https://docs.marklogic.com/guide/qconsole/intro), which will execute the query in
+ a single request as well.
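+
+ A minimal sketch of such a configuration, combining a single partition with a batch size of zero so that the
+ connector sends one request to MarkLogic:
+
+ ```
+ # One partition with a batch size of zero results in a single request to MarkLogic.
+ df = spark.read.format("com.marklogic.spark") \
+     .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8020") \
+     .option("spark.marklogic.read.opticQuery", "op.fromView('example', 'employee')") \
+     .option("spark.marklogic.read.numPartitions", "1") \
+     .option("spark.marklogic.read.batchSize", "0") \
+     .load()
+ ```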
### More detail on partitions