
Commit c830457

Bumped up section on processing multiple rows
1 parent 1c133f0 commit c830457

1 file changed

docs/writing.md

Lines changed: 40 additions & 21 deletions
@@ -241,6 +241,46 @@ spark.read.format("com.marklogic.spark") \
.save()
```

### Processing multiple rows in a single call

By default, a single row is sent by the connector to your custom code. In many use cases, particularly when writing
documents, you will achieve far better performance when configuring the connector to send many rows in a single
call to your custom code.

The configuration option `spark.marklogic.write.batchSize` controls the number of row values sent to the custom code
in a single call. If not specified, this defaults to 1 (as opposed to 100 when writing rows as documents). If set to a
value greater than one, the values will be sent in one of the following ways (a configuration sketch follows the list):

1. If a custom schema is used, the JSON objects representing the rows in the batch will first be added to a
JSON array, and then the array will be set to the external variable.
2. Otherwise, the row values from the "URI" column will be concatenated together with a comma as a delimiter.
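
As a concrete illustration of the batch size option, here is a minimal PySpark sketch. Only
`spark.marklogic.write.batchSize` is taken from this section; the connection option, the custom-code option, and the
connection string (`spark.marklogic.client.uri`, `spark.marklogic.write.javascript`, and the example host and
credentials) are assumptions modeled on the earlier examples in this guide and may need to be adjusted for your
environment.

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A single-column "URI" DataFrame, matching the default behavior described in approach #2 above.
df = spark.createDataFrame([("/example/1.json",), ("/example/2.json",)], ["URI"])

# With a batch size greater than 1, the URI variable in the custom code receives a
# comma-delimited list of row values rather than a single value.
df.write.format("com.marklogic.spark") \
    .option("spark.marklogic.client.uri", "user:password@localhost:8000") \
    .option("spark.marklogic.write.javascript", "xdmp.log(URI)") \
    .option("spark.marklogic.write.batchSize", "100") \
    .mode("append") \
    .save()
```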

For approach #2, an alternate delimiter can be configured via `spark.marklogic.write.externalVariableDelimiter`. This
is needed if your "URI" values may contain commas. Regardless of the delimiter value, you will typically use code like
the following to split the "URI" value into individual row values:

```
for (var uri of URI.split(',')) {
  // Process each row value here.
}
```
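
Building on the sketch above (same assumed connection and custom-code option names), a hedged example of setting an
alternate delimiter so that "URI" values containing commas are not split incorrectly:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# These URI values legitimately contain commas, so the default comma delimiter would be ambiguous.
df = spark.createDataFrame([("/example/a,1.json",), ("/example/b,2.json",)], ["URI"])

df.write.format("com.marklogic.spark") \
    .option("spark.marklogic.client.uri", "user:password@localhost:8000") \
    .option("spark.marklogic.write.javascript", "for (var uri of URI.split(';')) { xdmp.log(uri); }") \
    .option("spark.marklogic.write.batchSize", "2") \
    .option("spark.marklogic.write.externalVariableDelimiter", ";") \
    .mode("append") \
    .save()
```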

When using a custom schema, you will typically use [xdmp.fromJSON](https://docs.marklogic.com/xdmp.fromJSON) to convert
the value passed to your custom code into a JSON array:

```
// Assumes that URI is a JSON array node because a custom schema is being used.
const array = fn.head(xdmp.fromJSON(URI));
```
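
For approach #1, a sketch along the same lines is shown below; it assumes, per the description above, that writing rows
with a custom schema and a batch size greater than one results in the custom code receiving a JSON array. As in the
previous sketches, the connection and custom-code option names and credentials are assumptions.

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rows using a custom schema; per the description above, each batch should arrive in the
# custom code as a JSON array of objects rather than a delimited string.
df = spark.createDataFrame(
    [("/example/1.json", "red"), ("/example/2.json", "blue")],
    ["uri", "color"]
)

df.write.format("com.marklogic.spark") \
    .option("spark.marklogic.client.uri", "user:password@localhost:8000") \
    .option("spark.marklogic.write.javascript",
            "const array = fn.head(xdmp.fromJSON(URI)); xdmp.log(array.length);") \
    .option("spark.marklogic.write.batchSize", "10") \
    .mode("append") \
    .save()
```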

Processing multiple rows in a single call can have a significant impact on performance by reducing the number of calls
to MarkLogic. For example, if you are writing documents with your custom code, it is recommended to try a batch size of
100 or greater to test how much performance improves. The
[MarkLogic monitoring dashboard](https://docs.marklogic.com/guide/monitoring/dashboard) is a very useful tool for
examining how many requests are being sent by the connector to MarkLogic and how quickly each request is processed,
along with overall resource consumption.

### External variable configuration

As shown in the examples above, the custom code for processing a row must have an external variable named "URI". If
@@ -296,27 +336,6 @@ allowing you to access its data:
const doc = fn.head(xdmp.fromJSON(URI));
```

[21 lines removed: the original "### Processing multiple rows in a single call" section, whose content now appears, expanded, in the relocated section above.]

### Streaming support

Spark's support for [streaming writes](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)
