@@ -241,6 +241,46 @@ spark.read.format("com.marklogic.spark") \
.save()
```

+ ### Processing multiple rows in a single call
+
+ By default, the connector sends a single row to your custom code. In many use cases, particularly when writing
+ documents, you will achieve far better performance by configuring the connector to send many rows in a single
+ call to your custom code.
+
+ The configuration option `spark.marklogic.write.batchSize` controls the number of row values sent to the custom code
+ in a single call. If not specified, this defaults to 1 (as opposed to 100 when writing rows as documents). If set to a
+ value greater than one, the values will be sent in the following manner:
+
+ 1. If a custom schema is used, then the JSON objects representing the set of rows in the batch will first be added to a
+ JSON array, and then the array will be set to the external variable.
+ 2. Otherwise, the row values from the "URI" column will be concatenated together with a comma as a delimiter.
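+
+ As a sketch of enabling this from PySpark (the DataFrame `df`, the module path, and the
+ `spark.marklogic.write.invoke` option shown here are illustrative assumptions; only
+ `spark.marklogic.write.batchSize` is defined above), the batch size is set like any other connector option:
+
+ ```
+ df.write.format("com.marklogic.spark") \
+     .option("spark.marklogic.write.invoke", "/process.sjs") \
+     .option("spark.marklogic.write.batchSize", 100) \
+     .mode("append") \
+     .save()
+ ```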
+
+ For approach #2, an alternate delimiter can be configured via `spark.marklogic.write.externalVariableDelimiter`. This
+ is needed if your "URI" values may contain commas. Regardless of the delimiter value, you will
+ typically use code like the following to split the "URI" value into its individual values:
+
+ ```
+ for (var uri of URI.split(',')) {
+   // Process each row value here.
+ }
+ ```
+
+ When using a custom schema, you will typically use [xdmp.fromJSON](https://docs.marklogic.com/xdmp.fromJSON) to convert
+ the value passed to your custom code into a JSON array:
+
+ ```
+ // Assumes that URI is a JSON array node because a custom schema is being used.
+ const array = fn.head(xdmp.fromJSON(URI));
+ ```
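+
+ Each entry in the resulting array then holds the values for one row. A minimal sketch of processing each entry
+ (the document URI scheme and the "id" property are hypothetical):
+
+ ```
+ const rows = fn.head(xdmp.fromJSON(URI));
+ for (const row of rows) {
+   // "row" is a JavaScript object containing the row's column values; for example,
+   // it could be written as a JSON document.
+   xdmp.documentInsert("/example/" + row.id + ".json", xdmp.toJSON(row));
+ }
+ ```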
+
+ Processing multiple rows in a single call can have a significant impact on performance by reducing the number of calls
+ to MarkLogic. For example, if you are writing documents with your custom code, it is recommended to try a batch size of
+ 100 or greater to test how much performance improves. The
+ [MarkLogic monitoring dashboard](https://docs.marklogic.com/guide/monitoring/dashboard) is a very useful tool for
+ examining how many requests are being sent by the connector to MarkLogic and how quickly each request is processed,
+ along with overall resource consumption.
+
### External variable configuration
As shown in the examples above, the custom code for processing a row must have an external variable named "URI". If
@@ -296,27 +336,6 @@ allowing you to access its data:
const doc = fn.head(xdmp.fromJSON(URI));
```
- ### Processing multiple rows in a single call
-
- The configuration option `spark.marklogic.write.batchSize` controls the number of row values sent to the custom code
- in a single call. If not specified, this defaults to 1 (as opposed to 100 when writing rows as documents). If set to a
- value greater than one, the values will be sent in the following manner:
-
- 1. If a custom schema is used, then the JSON objects representing the set of rows in the batch will first be added to a
- JSON array, and then the array will be set to the external variable.
- 2. Otherwise, the row values from the "URI" column will be concatenated together with a comma as a delimiter.
-
- For approach #2, an alternate delimiter can be configured via `spark.marklogic.write.externalVariableDelimiter`. This
- is needed if your "URI" values may contain commas.
-
- When using a custom schema, you will typically use [xdmp.fromJSON](https://docs.marklogic.com/xdmp.fromJSON) to convert
- the value passed to your custom code into a JSON array:
-
- ```
- // Assumes that URI is a JSON array node because a custom schema is being used.
- const array = fn.head(xdmp.fromJSON(URI));
- ```
-
### Streaming support
Spark's support for [streaming writes](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)