Commit 2113b88

Merge pull request #292 from marklogic/feature/2.3.2-docs
Added docs for streaming files
2 parents 59d3514 + 046e1a0

1 file changed: 18 additions, 0 deletions
docs/reading-data/reading-files/generic-file-support.md

@@ -44,6 +44,24 @@ The connector also supports the following
- Use `recursiveFileLookup` to include files in child directories.
- Use `modifiedBefore` and `modifiedAfter` to select files based on their modification time.

## Reading and writing large binary files

The 2.3.2 connector introduces a fix for reading and writing large binary files to MarkLogic, allowing the contents
of each file to be streamed from its source to MarkLogic. This avoids an issue where the Spark environment runs out
of memory while trying to fit the contents of a file into an in-memory row.

To enable this, include the following in the set of options passed to your reader:

    .option("spark.marklogic.files.stream", "true")

As a result of this option, the `content` column in each row will not contain the contents of the file. Instead,
it will contain a serialized object intended to be used during the write phase to read the contents of the file as a
stream.

Files read by the MarkLogic Spark connector with the above option can then be written as documents to MarkLogic
by passing the same option to the writer. The connector will then stream the contents of each file to MarkLogic,
submitting one request to MarkLogic per document.
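
For illustration, the following is a minimal PySpark sketch of the streamed read-and-write flow described above. Only
`spark.marklogic.files.stream` is taken from this page; the `marklogic` format name, the file path, and the connection
and write options shown are assumptions that should be adapted to your environment (see the connector's connection and
write documentation for the authoritative option names).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-large-binaries").getOrCreate()

# Read each file as a row, deferring the file contents to the write phase.
# With streaming enabled, the 'content' column holds a serialized handle
# rather than the file's bytes.
files_df = (
    spark.read.format("marklogic")  # assumed format name for the connector
    .option("spark.marklogic.files.stream", "true")
    .load("/path/to/large-binary-files")  # hypothetical path
)

# Write each file as a document; the connector streams each file's contents
# to MarkLogic, one request per document.
(
    files_df.write.format("marklogic")
    .option("spark.marklogic.client.uri", "user:password@localhost:8000")  # assumed connection option
    .option("spark.marklogic.files.stream", "true")
    .option("spark.marklogic.write.permissions", "rest-reader,read,rest-writer,update")  # assumed write option
    .mode("append")
    .save()
)
```

Without the streaming option, the same flow would place each file's full contents into the `content` column, which is
the out-of-memory scenario the fix above is meant to avoid.
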
## Reading any file

If you wish to read files without any special handling provided by the connector, you can use the
