Commit 2113b88

Merge pull request #292 from marklogic/feature/2.3.2-docs
Added docs for streaming files
2 parents 59d3514 + 046e1a0

1 file changed: 18 additions, 0 deletions
docs/reading-data/reading-files/generic-file-support.md

@@ -44,6 +44,24 @@ The connector also supports the following
- Use `recursiveFileLookup` to include files in child directories.
- Use `modifiedBefore` and `modifiedAfter` to select files based on their modification time.

## Reading and writing large binary files

The 2.3.2 connector introduces a fix for reading and writing large binary files to MarkLogic, allowing the contents
of each file to be streamed from its source to MarkLogic. This avoids an issue where the Spark environment runs out
of memory while trying to fit the contents of a file into an in-memory row.

To enable this, include the following in the set of options passed to your reader:

    .option("spark.marklogic.files.stream", "true")

As a result of this option, the `content` column in each row will not contain the contents of the file. Instead,
it will contain a serialized object intended to be used during the write phase to read the contents of the file as a
stream.

Files read by the MarkLogic Spark connector with the above option can then be written as documents to MarkLogic
by passing the same option to the writer. The connector will then stream the contents of each file to MarkLogic,
submitting one request to MarkLogic per document.
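
For illustration, the following is a minimal PySpark sketch of the streamed read-and-write flow described above. Only
`spark.marklogic.files.stream` is taken from this page; the `marklogic` format name, the file path, and the connection
and write options shown are assumptions that should be adapted to your environment (see the connector's connection and
write documentation for the authoritative option names).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-large-binaries").getOrCreate()

# Read each file as a row, deferring the file contents to the write phase.
# With streaming enabled, the 'content' column holds a serialized handle
# rather than the file's bytes.
files_df = (
    spark.read.format("marklogic")  # assumed format name for the connector
    .option("spark.marklogic.files.stream", "true")
    .load("/path/to/large-binary-files")  # hypothetical path
)

# Write each file as a document; the connector streams each file's contents
# to MarkLogic, one request per document.
(
    files_df.write.format("marklogic")
    .option("spark.marklogic.client.uri", "user:password@localhost:8000")  # assumed connection option
    .option("spark.marklogic.files.stream", "true")
    .option("spark.marklogic.write.permissions", "rest-reader,read,rest-writer,update")  # assumed write option
    .mode("append")
    .save()
)
```

Without the streaming option, the same flow would place each file's full contents into the `content` column, which is
the out-of-memory scenario the fix above is meant to avoid.
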
## Reading any file

If you wish to read files without any special handling provided by the connector, you can use the
