docs/writing.md: 46 additions & 0 deletions
@@ -199,6 +199,52 @@ also specify a temporal collection for each document to be assigned to via the
`spark.marklogic.write.temporalCollection`. Each document must define values for the axes associated with the
temporal collection.

### Splitting text and adding embeddings

The 2.5.0 connector release adds support for splitting the text in a document into one or more chunks, which are
written either to the source document or to separate sidecar documents. It also supports adding vector embeddings to chunks.

Please see the [Flux import guide](https://marklogic.github.io/flux/import/import.html) for information on both
features. While the features are primarily intended for use in Flux, both can also be used with the connector
via the options described below.

The options controlling the splitter feature are:

| Option | Description |
| --- | --- |
| spark.marklogic.write.splitter.xpath | Enables the splitter feature by defining an XPath expression for selecting text to split in a document. |
| spark.marklogic.write.splitter.jsonPointers | Enables the splitter feature by defining one or more newline-delimited JSON Pointer expressions for selecting text to split in a document. |
| spark.marklogic.write.splitter.text | Enables the splitter feature by declaring that all the text in a document should be split. This is typically for text documents, but can be used for JSON and XML as well. |
| spark.marklogic.write.splitter.maxChunkSize | Defines the maximum chunk size in characters. Defaults to 1000. |
| spark.marklogic.write.splitter.maxOverlapSize | Defines the maximum overlap size in characters between two chunks. Defaults to 0. |
| spark.marklogic.write.splitter.regex | Defines a regex for splitting text into chunks. The default strategy is LangChain4j's "recursive" strategy that splits on paragraphs, sentences, lines, and words. |
| spark.marklogic.write.splitter.joinDelimiter | Defines a delimiter for use with the splitter regex option. The delimiter joins together two or more chunks identified via the regex to produce a chunk that is as close as possible to the maximum chunk size. |
| spark.marklogic.write.splitter.customClass | Defines the class name of an implementation of LangChain4j's `dev.langchain4j.data.document.DocumentSplitter` interface to be used for splitting the selected text into chunks. |
| spark.marklogic.write.splitter.customClass.option. | Prefix for one or more options to pass in a `Map<String, String>` to the constructor of the custom splitter class. |
| spark.marklogic.write.splitter.sidecar.maxChunks | Configures the connector to write chunks to separate "sidecar" documents instead of to the source document (the default behavior). Defines the maximum number of chunks to write to a sidecar document. |
| spark.marklogic.write.splitter.sidecar.documentType | Defines the type - either JSON or XML - of each chunk document. Defaults to the type of the source document. |
| spark.marklogic.write.splitter.sidecar.collections | Comma-delimited list of collections to assign to each chunk document. |
| spark.marklogic.write.splitter.sidecar.permissions | Comma-delimited list of roles and capabilities to assign to each chunk document. If not defined, chunk documents will inherit the permissions defined by `spark.marklogic.write.permissions`. |
| spark.marklogic.write.splitter.sidecar.rootName | Root name for a JSON or XML sidecar chunk document. |
| spark.marklogic.write.splitter.sidecar.uriPrefix | URI prefix for each sidecar chunk document. If defined, will be followed by a UUID. |
| spark.marklogic.write.splitter.sidecar.uriSuffix | URI suffix for each sidecar chunk document. If defined, will be preceded by a UUID. |
| spark.marklogic.write.splitter.sidecar.xmlNamespace | Namespace for XML sidecar chunk documents. |
| spark.marklogic.xpath. | Prefix for registering XML namespace prefixes and URIs that can be reused in any connector feature that accepts an XPath expression. |

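To make the interaction between the regex, join delimiter, and maximum chunk size options more concrete, the following Python sketch approximates the described behavior: text is split on a regex, and adjacent pieces are then re-joined with a delimiter while the result stays within a maximum size. This is an illustration only and not the connector's implementation. The function name `split_with_regex` and its parameters are hypothetical, and the handling of edge cases (such as a single piece longer than the maximum) is an assumption.

```python
import re


def split_with_regex(text: str, regex: str, join_delimiter: str = " ",
                     max_chunk_size: int = 1000) -> list[str]:
    """Illustrative only: split on a regex, then greedily re-join pieces.

    Assumes the regex has no capturing groups, since re.split would
    otherwise include the captured separators in its result.
    """
    # Drop empty pieces produced by leading or trailing separators.
    pieces = [piece for piece in re.split(regex, text) if piece]

    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + join_delimiter + piece
        if current and len(candidate) > max_chunk_size:
            # Adding this piece would exceed the limit, so close the chunk.
            chunks.append(current)
            current = piece
        else:
            current = candidate

    if current:
        # A single piece longer than max_chunk_size is kept whole here;
        # that choice is an assumption of this sketch.
        chunks.append(current)
    return chunks
```

For example, `split_with_regex("one. two. three. four.", r"\.\s*", " ", 9)` returns `["one two", "three", "four"]`: the first two pieces fit within nine characters once joined, while adding a third would not.
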
The options controlling the embedder feature are:

| Option | Description |
| --- | --- |
| spark.marklogic.write.embedder.modelFunction.className | Enables the embedder feature; name of a class on the classpath that implements the interface `Function<Map<String, String>, EmbeddingModel>`. |
| spark.marklogic.write.embedder.modelFunction.option. | Prefix for each option passed in a `Map<String, String>` to the `apply` method of the model function class. |
| spark.marklogic.write.embedder.chunks.jsonPointer | Defines the location of JSON chunks when using the embedder separate from the splitter. |
| spark.marklogic.write.embedder.text.jsonPointer | Defines the location of text in JSON chunks when using the embedder separate from the splitter. |
| spark.marklogic.write.embedder.chunks.xpath | Defines the location of XML chunks when using the embedder separate from the splitter. |
| spark.marklogic.write.embedder.text.xpath | Defines the location of text in XML chunks when using the embedder separate from the splitter. |
| spark.marklogic.write.embedder.embedding.name | Allows for the embedding name to be customized when the embedding is added to a JSON or XML chunk. |
| spark.marklogic.write.embedder.embedding.namespace | Allows for an optional namespace to be assigned to the embedding element in an XML chunk. |
| spark.marklogic.write.embedder.batchSize | Defines the number of chunks to send to the embedding model in a single call. Defaults to 1. |
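
As a rough sketch of how the splitter and embedder options might be assembled for a PySpark write, the helper below builds a plain dictionary of option names and string values. The helper `build_write_options`, the class name `org.example.MyModelFunction`, and the model option key `model-name` are hypothetical placeholders; only the option names themselves come from the tables above.

```python
from typing import Optional

SPLITTER = "spark.marklogic.write.splitter."
EMBEDDER = "spark.marklogic.write.embedder."


def build_write_options(
    json_pointers: list[str],
    max_chunk_size: int = 1000,
    max_overlap_size: int = 0,
    sidecar_max_chunks: Optional[int] = None,
    sidecar_collections: Optional[list[str]] = None,
    embedder_class: Optional[str] = None,
    embedder_options: Optional[dict[str, str]] = None,
    embedder_batch_size: Optional[int] = None,
) -> dict[str, str]:
    """Hypothetical helper: assemble splitter and embedder write options."""
    options = {
        # JSON Pointer expressions are newline-delimited.
        SPLITTER + "jsonPointers": "\n".join(json_pointers),
        SPLITTER + "maxChunkSize": str(max_chunk_size),
        SPLITTER + "maxOverlapSize": str(max_overlap_size),
    }
    if sidecar_max_chunks is not None:
        # Setting this option switches chunk output to sidecar documents.
        options[SPLITTER + "sidecar.maxChunks"] = str(sidecar_max_chunks)
    if sidecar_collections:
        # Collections are comma-delimited.
        options[SPLITTER + "sidecar.collections"] = ",".join(sidecar_collections)
    if embedder_class:
        # Setting the class name enables the embedder feature.
        options[EMBEDDER + "modelFunction.className"] = embedder_class
        for key, value in (embedder_options or {}).items():
            options[EMBEDDER + "modelFunction.option." + key] = value
        if embedder_batch_size is not None:
            options[EMBEDDER + "batchSize"] = str(embedder_batch_size)
    return options
```

Assuming an existing DataFrame `df` and the connector's usual connection options (not shown), the resulting dictionary could be passed along the lines of `df.write.format("marklogic").options(**options)`. Keeping the values as strings avoids relying on any type conversion when the options are handed to Spark.
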