Commit 9481130

Merge pull request #377 from marklogic/feature/more-docs
Added splitter / embedder options to docs
2 parents 755e92b + cce8aa6 commit 9481130

2 files changed: +56 -5 lines changed

docs/writing.md

Lines changed: 46 additions & 0 deletions
@@ -199,6 +199,52 @@ also specify a temporal collection for each document to be assigned to via the
 `spark.marklogic.write.temporalCollection`. Each document must define values for the axes associated with the
 temporal collection.
 
+### Splitting text and adding embeddings
+
+The 2.5.0 connector release includes support for splitting the text in a document into one or more chunks, either
+written to the document or to separate sidecar documents. It also supports adding vector embeddings to chunks.
+
+Please see the [Flux import guide](https://marklogic.github.io/flux/import/import.html) for information on both
+features. While the features are primarily intended for use in Flux, they can both be used with the connector as well
+via the options described below.
+
+The options controlling the splitter feature are:
+
+| Option | Description |
+| --- | --- |
+| spark.marklogic.write.splitter.xpath | Enables the splitter feature by defining an XPath expression for selecting text to split in a document. |
+| spark.marklogic.write.splitter.jsonPointers | Enables the splitter feature by defining one or more newline-delimited JSON Pointer expressions for selecting text to split in a document. |
+| spark.marklogic.writer.splitter.text | Enables the splitter feature by declaring that all the text in a document should be split. This is typically for text documents, but can be used for JSON and XML as well. |
+| spark.marklogic.write.splitter.maxChunkSize | Defines the maximum chunk size in characters. Defaults to 1000. |
+| spark.marklogic.write.splitter.maxOverlapSize | Defines the maximum overlap size in characters between two chunks. Defaults to 0. |
+| spark.marklogic.write.splitter.regex | Defines a regex for splitting text into chunks. The default strategy is LangChain4J's "recursive" strategy that splits on paragraphs, sentences, lines, and words. |
+| spark.marklogic.splitter.joinDelimiter | Defines a delimiter for usage with the splitter regex option. The delimiter joins together two or more chunks identified via the regex to produce a chunk that is as close as possible to the maximum chunk size. |
+| spark.marklogic.write.splitter.customClass | Defines the class name of an implementation of LangChain4j's `dev.langchain4j.data.document.DocumentSplitter` interface to be used for splitting the selected text into chunks. |
+| spark.marklogic.write.splitter.customClass.option. | Prefix for one or more options to pass in a `Map<String, String>` to the constructor of the custom splitter class. |
+| spark.marklogic.write.splitter.sidecar.maxChunks | Configures the connector to write chunks to separate "sidecar" documents instead of to the source document (the default behavior). Defines the maximum number of chunks to write to a sidecar document. |
+| spark.marklogic.write.splitter.sidecar.documentType | Defines the type - either JSON or XML - of each chunk document. Defaults to the type of the source document. |
+| spark.marklogic.write.splitter.sidecar.collections | Comma-delimited list of collections to assign to each chunk document. |
+| spark.marklogic.write.splitter.sidecar.permissions | Comma-delimited list of roles and capabilities to assign to each chunk document. If not defined, chunk documents will inherit the permissions defined by `spark.marklogic.write.permissions`.
+| spark.marklogic.write.splitter.sidecar.rootName | Root name for a JSON or XML sidecar chunk document. |
+| spark.marklogic.write.splitter.sidecar.uriPrefix | URI prefix for each sidecar chunk document. If defined, will be followed by a UUID. |
+| spark.marklogic.write.splitter.sidecar.uriSuffix | URI suffix for each sidecar chunk document. If defined, will be preceded by a UUID. |
+| spark.marklogic.write.splitter.sidecar.xmlNamespace | Namespace for XML sidecar chunk documents. |
+| spark.marklogic.xpath. | Prefix for registering XML namespace prefixes and URIs that can be reused in any connector feature that accepts an XPath expression. |
+
+The options controlling the embedder feature are:
+
+| Option | Description |
+| --- | --- |
+| spark.marklogic.write.embedder.modelFunction.className | Enables the embedder feature; name of a class on the classpath that implements the interface `Function<Map<String, String>, EmbeddingModel>`. |
+| spark.marklogic.write.embedder.modelFunction.option. | Prefix for each option passed in a `Map<String, String>` to the `apply` method of the model function class. |
+| spark.marklogic.write.embedder.chunks.jsonPointer | Defines the location of JSON chunks when using the embedder separate from the splitter. |
+| spark.marklogic.write.embedder.text.jsonPointer | Defines the location of text in JSON chunks when using the embedder separate from the splitter. |
+| spark.marklogic.write.embedder.chunks.xpath | Defines the location of XML chunks when using the embedder separate from the splitter. |
+| spark.marklogic.write.embedder.text.xpath | Defines the location of text in XML chunks when using the embedder separate from the splitter. |
+| spark.marklogic.write.embedder.embedding.name | Allows for the embedding name to be customized when the embedding is added to a JSON or XML chunk. |
+| spark.marklogic.write.embedder.embedding.namespace | Allows for an optional namespace to be assigned to the embedding element in an XML chunk. |
+| spark.marklogic.write.embedder.batchSize | Defines the number of chunks to send to the embedding model in a single call. Defaults to 1. |
+
 ### Streaming support
 
 The connector supports
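
The splitter options added to `docs/writing.md` in this commit are plain string key/value pairs on a Spark write. The following is a minimal Java sketch of assembling such a set of options, not an excerpt from the connector. Only the option keys come from the table in the diff; the XPath expression, the `ex` namespace prefix and its URI, the collection name, the URI prefix and suffix, and the numeric values are hypothetical. It also assumes that the key after `spark.marklogic.xpath.` is the namespace prefix and the value is the namespace URI.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SplitterOptionsSketch {

    // Assembles write options that enable the splitter via an XPath expression
    // and send the resulting chunks to sidecar documents.
    public static Map<String, String> splitterOptions() {
        Map<String, String> options = new LinkedHashMap<>();

        // Assumption: the suffix after "spark.marklogic.xpath." is the namespace
        // prefix and the value is the namespace URI (both hypothetical here).
        options.put("spark.marklogic.xpath.ex", "org:example");

        // Hypothetical XPath selecting the text to split.
        options.put("spark.marklogic.write.splitter.xpath", "/ex:article/ex:body/text()");

        // Chunk sizing, in characters (documented defaults are 1000 and 0).
        options.put("spark.marklogic.write.splitter.maxChunkSize", "500");
        options.put("spark.marklogic.write.splitter.maxOverlapSize", "50");

        // Write chunks to sidecar documents, at most 5 chunks per document.
        options.put("spark.marklogic.write.splitter.sidecar.maxChunks", "5");
        options.put("spark.marklogic.write.splitter.sidecar.documentType", "JSON");
        options.put("spark.marklogic.write.splitter.sidecar.collections", "chunks");
        options.put("spark.marklogic.write.splitter.sidecar.uriPrefix", "/chunk/");
        options.put("spark.marklogic.write.splitter.sidecar.uriSuffix", ".json");
        return options;
    }

    public static void main(String[] args) {
        // In a Spark job the map would typically be handed to the writer, e.g.
        // dataset.write().format("marklogic").options(splitterOptions())...
        // (the format name and connection options are assumptions, not shown in this commit).
        splitterOptions().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Keeping the options in a map makes it straightforward to pass them to Spark's `DataFrameWriter.options(Map)` alongside whatever connection options the job already uses.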

marklogic-spark-api/src/main/java/com/marklogic/spark/Options.java

Lines changed: 10 additions & 5 deletions
@@ -144,7 +144,7 @@ public abstract class Options {
 
     /**
      * Enables the splitter feature by declaring that all the text in a document should be split. This is typically for
-     * text documents, but could be used for JSON and XML as well.
+     * text documents, but can be used for JSON and XML as well.
      *
      * @since 2.5.0
      */
@@ -165,7 +165,7 @@ public abstract class Options {
     public static final String WRITE_SPLITTER_MAX_OVERLAP_SIZE = "spark.marklogic.write.splitter.maxOverlapSize";
 
     /**
-     * Defines a regex for splitting text into chunks. The default strategy is langchain4's "recursive" strategy that
+     * Defines a regex for splitting text into chunks. The default strategy is LangChain4J's "recursive" strategy that
      * splits on paragraphs, sentences, lines, and words.
      *
      * @since 2.5.0
@@ -181,15 +181,15 @@ public abstract class Options {
     public static final String WRITE_SPLITTER_JOIN_DELIMITER = "spark.marklogic.splitter.joinDelimiter";
 
     /**
-     * Defines the class name of an implementation of langchain4j's {@code dev.langchain4j.data.document.DocumentSplitter}
+     * Defines the class name of an implementation of LangChain4J's {@code dev.langchain4j.data.document.DocumentSplitter}
      * interface to be used for splitting the selected text into chunks.
      *
      * @since 2.5.0
      */
     public static final String WRITE_SPLITTER_CUSTOM_CLASS = "spark.marklogic.write.splitter.customClass";
 
     /**
-     * Defines one or more options to pass in a {@code Map<String, String>} to the constructor of the custom splitter
+     * Prefix for one or more options to pass in a {@code Map<String, String>} to the constructor of the custom splitter
      * class.
      *
      * @since 2.5.0
@@ -299,19 +299,24 @@ public abstract class Options {
     public static final String STREAM_FILES = "spark.marklogic.streamFiles";
 
     /**
-     * Provides a "global" option for registering XML namespace prefixes and URIs that can be reused in any connector
+     * Prefix for registering XML namespace prefixes and URIs that can be reused in any connector
      * feature that accepts an XPath expression.
      *
      * @since 2.5.0
      */
     public static final String XPATH_NAMESPACE_PREFIX = "spark.marklogic.xpath.";
 
     /**
+     * Enables the embedder feature; name of a class on the classpath that implements the interface
+     * {@code Function<Map<String, String>, EmbeddingModel>}.
+     *
      * @since 2.5.0
      */
     public static final String WRITE_EMBEDDER_MODEL_FUNCTION_CLASS_NAME = "spark.marklogic.write.embedder.modelFunction.className";
 
     /**
+     * Prefix for each option passed in a {@code Map<String, String>} to the {@code apply} method of the model function class.
+     *
      * @since 2.5.0
      */
     public static final String WRITE_EMBEDDER_MODEL_FUNCTION_OPTION_PREFIX = "spark.marklogic.write.embedder.modelFunction.option.";
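
Several of the Javadoc changes in this file reword constants as a "prefix" rather than a complete option key. As a rough illustration of what that implies, the sketch below shows one way options carrying such a prefix might be gathered into the `Map<String, String>` that is passed to the model function's `apply` method. It is not the connector's actual implementation, which this commit does not show; the `collect` helper and the `modelName` and `apiKey` suffixes are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

public class PrefixedOptionsSketch {

    // Value of Options.WRITE_EMBEDDER_MODEL_FUNCTION_OPTION_PREFIX in the diff above.
    public static final String PREFIX = "spark.marklogic.write.embedder.modelFunction.option.";

    // Hypothetical helper: keeps only the entries whose key starts with the prefix
    // and strips the prefix, so "<prefix>modelName" becomes "modelName".
    public static Map<String, String> collect(Map<String, String> allOptions, String prefix) {
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, String> entry : allOptions.entrySet()) {
            String key = entry.getKey();
            if (key.startsWith(prefix) && key.length() > prefix.length()) {
                result.put(key.substring(prefix.length()), entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> writeOptions = Map.of(
            PREFIX + "modelName", "example-model",             // hypothetical option
            PREFIX + "apiKey", "changeme",                     // hypothetical option
            "spark.marklogic.write.embedder.batchSize", "10"   // not prefixed, so ignored
        );
        // prints {apiKey=changeme, modelName=example-model}
        System.out.println(collect(writeOptions, PREFIX));
    }
}
```

The same pattern would apply to the other prefix constants touched here, such as `spark.marklogic.write.splitter.customClass.option.` and `spark.marklogic.xpath.`.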
