ui/workflows.mdx: 4 lines changed (4 additions & 0 deletions)
@@ -241,6 +241,7 @@ import PlatformPartitioningStrategies from '/snippets/general-shared-text/platfo
- **Chunk by title**: Preserve section boundaries and optionally page boundaries as well. A single chunk will never contain text that occurred in two different sections. When a new section starts, the existing chunk is closed and a new one is started, even if the next element would fit in the prior chunk. Also, specify the following:
+ - **Contextual chunking**: When switched on, prepends chunk-specific explanatory context to each chunk. [Learn more](/ui/chunking#contextual-chunking).
- **Combine text under n chars**: Combine elements until a section reaches a length of this many characters. The default is **0**.
- **Include original elements**: Check this box to output the elements that were used to form a chunk, to appear in the `metadata` field's `orig_elements` field for that chunk. By default, this box is unchecked.
- **Max characters**: Cut off new sections after reaching a length of this many characters. This is a strict limit. The default is **2048**.
@@ -251,6 +252,7 @@ import PlatformPartitioningStrategies from '/snippets/general-shared-text/platfo
- **Chunk by character** (also known as _basic_ chunking): Combine sequential elements to maximally fill each chunk. Also, specify the following:
+ - **Contextual chunking**: When switched on, prepends chunk-specific explanatory context to each chunk. [Learn more](/ui/chunking#contextual-chunking).
- **Include original elements**: Check this box to output the elements that were used to form a chunk, to appear in the `metadata` field's `orig_elements` field for that chunk. By default, this box is unchecked.
- **Max characters**: Cut off new sections after reaching a length of this many characters. The default is **2048**.
- **New after n chars**: Cut off new sections after reaching a length of this many characters. This is an approximate limit. The default is **1500**.
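To illustrate how a strict **Max characters** limit and an approximate **New after n chars** limit can interact, here is a minimal Python sketch. It is not the platform's implementation: the function name `chunk_basic`, the `sep` parameter, and the greedy packing logic are assumptions made for illustration only.

```python
def chunk_basic(elements, max_characters=2048, new_after_n_chars=1500, sep="\n\n"):
    """Hypothetical sketch: greedily pack sequential text elements into chunks.

    max_characters is treated as a hard limit (no chunk exceeds it);
    new_after_n_chars is treated as a soft limit (a chunk is closed once it
    reaches this length, even if more text would still fit).
    """
    chunks = []
    current = ""
    for text in elements:
        if not text:
            continue
        # Split oversized elements so the hard limit always holds.
        pieces = [text[i:i + max_characters] for i in range(0, len(text), max_characters)]
        for piece in pieces:
            candidate = piece if not current else current + sep + piece
            if current and len(candidate) > max_characters:
                # Adding this piece would break the hard limit: close the chunk.
                chunks.append(current)
                current = piece
            else:
                current = candidate
            if len(current) >= new_after_n_chars:
                # Soft limit reached: close the chunk early.
                chunks.append(current)
                current = ""
    if current:
        chunks.append(current)
    return chunks


demo = chunk_basic(["a" * 10, "b" * 10, "c" * 30],
                   max_characters=25, new_after_n_chars=15, sep=" ")
for chunk in demo:
    print(len(chunk), chunk)
```

In this sketch the first two elements are combined and the chunk is then closed because it has passed the soft limit, while the 30-character element is split so that no chunk exceeds the hard limit.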
@@ -259,6 +261,7 @@ import PlatformPartitioningStrategies from '/snippets/general-shared-text/platfo
- **Chunk by page**: Preserve page boundaries. When a new page is detected, the existing chunk is closed and a new one is started, even if the next element would fit in the prior chunk. Also, specify the following:
+ - **Contextual chunking**: When switched on, prepends chunk-specific explanatory context to each chunk. [Learn more](/ui/chunking#contextual-chunking).
- **Include original elements**: Check this box to output the elements that were used to form a chunk, to appear in the `metadata` field's `orig_elements` field for that chunk. By default, this box is unchecked.
- **Max characters**: Cut off new sections after reaching a length of this many characters. This is a strict limit. The default is **500**.
- **New after n chars**: Cut off new sections after reaching a length of this many characters. This is an approximate limit. The default is **50**.
@@ -267,6 +270,7 @@ import PlatformPartitioningStrategies from '/snippets/general-shared-text/platfo
- **Chunk by similarity**: Use the [sentence-transformers/multi-qa-mpnet-base-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1) embedding model to identify topically similar sequential elements and combine them into chunks. Also, specify the following:
+ - **Contextual chunking**: When switched on, prepends chunk-specific explanatory context to each chunk. [Learn more](/ui/chunking#contextual-chunking).
- **Include original elements**: Check this box to output the elements that were used to form a chunk, to appear in the `metadata` field's `orig_elements` field for that chunk. By default, this box is unchecked.
- **Max characters**: Cut off new sections after reaching a length of this many characters. This is a strict limit. The default is **500**.
- **Similarity threshold**: Specify a threshold between 0 and 1 exclusive (0.01 to 0.99 inclusive), where 0 indicates completely dissimilar vectors and 1 indicates identical vectors, taking into consideration the trade-offs between precision (a higher threshold) and recall (a lower threshold). The default is **0.5**. [Learn more](https://towardsdatascience.com/introduction-to-embedding-clustering-and-similarity-11dd80b00061).
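The effect of the similarity threshold can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the platform's implementation: the function names are hypothetical, the vectors are hand-written stand-ins for real embeddings, and cosine similarity is assumed as the comparison measure. For brevity, a single element longer than `max_characters` is kept whole here, whereas a strict limit would also split it.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def chunk_by_similarity(items, threshold=0.5, max_characters=500, sep=" "):
    """Hypothetical sketch: merge sequential (text, vector) pairs into chunks.

    An element joins the current chunk only if it is similar enough to the
    previous element and the combined text still fits within max_characters.
    """
    chunks = []
    current, prev_vec = "", None
    for text, vec in items:
        fits = len(current) + len(sep) + len(text) <= max_characters
        if current and fits and cosine(prev_vec, vec) >= threshold:
            current = current + sep + text
        else:
            if current:
                chunks.append(current)
            current = text
        prev_vec = vec
    if current:
        chunks.append(current)
    return chunks


# Toy vectors standing in for embeddings (assumed values, not model output).
items = [
    ("cats purr", [1.0, 0.0]),
    ("cats meow", [0.9, 0.1]),
    ("stocks fell", [0.0, 1.0]),
]
print(chunk_by_similarity(items, threshold=0.5))
```

In this sketch, raising the threshold produces more, smaller chunks with tighter topical focus, while lowering it merges more elements into fewer, broader chunks.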