Commit 514601a

Platform: Contextual chunking (#474)
1 parent 2921f51 commit 514601a

2 files changed: +108 -0 lines changed

ui/chunking.mdx

Lines changed: 104 additions & 0 deletions
@@ -159,6 +159,110 @@ To specify this setting, enter a number into the **Similarity threshold** field.
159159

160160
This setting applies only to the chunking strategy **Chunk by similarity**.
161161

162+
## Contextual chunking
163+
164+
A technique known as _contextual chunking_ prepends chunk-specific explanatory context to each chunk.
165+
Contextual chunking has been shown to enhance traditional RAG solutions by yielding
166+
significant improvements in retrieval accuracy, which directly translates to better performance in downstream tasks.
167+
[Learn more](https://www.anthropic.com/news/contextual-retrieval).
168+
169+
To apply contextual chunking, switch on the **Contextual chunking** toggle in the settings for any chunking strategy.
170+
171+
This chunk-specific explanatory context information is typically a couple of sentences in length.
172+
Contextual chunking happens before any embeddings are generated.
173+
174+
When contextual chunking is applied, the contextual information in each chunk begins with `Prefix:` and ends with a semicolon (`;`).
175+
The chunk's original content begins with `Original:`.
176+
177+
For example, without contextual chunking applied, elements would be generated similar to the following.
178+
Line breaks have been inserted here for readability. The output will not contain these line breaks:
179+
180+
```json
181+
{
182+
"type": "CompositeElement",
183+
"element_id": "aa482034de5ade41b7223bb3beeb6a22",
184+
"text": "THE\n\nCONSTITUTION of the United States\n\nG\n\nNATIONAL
185+
CONSTITUTION CENTER\n\nWe the People of the United States, in
186+
Order to form a more perfect Union, establish Justice, insure
187+
...<full-content-redacted-for-brevity>...",
188+
"metadata": {
189+
"filename": "constitution.pdf",
190+
"filetype": "application/pdf",
191+
"languages": [
192+
"eng"
193+
],
194+
"page_number": 1
195+
}
196+
},
197+
{
198+
"type": "CompositeElement",
199+
"element_id": "59fbfcfb51e52c426df4c48a620c6031",
200+
"text": "SECTION. 2\n\nThe House of Representatives shall be
201+
composed of Mem- bers chosen every second Year by the People
202+
of the several States, and the Electors in each State shall
203+
...<full-content-redacted-for-brevity>...",
204+
"metadata": {
205+
"filename": "constitution.pdf",
206+
"filetype": "application/pdf",
207+
"languages": [
208+
"eng"
209+
],
210+
"page_number": 2
211+
}
212+
}
213+
```
214+
215+
Applying contextual chunking to those same elements would result in the following output.
216+
Line breaks and blank lines have been inserted here for readability. The output will not contain these line breaks and blank lines:
217+
218+
```json
219+
{
220+
"type": "CompositeElement",
221+
"element_id": "063ed41d2a989191f2281b2d35c4b4ae",
222+
"text": "Prefix: This is the opening preamble and first section of
223+
Article I of the U.S. Constitution, establishing the fundamental
224+
purpose of the document and the basic structure of legislative
225+
power in Congress. It appears at the very beginning of the main
226+
constitutional text, before all other articles and amendments.;
227+
228+
Original: THE\n\nCONSTITUTION of the United States\n\nG\n\nNATIONAL
229+
CONSTITUTION CENTER\n\nWe the People of the United States, in
230+
Order to form a more perfect Union, establish Justice, insure
231+
...<full-content-redacted-for-brevity>...",
232+
"metadata": {
233+
"filename": "constitution.pdf",
234+
"filetype": "application/pdf",
235+
"languages": [
236+
"eng"
237+
],
238+
"page_number": 1
239+
}
240+
},
241+
{
242+
"type": "CompositeElement",
243+
"element_id": "2270f6b8c8b4afc668f6277789370ffd",
244+
"text": "Prefix: This chunk appears in Article I, Section 2 of the
245+
U.S. Constitution, which establishes the structure, composition,
246+
and powers of the House of Representatives as one of the two
247+
chambers of Congress. It follows Section 1's establishment of
248+
Congress and precedes Section 3's establishment of the Senate.;
249+
250+
Original: SECTION. 2\n\nThe House of Representatives shall be
251+
composed of Mem- bers chosen every second Year by the People
252+
of the several States, and the Electors in each State shall
253+
...<full-content-redacted-for-brevity>...",
254+
"metadata": {
255+
"filename": "constitution.pdf",
256+
"filetype": "application/pdf",
257+
"languages": [
258+
"eng"
259+
],
260+
"page_number": 2
261+
}
262+
}
263+
264+
```
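
The `Prefix:` / `Original:` layout described above can be separated again downstream, for example to display only the original content. The following is a minimal sketch, not part of the platform: the function name `split_contextual_text` is hypothetical, and it assumes that the context ends at the first semicolon that is followed by `Original:`.

```python
import re

# Assumed layout, based on the description above:
#   "Prefix: <context>; Original: <original content>"
# The pattern name and function name are hypothetical, for illustration only.
_CONTEXTUAL_PATTERN = re.compile(
    r"Prefix:\s*(.*?);\s*Original:\s*(.*)",
    re.DOTALL,  # the original content can contain "\n" characters
)


def split_contextual_text(text: str) -> tuple[str, str]:
    """Split a chunk's text into (context, original).

    Returns ("", text) unchanged when the text does not follow the
    contextual chunking layout, for example when the toggle was off.
    """
    match = _CONTEXTUAL_PATTERN.match(text)
    if match is None:
        return "", text
    return match.group(1), match.group(2)


if __name__ == "__main__":
    element = {
        "type": "CompositeElement",
        "text": (
            "Prefix: This chunk appears in Article I, Section 2.; "
            "Original: SECTION. 2\n\nThe House of Representatives"
        ),
    }
    context, original = split_contextual_text(element["text"])
    print("Context: ", context)
    print("Original:", original)
```

Because the match is non-greedy, semicolons inside the original content do not affect where the context ends. A semicolon followed by `Original:` inside the context itself would, so treat this as a convenience helper rather than a strict parser.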
265+
162266
## Learn more
163267

164268
<Icon icon="blog" />&nbsp;&nbsp;[Chunking for RAG: best practices](https://unstructured.io/blog/chunking-for-rag-best-practices).

ui/workflows.mdx

Lines changed: 4 additions & 0 deletions
@@ -241,6 +241,7 @@ import PlatformPartitioningStrategies from '/snippets/general-shared-text/platfo
241241

242242
- **Chunk by title**: Preserve section boundaries and optionally page boundaries as well. A single chunk will never contain text that occurred in two different sections. When a new section starts, the existing chunk is closed and a new one is started, even if the next element would fit in the prior chunk. Also, specify the following:
243243

244+
- **Contextual chunking**: When switched on, prepends chunk-specific explanatory context to each chunk. [Learn more](/ui/chunking#contextual-chunking).
244245
- **Combine text under n chars**: Combine elements until a section reaches a length of this many characters. The default is **0**.
245246
- **Include original elements**: Check this box to output the elements that were used to form a chunk, to appear in the `metadata` field's `orig_elements` field for that chunk. By default, this box is unchecked.
246247
- **Max characters**: Cut off new sections after reaching a length of this many characters. This is a strict limit. The default is **2048**.
@@ -251,6 +252,7 @@ import PlatformPartitioningStrategies from '/snippets/general-shared-text/platfo
251252

252253
- **Chunk by character** (also known as _basic_ chunking): Combine sequential elements to maximally fill each chunk. Also, specify the following:
253254

255+
- **Contextual chunking**: When switched on, prepends chunk-specific explanatory context to each chunk. [Learn more](/ui/chunking#contextual-chunking).
254256
- **Include original elements**: Check this box to output the elements that were used to form a chunk, to appear in the `metadata` field's `orig_elements` field for that chunk. By default, this box is unchecked.
255257
- **Max characters**: Cut off new sections after reaching a length of this many characters. The default is **2048**.
256258
- **New after n chars**: Cut off new sections after reaching a length of this many characters. This is an approximate limit. The default is **1500**.
@@ -259,6 +261,7 @@ import PlatformPartitioningStrategies from '/snippets/general-shared-text/platfo
259261

260262
- **Chunk by page**: Preserve page boundaries. When a new page is detected, the existing chunk is closed and a new one is started, even if the next element would fit in the prior chunk. Also, specify the following:
261263

264+
- **Contextual chunking**: When switched on, prepends chunk-specific explanatory context to each chunk. [Learn more](/ui/chunking#contextual-chunking).
262265
- **Include original elements**: Check this box to output the elements that were used to form a chunk, to appear in the `metadata` field's `orig_elements` field for that chunk. By default, this box is unchecked.
263266
- **Max characters**: Cut off new sections after reaching a length of this many characters. This is a strict limit. The default is **500**.
264267
- **New after n chars**: Cut off new sections after reaching a length of this many characters. This is an approximate limit. The default is **50**.
@@ -267,6 +270,7 @@ import PlatformPartitioningStrategies from '/snippets/general-shared-text/platfo
267270

268271
- **Chunk by similarity**: Use the [sentence-transformers/multi-qa-mpnet-base-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1) embedding model to identify topically similar sequential elements and combine them into chunks. Also, specify the following:
269272

273+
- **Contextual chunking**: When switched on, prepends chunk-specific explanatory context to each chunk. [Learn more](/ui/chunking#contextual-chunking).
270274
- **Include original elements**: Check this box to output the elements that were used to form a chunk, to appear in the `metadata` field's `orig_elements` field for that chunk. By default, this box is unchecked.
271275
- **Max characters**: Cut off new sections after reaching a length of this many characters. This is a strict limit. The default is **500**.
272276
- **Similarity threshold**: Specify a threshold between 0 and 1 exclusive (0.01 to 0.99 inclusive), where 0 indicates completely dissimilar vectors and 1 indicates identical vectors, taking into consideration the trade-offs between precision (a higher threshold) and recall (a lower threshold). The default is **0.5**. [Learn more](https://towardsdatascience.com/introduction-to-embedding-clustering-and-similarity-11dd80b00061).
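
To illustrate how a similarity threshold trades precision against recall, the following is a minimal sketch that groups sequential elements by comparing each element's vector with the previous one. It is not the platform's implementation: the function names are hypothetical, cosine similarity is an assumption, the vectors are toy values rather than real embeddings, and the **Max characters** limit is ignored.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_by_similarity(
    vectors: list[list[float]], threshold: float = 0.5
) -> list[list[int]]:
    """Group indices of sequential vectors (hypothetical helper).

    An element joins the current group when its similarity to the
    previous element is at least `threshold`; otherwise it starts
    a new group.
    """
    groups: list[list[int]] = []
    for i, vec in enumerate(vectors):
        if groups and cosine_similarity(vectors[i - 1], vec) >= threshold:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups


if __name__ == "__main__":
    # Toy 2-D vectors standing in for element embeddings.
    toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    for threshold in (0.1, 0.5, 0.999):
        print(threshold, group_by_similarity(toy_vectors, threshold))
```

In this sketch, a higher threshold produces more, smaller groups (only very similar neighbors are combined), while a lower threshold produces fewer, larger groups.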
