[Frontend] Add chunked processing to handle long inputs in embedding models #20837

Open · wants to merge 17 commits into base: main

Commits
5398bbd
Add a chunking processing function that supports long - text embeddin…
x22x22 Jul 11, 2025
2b80b14
Rectify the code formatting issues, disable yapf to prevent conflicts…
x22x22 Jul 11, 2025
b7e10b8
Optimize the embedding processing logic, add checks for text token pr…
x22x22 Jul 11, 2025
39d2abd
Added multiple long-text batch processing tests to verify the uniquen…
x22x22 Jul 11, 2025
327f700
Added multiple long-text batch processing tests to verify the uniquen…
x22x22 Jul 11, 2025
85c28b9
Rectify the numbering errors in the document by changing the number o…
x22x22 Jul 11, 2025
f36047d
Update the long - text service script. Add a new variable named MODEL…
x22x22 Jul 11, 2025
da81267
Multiple long - text batch processing tests have been newly added to …
x22x22 Jul 11, 2025
5573882
Update the documentation and examples to support the new `max_embed_l…
x22x22 Jul 13, 2025
4cbcf90
Update the example code to support the new `max_embed_len` parameter,…
x22x22 Jul 13, 2025
a5432ac
The documentation and examples have been updated to support the enhan…
x22x22 Jul 14, 2025
d7924b9
fix(embedding): optimize LAST/CLS pooling in chunked processing
x22x22 Jul 15, 2025
b2116bd
fix: implement online aggregation for chunked embedding processing
x22x22 Jul 15, 2025
6e5d8ee
fix pre-commit errors
x22x22 Jul 15, 2025
4eb3bef
Update the documentation and examples to support the enhanced chunk p…
x22x22 Jul 18, 2025
681e39d
Merge main into feat/support-long-text-embedding - resolve conflicts
x22x22 Jul 18, 2025
2a39548
In the EmbeddingMixin class, add validation for pooling parameters to…
x22x22 Jul 20, 2025
173 changes: 172 additions & 1 deletion docs/models/pooling_models.md
@@ -32,6 +32,177 @@ we attempt to override the default pooler based on its Sentence Transformers con
You can customize the model's pooling method via the `--override-pooler-config` option,
which takes priority over both the model's and Sentence Transformers's defaults.

## Chunked Processing for Long Text

vLLM supports **chunked processing** for embedding models to handle text inputs that exceed the model's maximum token length. This feature automatically splits long text into manageable chunks, processes them separately, and aggregates the results.

### Supported Models

Chunked processing is supported for the following embedding models:

- `intfloat/multilingual-e5-large` (recommended pooling type: `MEAN`)
- `jinaai/jina-embeddings-v3` (recommended pooling type: `MEAN`)
- `jinaai/jina-embeddings-v4-vllm-retrieval` (recommended pooling type: `MEAN`)
- `Qwen/Qwen3-Embedding-4B` (recommended pooling type: `MEAN`)

Other embedding models can be extended to support this feature by ensuring proper pooling type compatibility.

### How Chunked Processing Works

1. **Flexible Input Validation**: Configure `max_embed_len` to accept inputs longer than `max_model_len` without environment variables
2. **Smart Chunking**: Text is split based on `max_position_embeddings` to maintain semantic integrity
3. **Parallel Processing**: Each chunk is processed independently through the model
4. **Intelligent Aggregation**: Results are combined using weighted averaging based on chunk token counts
5. **Consistent Output**: Final embeddings maintain the same dimensionality as standard processing
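
The flow above can be sketched in a few lines of Python. This is an illustrative outline only, not vLLM's internal implementation; `chunked_embed`, `embed_chunk`, and `max_chunk_size` are hypothetical names standing in for the real components.

```python
import numpy as np

# Illustrative sketch of the steps above -- not vLLM's internal code.
# `embed_chunk` is a hypothetical stand-in for one forward pass over a chunk.
def chunked_embed(token_ids, max_chunk_size, embed_chunk):
    # 2. Split the token sequence into chunks that fit the position limit
    chunks = [token_ids[i:i + max_chunk_size]
              for i in range(0, len(token_ids), max_chunk_size)]

    # 3. Embed each chunk independently
    embeddings = [np.asarray(embed_chunk(chunk)) for chunk in chunks]
    weights = [len(chunk) for chunk in chunks]

    # 4. Token-count-weighted average (MEAN pooling aggregation)
    weighted_sum = sum(e * w for e, w in zip(embeddings, weights))
    final = weighted_sum / sum(weights)

    # 5. Same dimensionality as a single-chunk embedding; re-normalized here,
    #    assuming the pooler is configured with "normalize": true
    return final / np.linalg.norm(final)
```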

### Configuration

Enable chunked processing and configure maximum embedding input length:

```bash
# MEAN pooling (recommended for chunked processing)
vllm serve intfloat/multilingual-e5-large \
    --task embed \
    --override-pooler-config '{"pooling_type": "MEAN", "normalize": true, "enable_chunked_processing": true, "max_embed_len": 3072000}' \
    --trust-remote-code

# CLS pooling (processes only first chunk)
vllm serve BAAI/bge-large-en-v1.5 \
    --task embed \
    --override-pooler-config '{"pooling_type": "CLS", "normalize": true, "enable_chunked_processing": true, "max_embed_len": 1048576, "allow_non_mean_chunking": true}' \
    --trust-remote-code
```

#### Configuration Parameters

- `enable_chunked_processing`: Enable chunked processing for long inputs (default: `false`)
- `max_embed_len`: Maximum input length allowed for embedding generation (default: `null`)
- When set, allows inputs longer than `max_model_len` without requiring `VLLM_ALLOW_LONG_MAX_MODEL_LEN`
- Inputs exceeding `max_embed_len` are rejected with clear error messages
- Chunking is triggered when inputs exceed `max_position_embeddings`
- `allow_non_mean_chunking`: Allow non-MEAN pooling types with chunked processing (default: `false`)
- When `false`: CLS/LAST pooling types show warnings and may be disabled
- When `true`: Explicitly enables CLS/LAST pooling with performance optimizations
- Required to suppress warnings for non-MEAN pooling types
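
The interaction between the three length limits can be summarized as follows. This is a simplified sketch of the behavior described above, with a hypothetical helper name; the actual validation code and error text in vLLM may differ.

```python
def classify_embedding_input(num_tokens: int,
                             max_embed_len: int | None,
                             max_position_embeddings: int) -> str:
    """Simplified sketch of the validation flow described above (not vLLM's exact code)."""
    if max_embed_len is not None and num_tokens > max_embed_len:
        # Rejected outright; the real server returns a descriptive error message
        raise ValueError(
            f"input of {num_tokens} tokens exceeds max_embed_len={max_embed_len}")
    if num_tokens > max_position_embeddings:
        return "chunked"    # handled via chunked processing
    return "standard"       # fits in a single forward pass
```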

### Aggregation Algorithm

Chunked processing uses a different aggregation strategy depending on the pooling type:

#### MEAN Pooling (Recommended)
Uses weighted averaging across all chunks:

```python
# Weighted average: sum(embedding_i * token_count_i) / total_tokens
weighted_sum = sum(embeddings[i] * weights[i] for i in range(num_chunks))
final_embedding = weighted_sum / sum(weights)
```

This ensures that longer chunks contribute proportionally more to the final representation.
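
As a quick sanity check with toy values (not real embeddings), a 512-token chunk and a 100-token chunk are weighted roughly 84% / 16%:

```python
import numpy as np

# Toy example: two chunks of 512 and 100 tokens with 2-D "embeddings"
e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
w1, w2 = 512, 100
final = (e1 * w1 + e2 * w2) / (w1 + w2)
print(final)  # ≈ [0.8366 0.1634] -- the 512-token chunk dominates the result
```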

#### CLS Pooling (Performance Optimized)
Only processes the **first chunk** to avoid computational waste:

```python
# CLS pooling: only the first chunk contains the CLS token
final_embedding = first_chunk_embedding
```

Note: This may lose information from later parts of the text.

#### LAST Pooling (Performance Optimized)
Only processes the **last chunk** to avoid computational waste:

```python
# LAST pooling: only the last chunk contains the final token
final_embedding = last_chunk_embedding
```

Note: This may lose information from earlier parts of the text.

### Performance Characteristics

| Pooling Type | Chunks Processed | Processing Time | Semantic Coverage | Best Use Case |
|--------------|------------------|-----------------|-------------------|---------------|
| **MEAN** | All chunks | Highest (all chunks) | Complete | General purpose, long documents |
| **CLS** | First chunk only | Lowest (1 chunk) | Limited to start | Classification, when start matters |
| **LAST** | Last chunk only | Lowest (1 chunk) | Limited to end | When ending matters |

| Aspect | Short Text (≤ max_position_embeddings) | Long Text (> max_position_embeddings) |
|--------|----------------------------------------|---------------------------------------|
| **Processing Time** | Standard | Varies by pooling type (CLS/LAST: minimal, MEAN: increased) |
| **Memory Usage** | Standard | Reduced (chunks processed separately) |
| **Quality** | Standard | Depends on pooling type and content distribution |
| **Compatibility** | Full | Full (backward compatible) |
| **Input Validation** | Standard max_model_len check | Extended max_embed_len check |

#### Extreme Long Text Support

With the enhanced `max_embed_len` configuration (up to 3M+ tokens), you can process:
- **Complete Documents**: Research papers, legal contracts, technical manuals
- **Large Codebases**: Entire repositories and documentation
- **Books and Literature**: Full chapters or small books
- **Multi-document Analysis**: Combined content for comprehensive understanding

### Example Usage

#### Basic Configuration

```python
from openai import OpenAI

client = OpenAI(
    api_key="your-api-key",
    base_url="http://localhost:31090/v1"
)

# This will automatically use chunked processing for very long text
# max_embed_len=3072000 allows inputs up to 3M+ tokens
response = client.embeddings.create(
input="Very long text that exceeds the model's position embeddings..." * 5000,
model="multilingual-e5-large"
)

print(f"Embedding dimension: {len(response.data[0].embedding)}")
```
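
Because the aggregated embedding keeps the standard dimensionality, it can be compared directly with embeddings of short texts. A small follow-up to the snippet above (reusing the same `client`, `response`, and assumed model name and endpoint):

```python
import numpy as np

query = client.embeddings.create(
    input="What does the document say about chunked processing?",
    model="multilingual-e5-large",
)

doc_vec = np.array(response.data[0].embedding)
query_vec = np.array(query.data[0].embedding)

# Cosine similarity between the chunked long-document embedding and a short query
similarity = doc_vec @ query_vec / (np.linalg.norm(doc_vec) * np.linalg.norm(query_vec))
print(f"Cosine similarity: {similarity:.4f}")
```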

#### Alternative Model Configurations

```bash
# For Jina embeddings v3 (optimized for performance)
vllm serve jinaai/jina-embeddings-v3 \
    --task embed \
    --override-pooler-config '{"pooling_type": "MEAN", "normalize": true, "enable_chunked_processing": true, "max_embed_len": 1048576}' \
    --trust-remote-code

# For Jina embeddings v4 (latest retrieval model)
vllm serve jinaai/jina-embeddings-v4-vllm-retrieval \
    --task embed \
    --override-pooler-config '{"pooling_type": "MEAN", "normalize": true, "enable_chunked_processing": true, "max_embed_len": 2097152}' \
    --trust-remote-code

# For Qwen3 Embedding (large-scale multilingual)
vllm serve Qwen/Qwen3-Embedding-4B \
    --task embed \
    --override-pooler-config '{"pooling_type": "MEAN", "normalize": true, "enable_chunked_processing": true, "max_embed_len": 1572864}' \
    --trust-remote-code
```

### Logging and Monitoring

When chunked processing is active, you'll see informative log messages:

```
INFO: Input length 100000 exceeds max_position_embeddings 512, will use chunked processing
INFO: Split input of 100000 tokens into 196 chunks (max_chunk_size: 512)
```

### Limitations

- **Increased Latency**: Processing multiple chunks takes longer than single-chunk processing
- **Model Support**: Currently limited to specific embedding models
- **Context Boundaries**: Chunking may split related content, though weighted averaging helps preserve overall semantics

## Offline Inference

The [LLM][vllm.LLM] class provides various methods for offline inference.
@@ -170,7 +341,7 @@ vllm serve jinaai/jina-embeddings-v3 --trust-remote-code
You can change the output dimensions of embedding models that support Matryoshka Embeddings by using the dimensions parameter.

```text
curl http://127.0.0.1:8000/v1/embeddings \
curl http://127.0.0.1:31090/v1/embeddings \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
5 changes: 4 additions & 1 deletion docs/models/supported_models.md
@@ -422,7 +422,7 @@ Specified using `--task embed`.
| `GteNewModel` | mGTE-TRM (see note) | `Alibaba-NLP/gte-multilingual-base`, etc. | | | |
| `ModernBertModel` | ModernBERT-based | `Alibaba-NLP/gte-modernbert-base`, etc. | | | |
| `NomicBertModel` | Nomic BERT | `nomic-ai/nomic-embed-text-v1`, `nomic-ai/nomic-embed-text-v2-moe`, `Snowflake/snowflake-arctic-embed-m-long`, etc. | | | |
| `LlamaModel`, `LlamaForCausalLM`, `MistralModel`, etc. | Llama-based | `intfloat/e5-mistral-7b-instruct`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `LlamaModel`, `LlamaForCausalLM`, `MistralModel`, etc. | Llama-based | `intfloat/e5-mistral-7b-instruct`, `intfloat/multilingual-e5-large` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
| `Qwen2Model`, `Qwen2ForCausalLM` | Qwen2-based | `ssmits/Qwen2-7B-Instruct-embed-base` (see note), `Alibaba-NLP/gte-Qwen2-7B-instruct` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
| `Qwen3Model`, `Qwen3ForCausalLM` | Qwen3-based | `Qwen/Qwen3-Embedding-0.6B`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `RobertaModel`, `RobertaForMaskedLM` | RoBERTa-based | `sentence-transformers/all-roberta-large-v1`, etc. | | | |
@@ -441,6 +441,9 @@ Specified using `--task embed`.
!!! note
The second-generation GTE model (mGTE-TRM) is named `NewModel`. The name `NewModel` is too generic, you should set `--hf-overrides '{"architectures": ["GteNewModel"]}'` to specify the use of the `GteNewModel` architecture.

!!! note
    `intfloat/multilingual-e5-large` supports **long text embedding** with chunked processing. When the input exceeds the model's maximum length, vLLM automatically splits it into chunks, processes them separately, and aggregates the results. Enable this feature with `--override-pooler-config '{"pooling_type": "MEAN", "normalize": true, "enable_chunked_processing": true}'`. See the [Chunked Processing section](pooling_models.md#chunked-processing-for-long-text) for more details.

If your model is not in the above list, we will try to automatically convert the model using
[as_embedding_model][vllm.model_executor.models.adapters.as_embedding_model]. By default, the embeddings
of the whole prompt are extracted from the normalized hidden state corresponding to the last token.