[Frontend] Add chunked processing to handle long inputs in embedding models #20837

Open

wants to merge 17 commits into base: main

Changes from 14 commits

17 commits
5398bbd
Add a chunking processing function that supports long-text embeddin…
x22x22 Jul 11, 2025
2b80b14
Rectify the code formatting issues, disable yapf to prevent conflicts…
x22x22 Jul 11, 2025
b7e10b8
Optimize the embedding processing logic, add checks for text token pr…
x22x22 Jul 11, 2025
39d2abd
Added multiple long-text batch processing tests to verify the uniquen…
x22x22 Jul 11, 2025
327f700
Added multiple long-text batch processing tests to verify the uniquen…
x22x22 Jul 11, 2025
85c28b9
Rectify the numbering errors in the document by changing the number o…
x22x22 Jul 11, 2025
f36047d
Update the long-text service script. Add a new variable named MODEL…
x22x22 Jul 11, 2025
da81267
Multiple long-text batch processing tests have been newly added to …
x22x22 Jul 11, 2025
5573882
Update the documentation and examples to support the new `max_embed_l…
x22x22 Jul 13, 2025
4cbcf90
Update the example code to support the new `max_embed_len` parameter,…
x22x22 Jul 13, 2025
a5432ac
The documentation and examples have been updated to support the enhan…
x22x22 Jul 14, 2025
d7924b9
fix(embedding): optimize LAST/CLS pooling in chunked processing
x22x22 Jul 15, 2025
b2116bd
fix: implement online aggregation for chunked embedding processing
x22x22 Jul 15, 2025
6e5d8ee
fix pre-commit errors
x22x22 Jul 15, 2025
4eb3bef
Update the documentation and examples to support the enhanced chunk p…
x22x22 Jul 18, 2025
681e39d
Merge main into feat/support-long-text-embedding - resolve conflicts
x22x22 Jul 18, 2025
2a39548
In the EmbeddingMixin class, add validation for pooling parameters to…
x22x22 Jul 20, 2025

133 changes: 132 additions & 1 deletion docs/models/pooling_models.md
@@ -32,6 +32,137 @@ we attempt to override the default pooler based on its Sentence Transformers con
You can customize the model's pooling method via the `--override-pooler-config` option,
which takes priority over both the model's and Sentence Transformers's defaults.

## Chunked Processing for Long Text

vLLM supports **chunked processing** for embedding models to handle text inputs that exceed the model's maximum token length. This feature automatically splits long text into manageable chunks, processes them separately, and aggregates the results.

### Supported Models

Chunked processing is supported for the following embedding models:

- `intfloat/multilingual-e5-large` (Recommended pooling type: `MEAN`)
- `jinaai/jina-embeddings-v3` (Recommended pooling type: `MEAN`)
- `jinaai/jina-embeddings-v4-vllm-retrieval` (Recommended pooling type: `MEAN`)
- `Qwen/Qwen3-Embedding-4B` (Recommended pooling type: `MEAN`)

Other embedding models can be extended to support this feature, provided their pooling type is compatible with chunk-level aggregation.

### How Chunked Processing Works

1. **Flexible Input Validation**: Configure `max_embed_len` to accept inputs longer than `max_model_len` without requiring the `VLLM_ALLOW_LONG_MAX_MODEL_LEN` environment variable
2. **Smart Chunking**: Text is split based on `max_position_embeddings` to maintain semantic integrity (see the sketch after this list)
3. **Independent Processing**: Each chunk is processed separately through the model
4. **Intelligent Aggregation**: Results are combined using weighted averaging based on chunk token counts
5. **Consistent Output**: Final embeddings maintain the same dimensionality as standard processing
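
The chunk-splitting step can be pictured with a short sketch. This is an illustration only, not the actual vLLM implementation; the function name `split_into_chunks` and the fixed-size splitting strategy are assumptions made for clarity.

```python
def split_into_chunks(token_ids: list[int], max_chunk_size: int) -> list[list[int]]:
    """Split a token sequence into consecutive chunks of at most max_chunk_size tokens."""
    return [
        token_ids[i:i + max_chunk_size]
        for i in range(0, len(token_ids), max_chunk_size)
    ]

# Example: a 100,000-token input with max_position_embeddings=512 yields 196 chunks,
# matching the "Split input of 100000 tokens into 196 chunks" log line shown below.
chunks = split_into_chunks(list(range(100_000)), max_chunk_size=512)
print(len(chunks))  # 196
```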

### Configuration

Enable chunked processing and configure maximum embedding input length:

```bash
vllm serve intfloat/multilingual-e5-large \
  --task embed \
  --override-pooler-config '{"pooling_type": "MEAN", "normalize": true, "enable_chunked_processing": true, "max_embed_len": 3072000}' \
  --trust-remote-code
```

#### Configuration Parameters

- `enable_chunked_processing`: Enable chunked processing for long inputs (default: `false`)
- `max_embed_len`: Maximum input length allowed for embedding generation (default: `null`)
    - When set, allows inputs longer than `max_model_len` without requiring `VLLM_ALLOW_LONG_MAX_MODEL_LEN`
    - Inputs exceeding `max_embed_len` are rejected with clear error messages
    - Chunking is triggered when inputs exceed `max_position_embeddings`
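
The interplay of these limits can be summarized as a small decision sketch. This is illustrative pseudocode, not vLLM's actual validation path; the function and parameter names are assumptions.

```python
def classify_request(num_tokens: int,
                     max_position_embeddings: int,
                     max_embed_len: int | None) -> str:
    """Illustrative classification of an embedding request against the configured limits."""
    if max_embed_len is not None and num_tokens > max_embed_len:
        return "rejected"   # a clear error message is returned to the client
    if num_tokens > max_position_embeddings:
        return "chunked"    # chunked processing is triggered
    return "standard"       # normal single-pass embedding
```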

### Aggregation Algorithm

Chunked processing uses a FastChat-inspired weighted-averaging algorithm:

```python
# Weighted average: sum(embedding_i * token_count_i) / total_tokens
weighted_sum = sum(embeddings[i] * weights[i] for i in range(num_chunks))
final_embedding = weighted_sum / sum(weights)
```

This ensures that longer chunks contribute proportionally more to the final representation.
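
A tiny worked example under assumed values (two chunks of 512 and 100 tokens, 3-dimensional vectors for brevity, final L2 normalization omitted) makes this concrete:

```python
import numpy as np

embeddings = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]  # per-chunk embeddings
weights = [512, 100]                                                  # per-chunk token counts

weighted_sum = sum(e * w for e, w in zip(embeddings, weights))
final_embedding = weighted_sum / sum(weights)
print(final_embedding)  # ~[0.837, 0.163, 0.0]: the 512-token chunk dominates
```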

### Performance Characteristics

| Aspect | Short Text (≤ max_position_embeddings) | Long Text (> max_position_embeddings) |
|--------|----------------------------------------|---------------------------------------|
| **Processing Time** | Standard | Increased (multiple inference calls) |
| **Memory Usage** | Standard | Reduced (chunks processed separately) |
| **Quality** | Standard | Maintains semantic representation |
| **Compatibility** | Full | Full (backward compatible) |
| **Input Validation** | Standard max_model_len check | Extended max_embed_len check |

#### Extreme Long Text Support

With the enhanced `max_embed_len` configuration (up to 3M+ tokens), you can process:

- **Complete Documents**: Research papers, legal contracts, technical manuals
- **Large Codebases**: Entire repositories and documentation
- **Books and Literature**: Full chapters or small books
- **Multi-document Analysis**: Combined content for comprehensive understanding

### Example Usage

#### Basic Configuration

```python
from openai import OpenAI

client = OpenAI(
    api_key="your-api-key",
    base_url="http://localhost:31090/v1"
)

# This will automatically use chunked processing for very long text
# max_embed_len=3072000 allows inputs up to 3M+ tokens
response = client.embeddings.create(
    input="Very long text that exceeds the model's position embeddings..." * 5000,
    model="intfloat/multilingual-e5-large"
)

print(f"Embedding dimension: {len(response.data[0].embedding)}")
```

#### Alternative Model Configurations

```bash
# For Jina embeddings v3 (optimized for performance)
vllm serve jinaai/jina-embeddings-v3 \
  --task embed \
  --override-pooler-config '{"pooling_type": "MEAN", "normalize": true, "enable_chunked_processing": true, "max_embed_len": 1048576}' \
  --trust-remote-code

# For Jina embeddings v4 (latest retrieval model)
vllm serve jinaai/jina-embeddings-v4-vllm-retrieval \
  --task embed \
  --override-pooler-config '{"pooling_type": "MEAN", "normalize": true, "enable_chunked_processing": true, "max_embed_len": 2097152}' \
  --trust-remote-code

# For Qwen3 Embedding (large-scale multilingual)
vllm serve Qwen/Qwen3-Embedding-4B \
  --task embed \
  --override-pooler-config '{"pooling_type": "MEAN", "normalize": true, "enable_chunked_processing": true, "max_embed_len": 1572864}' \
  --trust-remote-code
```

### Logging and Monitoring

When chunked processing is active, you'll see informative log messages:

```
INFO: Input length 100000 exceeds max_position_embeddings 512, will use chunked processing
INFO: Split input of 100000 tokens into 196 chunks (max_chunk_size: 512)
```

### Limitations

- **Increased Latency**: Processing multiple chunks takes longer than single-chunk processing
- **Model Support**: Currently limited to specific embedding models
- **Context Boundaries**: Chunking may split related content, though weighted averaging helps preserve overall semantics

## Offline Inference

The [LLM][vllm.LLM] class provides various methods for offline inference.
@@ -170,7 +301,7 @@ vllm serve jinaai/jina-embeddings-v3 --trust-remote-code
You can change the output dimensions of embedding models that support Matryoshka Embeddings by using the dimensions parameter.

```text
curl http://127.0.0.1:8000/v1/embeddings \
curl http://127.0.0.1:31090/v1/embeddings \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{

5 changes: 4 additions & 1 deletion docs/models/supported_models.md

@@ -418,7 +418,7 @@ Specified using `--task embed`.
| `GteNewModel` | mGTE-TRM (see note) | `Alibaba-NLP/gte-multilingual-base`, etc. | | | |
| `ModernBertModel` | ModernBERT-based | `Alibaba-NLP/gte-modernbert-base`, etc. | | | |
| `NomicBertModel` | Nomic BERT | `nomic-ai/nomic-embed-text-v1`, `nomic-ai/nomic-embed-text-v2-moe`, `Snowflake/snowflake-arctic-embed-m-long`, etc. | | | |
| `LlamaModel`, `LlamaForCausalLM`, `MistralModel`, etc. | Llama-based | `intfloat/e5-mistral-7b-instruct`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `LlamaModel`, `LlamaForCausalLM`, `MistralModel`, etc. | Llama-based | `intfloat/e5-mistral-7b-instruct`, `intfloat/multilingual-e5-large` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
| `Qwen2Model`, `Qwen2ForCausalLM` | Qwen2-based | `ssmits/Qwen2-7B-Instruct-embed-base` (see note), `Alibaba-NLP/gte-Qwen2-7B-instruct` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
| `Qwen3Model`, `Qwen3ForCausalLM` | Qwen3-based | `Qwen/Qwen3-Embedding-0.6B`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `RobertaModel`, `RobertaForMaskedLM` | RoBERTa-based | `sentence-transformers/all-roberta-large-v1`, etc. | | | |
@@ -437,6 +437,9 @@ Specified using `--task embed`.
!!! note
    The second-generation GTE model (mGTE-TRM) is named `NewModel`. The name `NewModel` is too generic, so you should set `--hf-overrides '{"architectures": ["GteNewModel"]}'` to specify the use of the `GteNewModel` architecture.

!!! note
    `intfloat/multilingual-e5-large` supports **long text embedding** with chunked processing. When the input exceeds the model's maximum length, it is automatically split into chunks, each chunk is processed separately, and the results are aggregated. Enable this feature with `--override-pooler-config '{"pooling_type": "MEAN", "normalize": true, "enable_chunked_processing": true}'`. See the [Chunked Processing section](pooling_models.md#chunked-processing-for-long-text) for more details.

If your model is not in the above list, we will try to automatically convert the model using
[as_embedding_model][vllm.model_executor.models.adapters.as_embedding_model]. By default, the embeddings
of the whole prompt are extracted from the normalized hidden state corresponding to the last token.

179 changes: 179 additions & 0 deletions examples/online_serving/openai_embedding_long_text.md

@@ -0,0 +1,179 @@
# Long Text Embedding with Chunked Processing

This directory contains examples for using vLLM's **chunked processing** feature to generate embeddings for long text that exceeds the model's maximum context length.

## 🚀 Quick Start

### 1. Start the Server

Use the provided script to start a vLLM server with chunked processing enabled:

```bash
# Basic usage (supports very long texts up to ~3M tokens)
./openai_embedding_long_text_service.sh

# Custom configuration with different models
MODEL_NAME="jinaai/jina-embeddings-v3" \
MAX_EMBED_LEN=1048576 \
./openai_embedding_long_text_service.sh

# For extremely long documents
MODEL_NAME="intfloat/multilingual-e5-large" \
MAX_EMBED_LEN=3072000 \
./openai_embedding_long_text_service.sh
```

### 2. Test Long Text Embedding

Run the comprehensive test client:

```bash
python openai_embedding_long_text_client.py
```

## 📁 Files

| File | Description |
|------|-------------|
| `openai_embedding_long_text_service.sh` | Server startup script with chunked processing enabled |
| `openai_embedding_long_text_client.py` | Comprehensive test client for long text embedding |
| `openai_embedding_client.py` | Basic embedding client (updated with chunked processing info) |

## ⚙️ Configuration

### Server Configuration

The key parameters for chunked processing are set via `--override-pooler-config`:

```json
{
  "pooling_type": "MEAN",
  "normalize": true,
  "enable_chunked_processing": true,
  "max_embed_len": 3072000
}
```

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_NAME` | `intfloat/multilingual-e5-large` | Embedding model to use (supports multiple models) |
| `PORT` | `31090` | Server port |
| `GPU_COUNT` | `1` | Number of GPUs to use |
| `MAX_EMBED_LEN` | `3072000` | Maximum embedding input length (supports very long documents) |
| `API_KEY` | `EMPTY` | API key for authentication |

## 🔧 How It Works

1. **Enhanced Input Validation**: `max_embed_len` allows accepting inputs longer than `max_model_len` without requiring the `VLLM_ALLOW_LONG_MAX_MODEL_LEN` environment variable
2. **Smart Chunking**: Text is split based on `max_position_embeddings` to maintain semantic integrity
3. **Independent Processing**: Each chunk is processed separately through the model
4. **Weighted Aggregation**: Results are combined using token count-based weighted averaging
5. **Consistent Output**: Final embeddings maintain the same dimensionality as standard processing
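
Put together, these steps correspond roughly to the sketch below. It is illustrative only; the function names and the final L2-normalization step (implied by `normalize: true`) are assumptions, not the actual vLLM code path.

```python
import numpy as np

def embed_long_text(token_ids, max_chunk_size, embed_fn):
    """Illustrative end-to-end chunked embedding: split, embed, weighted-average, normalize."""
    chunks = [token_ids[i:i + max_chunk_size]
              for i in range(0, len(token_ids), max_chunk_size)]
    embeddings = [embed_fn(chunk) for chunk in chunks]  # one inference call per chunk
    weights = [len(chunk) for chunk in chunks]          # token counts as aggregation weights
    pooled = sum(e * w for e, w in zip(embeddings, weights)) / sum(weights)
    return pooled / np.linalg.norm(pooled)              # L2-normalize the final embedding
```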

### Input Length Handling

- **Within max_embed_len**: Input is accepted and processed (up to 3M+ tokens)
- **Exceeds max_position_embeddings**: Chunked processing is automatically triggered
- **Exceeds max_embed_len**: Input is rejected with a clear error message
- **No environment variables required**: Works without `VLLM_ALLOW_LONG_MAX_MODEL_LEN`

### Extreme Long Text Support

With `MAX_EMBED_LEN=3072000`, you can process:

- **Academic papers**: Full research papers with references
- **Legal documents**: Complete contracts and legal texts
- **Books**: Entire chapters or small books
- **Code repositories**: Large codebases and documentation

## 📊 Performance Characteristics

| Text Length | Processing Method | Memory Usage | Speed |
|-------------|------------------|--------------|-------|
| ≤ max_position_embeddings | Standard | Normal | Fast |
| > max_position_embeddings, ≤ max_embed_len | Chunked | Reduced per chunk | Slower (multiple inferences) |
| > max_embed_len | Rejected | N/A | Error response |

## 🧪 Test Cases

The test client demonstrates:

- ✅ **Short text**: Normal processing (baseline)
- ✅ **Medium text**: Single chunk processing
- ✅ **Long text**: Multi-chunk processing with aggregation
- ✅ **Very long text**: Many chunks processing
- ✅ **Extreme long text**: Document-level processing (100K+ tokens)
- ✅ **Batch processing**: Mixed-length inputs in one request
- ✅ **Consistency**: Reproducible results across runs
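
A minimal client run covering the batch case might look like the following, assuming the service script's defaults (port `31090`, API key `EMPTY`, model `intfloat/multilingual-e5-large`); the dedicated `openai_embedding_long_text_client.py` script covers the full test matrix.

```python
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:31090/v1")

# Mixed-length batch: the last input exceeds max_position_embeddings and
# triggers chunked processing on the server side.
inputs = [
    "A short sentence.",
    "A medium-length paragraph about embeddings. " * 200,
    "A very long document that exceeds the model's context window. " * 20000,
]
response = client.embeddings.create(
    input=inputs,
    model="intfloat/multilingual-e5-large",
)
for i, item in enumerate(response.data):
    print(f"input {i}: embedding dimension {len(item.embedding)}")
```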

## 🐛 Troubleshooting

### Common Issues

1. **Chunked processing not enabled**:

    ```
    ValueError: This model's maximum position embeddings length is 4096 tokens...
    ```

    **Solution**: Ensure `enable_chunked_processing: true` in pooler config

2. **Input exceeds max_embed_len**:

    ```
    ValueError: This model's maximum embedding input length is 3072000 tokens...
    ```

    **Solution**: Increase `max_embed_len` in pooler config or reduce input length

3. **Memory errors**:

    ```
    RuntimeError: CUDA out of memory
    ```

    **Solution**: Reduce chunk size by adjusting model's `max_position_embeddings` or use fewer GPUs

4. **Slow processing**:

    **Expected**: Long text takes more time due to multiple inference calls

### Debug Information

Server logs show chunked processing activity:

```
INFO: Input length 150000 exceeds max_position_embeddings 4096, will use chunked processing
INFO: Split input of 150000 tokens into 37 chunks (max_chunk_size: 4096)
```

## 📚 Additional Resources

- [Pooling Models Documentation](../../docs/models/pooling_models.md#chunked-processing-for-long-text)
- [Supported Models List](../../docs/models/supported_models.md#text-embedding)
- [Original Feature Documentation](../../README_CHUNKED_PROCESSING.md)

## 🤝 Contributing

To extend chunked processing support to other embedding models:

1. Check model compatibility with the pooling architecture
2. Test with various text lengths
3. Validate embedding quality compared to single-chunk processing
4. Submit PR with test cases and documentation updates

## 🆕 Enhanced Features

### max_embed_len Parameter

The new `max_embed_len` parameter provides:

- **Simplified Configuration**: No need for `VLLM_ALLOW_LONG_MAX_MODEL_LEN` environment variable
- **Flexible Input Validation**: Accept inputs longer than `max_model_len` up to `max_embed_len`
- **Extreme Length Support**: Process documents with millions of tokens
- **Clear Error Messages**: Better feedback when inputs exceed limits
- **Backward Compatibility**: Existing configurations continue to work

---

**Note**: Chunked processing is currently supported for specific embedding models. See the [chunked processing documentation](../../docs/models/pooling_models.md#chunked-processing-for-long-text) for the complete list of supported models.