[Frontend] Add chunked processing to handle long inputs in embedding models #20837

Open

wants to merge 17 commits into base: main

Changes from 14 commits

17 commits
5398bbd
Add a chunking processing function that supports long-text embeddin…
x22x22 Jul 11, 2025
2b80b14
Rectify the code formatting issues, disable yapf to prevent conflicts…
x22x22 Jul 11, 2025
b7e10b8
Optimize the embedding processing logic, add checks for text token pr…
x22x22 Jul 11, 2025
39d2abd
Added multiple long-text batch processing tests to verify the uniquen…
x22x22 Jul 11, 2025
327f700
Added multiple long-text batch processing tests to verify the uniquen…
x22x22 Jul 11, 2025
85c28b9
Rectify the numbering errors in the document by changing the number o…
x22x22 Jul 11, 2025
f36047d
Update the long-text service script. Add a new variable named MODEL…
x22x22 Jul 11, 2025
da81267
Multiple long-text batch processing tests have been newly added to …
x22x22 Jul 11, 2025
5573882
Update the documentation and examples to support the new `max_embed_l…
x22x22 Jul 13, 2025
4cbcf90
Update the example code to support the new `max_embed_len` parameter,…
x22x22 Jul 13, 2025
a5432ac
The documentation and examples have been updated to support the enhan…
x22x22 Jul 14, 2025
d7924b9
fix(embedding): optimize LAST/CLS pooling in chunked processing
x22x22 Jul 15, 2025
b2116bd
fix: implement online aggregation for chunked embedding processing
x22x22 Jul 15, 2025
6e5d8ee
fix pre-commit errors
x22x22 Jul 15, 2025
4eb3bef
Update the documentation and examples to support the enhanced chunk p…
x22x22 Jul 18, 2025
681e39d
Merge main into feat/support-long-text-embedding - resolve conflicts
x22x22 Jul 18, 2025
2a39548
In the EmbeddingMixin class, add validation for pooling parameters to…
x22x22 Jul 20, 2025

133 changes: 132 additions & 1 deletion docs/models/pooling_models.md
@@ -32,6 +32,137 @@ we attempt to override the default pooler based on its Sentence Transformers con
You can customize the model's pooling method via the `--override-pooler-config` option,
which takes priority over both the model's and Sentence Transformers's defaults.

## Chunked Processing for Long Text

vLLM supports **chunked processing** for embedding models to handle text inputs that exceed the model's maximum token length. This feature automatically splits long text into manageable chunks, processes them separately, and aggregates the results.

### Supported Models

Chunked processing is supported for the following embedding models:

- `intfloat/multilingual-e5-large` (Recommended pooling type: `MEAN`)
- `jinaai/jina-embeddings-v3` (Recommended pooling type: `MEAN`)
- `jinaai/jina-embeddings-v4-vllm-retrieval` (Recommended pooling type: `MEAN`)
- `Qwen/Qwen3-Embedding-4B` (Recommended pooling type: `MEAN`)

Other embedding models can be extended to support this feature, provided their pooling type is compatible with chunk-level aggregation.

### How Chunked Processing Works

1. **Flexible Input Validation**: Configure `max_embed_len` to accept inputs longer than `max_model_len` without requiring the `VLLM_ALLOW_LONG_MAX_MODEL_LEN` environment variable
2. **Smart Chunking**: Text is split based on `max_position_embeddings` to maintain semantic integrity (see the sketch after this list)
3. **Independent Processing**: Each chunk is processed separately through the model
4. **Intelligent Aggregation**: Results are combined using weighted averaging based on chunk token counts
5. **Consistent Output**: Final embeddings maintain the same dimensionality as standard processing
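
The chunk-splitting step can be pictured with a short sketch. This is an illustration only, not the actual vLLM implementation; the function name `split_into_chunks` and the fixed-size splitting strategy are assumptions made for clarity.

```python
def split_into_chunks(token_ids: list[int], max_chunk_size: int) -> list[list[int]]:
    """Split a token sequence into consecutive chunks of at most max_chunk_size tokens."""
    return [
        token_ids[i:i + max_chunk_size]
        for i in range(0, len(token_ids), max_chunk_size)
    ]

# Example: a 100,000-token input with max_position_embeddings=512 yields 196 chunks,
# matching the "Split input of 100000 tokens into 196 chunks" log line shown below.
chunks = split_into_chunks(list(range(100_000)), max_chunk_size=512)
print(len(chunks))  # 196
```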

### Configuration

Enable chunked processing and configure maximum embedding input length:

```bash
vllm serve intfloat/multilingual-e5-large \
  --task embed \
  --override-pooler-config '{"pooling_type": "MEAN", "normalize": true, "enable_chunked_processing": true, "max_embed_len": 3072000}' \
  --trust-remote-code
```

#### Configuration Parameters

- `enable_chunked_processing`: Enable chunked processing for long inputs (default: `false`)
- `max_embed_len`: Maximum input length allowed for embedding generation (default: `null`)
    - When set, allows inputs longer than `max_model_len` without requiring `VLLM_ALLOW_LONG_MAX_MODEL_LEN`
    - Inputs exceeding `max_embed_len` are rejected with clear error messages
    - Chunking is triggered when inputs exceed `max_position_embeddings`
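
The interplay of these limits can be summarized as a small decision sketch. This is illustrative pseudocode, not vLLM's actual validation path; the function and parameter names are assumptions.

```python
def classify_request(num_tokens: int,
                     max_position_embeddings: int,
                     max_embed_len: int | None) -> str:
    """Illustrative classification of an embedding request against the configured limits."""
    if max_embed_len is not None and num_tokens > max_embed_len:
        return "rejected"   # a clear error message is returned to the client
    if num_tokens > max_position_embeddings:
        return "chunked"    # chunked processing is triggered
    return "standard"       # normal single-pass embedding
```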

### Aggregation Algorithm

Chunked processing uses a FastChat-inspired weighted-averaging algorithm:

```python
# Weighted average: sum(embedding_i * token_count_i) / total_tokens
weighted_sum = sum(embeddings[i] * weights[i] for i in range(num_chunks))
final_embedding = weighted_sum / sum(weights)
```

This ensures that longer chunks contribute proportionally more to the final representation.
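
A tiny worked example under assumed values (two chunks of 512 and 100 tokens, 3-dimensional vectors for brevity, final L2 normalization omitted) makes this concrete:

```python
import numpy as np

embeddings = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]  # per-chunk embeddings
weights = [512, 100]                                                  # per-chunk token counts

weighted_sum = sum(e * w for e, w in zip(embeddings, weights))
final_embedding = weighted_sum / sum(weights)
print(final_embedding)  # ~[0.837, 0.163, 0.0]: the 512-token chunk dominates
```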

### Performance Characteristics

| Aspect | Short Text (≤ max_position_embeddings) | Long Text (> max_position_embeddings) |
|--------|----------------------------------------|---------------------------------------|
| **Processing Time** | Standard | Increased (multiple inference calls) |
| **Memory Usage** | Standard | Reduced (chunks processed separately) |
| **Quality** | Standard | Maintains semantic representation |
| **Compatibility** | Full | Full (backward compatible) |
| **Input Validation** | Standard max_model_len check | Extended max_embed_len check |

#### Extreme Long Text Support

With the enhanced `max_embed_len` configuration (up to 3M+ tokens), you can process:

- **Complete Documents**: Research papers, legal contracts, technical manuals
- **Large Codebases**: Entire repositories and documentation
- **Books and Literature**: Full chapters or small books
- **Multi-document Analysis**: Combined content for comprehensive understanding

### Example Usage

#### Basic Configuration

```python
from openai import OpenAI

client = OpenAI(
    api_key="your-api-key",
    base_url="http://localhost:31090/v1"
)

# This will automatically use chunked processing for very long text
# max_embed_len=3072000 allows inputs up to 3M+ tokens
response = client.embeddings.create(
    input="Very long text that exceeds the model's position embeddings..." * 5000,
    model="intfloat/multilingual-e5-large"
)

print(f"Embedding dimension: {len(response.data[0].embedding)}")
```

#### Alternative Model Configurations

```bash
# For Jina embeddings v3 (optimized for performance)
vllm serve jinaai/jina-embeddings-v3 \
  --task embed \
  --override-pooler-config '{"pooling_type": "MEAN", "normalize": true, "enable_chunked_processing": true, "max_embed_len": 1048576}' \
  --trust-remote-code

# For Jina embeddings v4 (latest retrieval model)
vllm serve jinaai/jina-embeddings-v4-vllm-retrieval \
  --task embed \
  --override-pooler-config '{"pooling_type": "MEAN", "normalize": true, "enable_chunked_processing": true, "max_embed_len": 2097152}' \
  --trust-remote-code

# For Qwen3 Embedding (large-scale multilingual)
vllm serve Qwen/Qwen3-Embedding-4B \
  --task embed \
  --override-pooler-config '{"pooling_type": "MEAN", "normalize": true, "enable_chunked_processing": true, "max_embed_len": 1572864}' \
  --trust-remote-code
```

### Logging and Monitoring

When chunked processing is active, you'll see informative log messages:

```
INFO: Input length 100000 exceeds max_position_embeddings 512, will use chunked processing
INFO: Split input of 100000 tokens into 196 chunks (max_chunk_size: 512)
```

### Limitations

- **Increased Latency**: Processing multiple chunks takes longer than single-chunk processing
- **Model Support**: Currently limited to specific embedding models
- **Context Boundaries**: Chunking may split related content, though weighted averaging helps preserve overall semantics

## Offline Inference

The [LLM][vllm.LLM] class provides various methods for offline inference.
@@ -170,7 +301,7 @@ vllm serve jinaai/jina-embeddings-v3 --trust-remote-code
You can change the output dimensions of embedding models that support Matryoshka Embeddings by using the dimensions parameter.

```text
curl http://127.0.0.1:8000/v1/embeddings \
curl http://127.0.0.1:31090/v1/embeddings \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{

5 changes: 4 additions & 1 deletion docs/models/supported_models.md

@@ -418,7 +418,7 @@ Specified using `--task embed`.
| `GteNewModel` | mGTE-TRM (see note) | `Alibaba-NLP/gte-multilingual-base`, etc. | | | |
| `ModernBertModel` | ModernBERT-based | `Alibaba-NLP/gte-modernbert-base`, etc. | | | |
| `NomicBertModel` | Nomic BERT | `nomic-ai/nomic-embed-text-v1`, `nomic-ai/nomic-embed-text-v2-moe`, `Snowflake/snowflake-arctic-embed-m-long`, etc. | | | |
| `LlamaModel`, `LlamaForCausalLM`, `MistralModel`, etc. | Llama-based | `intfloat/e5-mistral-7b-instruct`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `LlamaModel`, `LlamaForCausalLM`, `MistralModel`, etc. | Llama-based | `intfloat/e5-mistral-7b-instruct`, `intfloat/multilingual-e5-large` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
| `Qwen2Model`, `Qwen2ForCausalLM` | Qwen2-based | `ssmits/Qwen2-7B-Instruct-embed-base` (see note), `Alibaba-NLP/gte-Qwen2-7B-instruct` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
| `Qwen3Model`, `Qwen3ForCausalLM` | Qwen3-based | `Qwen/Qwen3-Embedding-0.6B`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `RobertaModel`, `RobertaForMaskedLM` | RoBERTa-based | `sentence-transformers/all-roberta-large-v1`, etc. | | | |
@@ -437,6 +437,9 @@ Specified using `--task embed`.
!!! note
    The second-generation GTE model (mGTE-TRM) is named `NewModel`. The name `NewModel` is too generic, so you should set `--hf-overrides '{"architectures": ["GteNewModel"]}'` to specify the use of the `GteNewModel` architecture.

!!! note
    `intfloat/multilingual-e5-large` supports **long text embedding** with chunked processing. When the input exceeds the model's maximum length, it is automatically split into chunks, each chunk is processed separately, and the results are aggregated. Enable this feature with `--override-pooler-config '{"pooling_type": "MEAN", "normalize": true, "enable_chunked_processing": true}'`. See the [Chunked Processing section](pooling_models.md#chunked-processing-for-long-text) for more details.

If your model is not in the above list, we will try to automatically convert the model using
[as_embedding_model][vllm.model_executor.models.adapters.as_embedding_model]. By default, the embeddings
of the whole prompt are extracted from the normalized hidden state corresponding to the last token.

179 changes: 179 additions & 0 deletions examples/online_serving/openai_embedding_long_text.md

@@ -0,0 +1,179 @@
# Long Text Embedding with Chunked Processing

This directory contains examples for using vLLM's **chunked processing** feature to generate embeddings for long text that exceeds the model's maximum context length.

## 🚀 Quick Start

### 1. Start the Server

Use the provided script to start a vLLM server with chunked processing enabled:

```bash
# Basic usage (supports very long texts up to ~3M tokens)
./openai_embedding_long_text_service.sh

# Custom configuration with different models
MODEL_NAME="jinaai/jina-embeddings-v3" \
MAX_EMBED_LEN=1048576 \
./openai_embedding_long_text_service.sh

# For extremely long documents
MODEL_NAME="intfloat/multilingual-e5-large" \
MAX_EMBED_LEN=3072000 \
./openai_embedding_long_text_service.sh
```

### 2. Test Long Text Embedding

Run the comprehensive test client:

```bash
python openai_embedding_long_text_client.py
```

## 📁 Files

| File | Description |
|------|-------------|
| `openai_embedding_long_text_service.sh` | Server startup script with chunked processing enabled |
| `openai_embedding_long_text_client.py` | Comprehensive test client for long text embedding |
| `openai_embedding_client.py` | Basic embedding client (updated with chunked processing info) |

## ⚙️ Configuration

### Server Configuration

The key parameters for chunked processing are set via `--override-pooler-config`:

```json
{
  "pooling_type": "MEAN",
  "normalize": true,
  "enable_chunked_processing": true,
  "max_embed_len": 3072000
}
```

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_NAME` | `intfloat/multilingual-e5-large` | Embedding model to use (supports multiple models) |
| `PORT` | `31090` | Server port |
| `GPU_COUNT` | `1` | Number of GPUs to use |
| `MAX_EMBED_LEN` | `3072000` | Maximum embedding input length (supports very long documents) |
| `API_KEY` | `EMPTY` | API key for authentication |

## 🔧 How It Works

1. **Enhanced Input Validation**: `max_embed_len` allows accepting inputs longer than `max_model_len` without requiring the `VLLM_ALLOW_LONG_MAX_MODEL_LEN` environment variable
2. **Smart Chunking**: Text is split based on `max_position_embeddings` to maintain semantic integrity
3. **Independent Processing**: Each chunk is processed separately through the model
4. **Weighted Aggregation**: Results are combined using token count-based weighted averaging
5. **Consistent Output**: Final embeddings maintain the same dimensionality as standard processing
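
Put together, these steps correspond roughly to the sketch below. It is illustrative only; the function names and the final L2-normalization step (implied by `normalize: true`) are assumptions, not the actual vLLM code path.

```python
import numpy as np

def embed_long_text(token_ids, max_chunk_size, embed_fn):
    """Illustrative end-to-end chunked embedding: split, embed, weighted-average, normalize."""
    chunks = [token_ids[i:i + max_chunk_size]
              for i in range(0, len(token_ids), max_chunk_size)]
    embeddings = [embed_fn(chunk) for chunk in chunks]  # one inference call per chunk
    weights = [len(chunk) for chunk in chunks]          # token counts as aggregation weights
    pooled = sum(e * w for e, w in zip(embeddings, weights)) / sum(weights)
    return pooled / np.linalg.norm(pooled)              # L2-normalize the final embedding
```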

### Input Length Handling

- **Within max_embed_len**: Input is accepted and processed (up to 3M+ tokens)
- **Exceeds max_position_embeddings**: Chunked processing is automatically triggered
- **Exceeds max_embed_len**: Input is rejected with a clear error message
- **No environment variables required**: Works without `VLLM_ALLOW_LONG_MAX_MODEL_LEN`

### Extreme Long Text Support

With `MAX_EMBED_LEN=3072000`, you can process:

- **Academic papers**: Full research papers with references
- **Legal documents**: Complete contracts and legal texts
- **Books**: Entire chapters or small books
- **Code repositories**: Large codebases and documentation

## 📊 Performance Characteristics

| Text Length | Processing Method | Memory Usage | Speed |
|-------------|------------------|--------------|-------|
| ≤ max_position_embeddings | Standard | Normal | Fast |
| > max_position_embeddings, ≤ max_embed_len | Chunked | Reduced per chunk | Slower (multiple inferences) |
| > max_embed_len | Rejected | N/A | Error response |

## 🧪 Test Cases

The test client demonstrates:

- ✅ **Short text**: Normal processing (baseline)
- ✅ **Medium text**: Single chunk processing
- ✅ **Long text**: Multi-chunk processing with aggregation
- ✅ **Very long text**: Many chunks processing
- ✅ **Extreme long text**: Document-level processing (100K+ tokens)
- ✅ **Batch processing**: Mixed-length inputs in one request
- ✅ **Consistency**: Reproducible results across runs
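
A minimal client run covering the batch case might look like the following, assuming the service script's defaults (port `31090`, API key `EMPTY`, model `intfloat/multilingual-e5-large`); the dedicated `openai_embedding_long_text_client.py` script covers the full test matrix.

```python
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:31090/v1")

# Mixed-length batch: the last input exceeds max_position_embeddings and
# triggers chunked processing on the server side.
inputs = [
    "A short sentence.",
    "A medium-length paragraph about embeddings. " * 200,
    "A very long document that exceeds the model's context window. " * 20000,
]
response = client.embeddings.create(
    input=inputs,
    model="intfloat/multilingual-e5-large",
)
for i, item in enumerate(response.data):
    print(f"input {i}: embedding dimension {len(item.embedding)}")
```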

## 🐛 Troubleshooting

### Common Issues

1. **Chunked processing not enabled**:

    ```
    ValueError: This model's maximum position embeddings length is 4096 tokens...
    ```

    **Solution**: Ensure `enable_chunked_processing: true` in pooler config

2. **Input exceeds max_embed_len**:

    ```
    ValueError: This model's maximum embedding input length is 3072000 tokens...
    ```

    **Solution**: Increase `max_embed_len` in pooler config or reduce input length

3. **Memory errors**:

    ```
    RuntimeError: CUDA out of memory
    ```

    **Solution**: Reduce chunk size by adjusting model's `max_position_embeddings` or use fewer GPUs

4. **Slow processing**:

    **Expected**: Long text takes more time due to multiple inference calls

### Debug Information

Server logs show chunked processing activity:

```
INFO: Input length 150000 exceeds max_position_embeddings 4096, will use chunked processing
INFO: Split input of 150000 tokens into 37 chunks (max_chunk_size: 4096)
```

## 📚 Additional Resources

- [Pooling Models Documentation](../../docs/models/pooling_models.md#chunked-processing-for-long-text)
- [Supported Models List](../../docs/models/supported_models.md#text-embedding)
- [Original Feature Documentation](../../README_CHUNKED_PROCESSING.md)

## 🤝 Contributing

To extend chunked processing support to other embedding models:

1. Check model compatibility with the pooling architecture
2. Test with various text lengths
3. Validate embedding quality compared to single-chunk processing
4. Submit PR with test cases and documentation updates

## 🆕 Enhanced Features

### max_embed_len Parameter

The new `max_embed_len` parameter provides:

- **Simplified Configuration**: No need for `VLLM_ALLOW_LONG_MAX_MODEL_LEN` environment variable
- **Flexible Input Validation**: Accept inputs longer than `max_model_len` up to `max_embed_len`
- **Extreme Length Support**: Process documents with millions of tokens
- **Clear Error Messages**: Better feedback when inputs exceed limits
- **Backward Compatibility**: Existing configurations continue to work

---

**Note**: Chunked processing is currently supported for specific embedding models. See the [chunked processing documentation](../../docs/models/pooling_models.md#chunked-processing-for-long-text) for the complete list of supported models.