[Frontend] Add chunked processing to handle long inputs in embedding models #20837

Open · wants to merge 17 commits into base: main

Commits
5398bbd
Add a chunking processing function that supports long - text embeddin…
x22x22 Jul 11, 2025
2b80b14
Rectify the code formatting issues, disable yapf to prevent conflicts…
x22x22 Jul 11, 2025
b7e10b8
Optimize the embedding processing logic, add checks for text token pr…
x22x22 Jul 11, 2025
39d2abd
Added multiple long-text batch processing tests to verify the uniquen…
x22x22 Jul 11, 2025
327f700
Added multiple long-text batch processing tests to verify the uniquen…
x22x22 Jul 11, 2025
85c28b9
Rectify the numbering errors in the document by changing the number o…
x22x22 Jul 11, 2025
f36047d
Update the long - text service script. Add a new variable named MODEL…
x22x22 Jul 11, 2025
da81267
Multiple long - text batch processing tests have been newly added to …
x22x22 Jul 11, 2025
5573882
Update the documentation and examples to support the new `max_embed_l…
x22x22 Jul 13, 2025
4cbcf90
Update the example code to support the new `max_embed_len` parameter,…
x22x22 Jul 13, 2025
a5432ac
The documentation and examples have been updated to support the enhan…
x22x22 Jul 14, 2025
d7924b9
fix(embedding): optimize LAST/CLS pooling in chunked processing
x22x22 Jul 15, 2025
b2116bd
fix: implement online aggregation for chunked embedding processing
x22x22 Jul 15, 2025
6e5d8ee
fix pre-commit errors
x22x22 Jul 15, 2025
4eb3bef
Update the documentation and examples to support the enhanced chunk p…
x22x22 Jul 18, 2025
681e39d
Merge main into feat/support-long-text-embedding - resolve conflicts
x22x22 Jul 18, 2025
2a39548
In the EmbeddingMixin class, add validation for pooling parameters to…
x22x22 Jul 20, 2025
173 changes: 172 additions & 1 deletion docs/models/pooling_models.md
@@ -32,6 +32,177 @@ we attempt to override the default pooler based on its Sentence Transformers con
You can customize the model's pooling method via the `--override-pooler-config` option,
which takes priority over both the model's and Sentence Transformers's defaults.

## Chunked Processing for Long Text

vLLM supports **chunked processing** for embedding models to handle text inputs that exceed the model's maximum token length. This feature automatically splits long text into manageable chunks, processes them separately, and aggregates the results.

### Supported Models

Chunked processing is supported for the following embedding models:

- `intfloat/multilingual-e5-large` (recommended pooling type: `MEAN`)
- `jinaai/jina-embeddings-v3` (recommended pooling type: `MEAN`)
- `jinaai/jina-embeddings-v4-vllm-retrieval` (recommended pooling type: `MEAN`)
- `Qwen/Qwen3-Embedding-4B` (recommended pooling type: `MEAN`)

Other embedding models can be extended to support this feature by ensuring proper pooling type compatibility.

### How Chunked Processing Works

1. **Flexible Input Validation**: Configure `max_embed_len` to accept inputs longer than `max_model_len` without environment variables
2. **Smart Chunking**: Text is split based on `max_position_embeddings` to maintain semantic integrity
3. **Parallel Processing**: Each chunk is processed independently through the model
4. **Intelligent Aggregation**: Results are combined using weighted averaging based on chunk token counts
5. **Consistent Output**: Final embeddings maintain the same dimensionality as standard processing
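
The flow above can be sketched in a few lines of Python. This is an illustrative outline only, not vLLM's internal implementation; `chunked_embed`, `embed_chunk`, and `max_chunk_size` are hypothetical names standing in for the real components.

```python
import numpy as np

# Illustrative sketch of the steps above -- not vLLM's internal code.
# `embed_chunk` is a hypothetical stand-in for one forward pass over a chunk.
def chunked_embed(token_ids, max_chunk_size, embed_chunk):
    # 2. Split the token sequence into chunks that fit the position limit
    chunks = [token_ids[i:i + max_chunk_size]
              for i in range(0, len(token_ids), max_chunk_size)]

    # 3. Embed each chunk independently
    embeddings = [np.asarray(embed_chunk(chunk)) for chunk in chunks]
    weights = [len(chunk) for chunk in chunks]

    # 4. Token-count-weighted average (MEAN pooling aggregation)
    weighted_sum = sum(e * w for e, w in zip(embeddings, weights))
    final = weighted_sum / sum(weights)

    # 5. Same dimensionality as a single-chunk embedding; re-normalized here,
    #    assuming the pooler is configured with "normalize": true
    return final / np.linalg.norm(final)
```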

### Configuration

Enable chunked processing and configure maximum embedding input length:

```bash
# MEAN pooling (recommended for chunked processing)
vllm serve intfloat/multilingual-e5-large \
    --task embed \
    --override-pooler-config '{"pooling_type": "MEAN", "normalize": true, "enable_chunked_processing": true, "max_embed_len": 3072000}' \
    --trust-remote-code

# CLS pooling (processes only first chunk)
vllm serve BAAI/bge-large-en-v1.5 \
    --task embed \
    --override-pooler-config '{"pooling_type": "CLS", "normalize": true, "enable_chunked_processing": true, "max_embed_len": 1048576, "allow_non_mean_chunking": true}' \
    --trust-remote-code
```

#### Configuration Parameters

- `enable_chunked_processing`: Enable chunked processing for long inputs (default: `false`)
- `max_embed_len`: Maximum input length allowed for embedding generation (default: `null`)
- When set, allows inputs longer than `max_model_len` without requiring `VLLM_ALLOW_LONG_MAX_MODEL_LEN`
- Inputs exceeding `max_embed_len` are rejected with clear error messages
- Chunking is triggered when inputs exceed `max_position_embeddings`
- `allow_non_mean_chunking`: Allow non-MEAN pooling types with chunked processing (default: `false`)
- When `false`: CLS/LAST pooling types show warnings and may be disabled
- When `true`: Explicitly enables CLS/LAST pooling with performance optimizations
- Required to suppress warnings for non-MEAN pooling types
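
The interaction between the three length limits can be summarized as follows. This is a simplified sketch of the behavior described above, with a hypothetical helper name; the actual validation code and error text in vLLM may differ.

```python
def classify_embedding_input(num_tokens: int,
                             max_embed_len: int | None,
                             max_position_embeddings: int) -> str:
    """Simplified sketch of the validation flow described above (not vLLM's exact code)."""
    if max_embed_len is not None and num_tokens > max_embed_len:
        # Rejected outright; the real server returns a descriptive error message
        raise ValueError(
            f"input of {num_tokens} tokens exceeds max_embed_len={max_embed_len}")
    if num_tokens > max_position_embeddings:
        return "chunked"    # handled via chunked processing
    return "standard"       # fits in a single forward pass
```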

### Aggregation Algorithm

Chunked processing uses a different aggregation strategy depending on the pooling type:

#### MEAN Pooling (Recommended)
Uses weighted averaging across all chunks:

```python
# Weighted average: sum(embedding_i * token_count_i) / total_tokens
weighted_sum = sum(embeddings[i] * weights[i] for i in range(num_chunks))
final_embedding = weighted_sum / sum(weights)
```

This ensures that longer chunks contribute proportionally more to the final representation.
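
As a quick sanity check with toy values (not real embeddings), a 512-token chunk and a 100-token chunk are weighted roughly 84% / 16%:

```python
import numpy as np

# Toy example: two chunks of 512 and 100 tokens with 2-D "embeddings"
e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
w1, w2 = 512, 100
final = (e1 * w1 + e2 * w2) / (w1 + w2)
print(final)  # ≈ [0.8366 0.1634] -- the 512-token chunk dominates the result
```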

#### CLS Pooling (Performance Optimized)
Only processes the **first chunk** to avoid computational waste:

```python
# CLS pooling: only the first chunk contains the CLS token
final_embedding = first_chunk_embedding
```

Note: This may lose information from later parts of the text.

#### LAST Pooling (Performance Optimized)
Only processes the **last chunk** to avoid computational waste:

```python
# LAST pooling: only the last chunk contains the final token
final_embedding = last_chunk_embedding
```

Note: This may lose information from earlier parts of the text.

### Performance Characteristics

| Pooling Type | Chunks Processed | Processing Time | Semantic Coverage | Best Use Case |
|--------------|------------------|-----------------|-------------------|---------------|
| **MEAN** | All chunks | Highest (all chunks) | Complete | General purpose, long documents |
| **CLS** | First chunk only | Lowest (1 chunk) | Limited to start | Classification, when start matters |
| **LAST** | Last chunk only | Lowest (1 chunk) | Limited to end | When ending matters |

| Aspect | Short Text (≤ max_position_embeddings) | Long Text (> max_position_embeddings) |
|--------|----------------------------------------|---------------------------------------|
| **Processing Time** | Standard | Varies by pooling type (CLS/LAST: minimal, MEAN: increased) |
| **Memory Usage** | Standard | Reduced (chunks processed separately) |
| **Quality** | Standard | Depends on pooling type and content distribution |
| **Compatibility** | Full | Full (backward compatible) |
| **Input Validation** | Standard max_model_len check | Extended max_embed_len check |

#### Extreme Long Text Support

With the enhanced `max_embed_len` configuration (up to 3M+ tokens), you can process:
- **Complete Documents**: Research papers, legal contracts, technical manuals
- **Large Codebases**: Entire repositories and documentation
- **Books and Literature**: Full chapters or small books
- **Multi-document Analysis**: Combined content for comprehensive understanding

### Example Usage

#### Basic Configuration

```python
from openai import OpenAI

client = OpenAI(
    api_key="your-api-key",
    base_url="http://localhost:31090/v1"
)

# This will automatically use chunked processing for very long text
# max_embed_len=3072000 allows inputs up to 3M+ tokens
response = client.embeddings.create(
input="Very long text that exceeds the model's position embeddings..." * 5000,
model="multilingual-e5-large"
)

print(f"Embedding dimension: {len(response.data[0].embedding)}")
```
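
Because the aggregated embedding keeps the standard dimensionality, it can be compared directly with embeddings of short texts. A small follow-up to the snippet above (reusing the same `client`, `response`, and assumed model name and endpoint):

```python
import numpy as np

query = client.embeddings.create(
    input="What does the document say about chunked processing?",
    model="multilingual-e5-large",
)

doc_vec = np.array(response.data[0].embedding)
query_vec = np.array(query.data[0].embedding)

# Cosine similarity between the chunked long-document embedding and a short query
similarity = doc_vec @ query_vec / (np.linalg.norm(doc_vec) * np.linalg.norm(query_vec))
print(f"Cosine similarity: {similarity:.4f}")
```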

#### Alternative Model Configurations

```bash
# For Jina embeddings v3 (optimized for performance)
vllm serve jinaai/jina-embeddings-v3 \
    --task embed \
    --override-pooler-config '{"pooling_type": "MEAN", "normalize": true, "enable_chunked_processing": true, "max_embed_len": 1048576}' \
    --trust-remote-code

# For Jina embeddings v4 (latest retrieval model)
vllm serve jinaai/jina-embeddings-v4-vllm-retrieval \
    --task embed \
    --override-pooler-config '{"pooling_type": "MEAN", "normalize": true, "enable_chunked_processing": true, "max_embed_len": 2097152}' \
    --trust-remote-code

# For Qwen3 Embedding (large-scale multilingual)
vllm serve Qwen/Qwen3-Embedding-4B \
    --task embed \
    --override-pooler-config '{"pooling_type": "MEAN", "normalize": true, "enable_chunked_processing": true, "max_embed_len": 1572864}' \
    --trust-remote-code
```

### Logging and Monitoring

When chunked processing is active, you'll see informative log messages:

```
INFO: Input length 100000 exceeds max_position_embeddings 512, will use chunked processing
INFO: Split input of 100000 tokens into 196 chunks (max_chunk_size: 512)
```

### Limitations

- **Increased Latency**: Processing multiple chunks takes longer than single-chunk processing
- **Model Support**: Currently limited to specific embedding models
- **Context Boundaries**: Chunking may split related content, though weighted averaging helps preserve overall semantics

## Offline Inference

The [LLM][vllm.LLM] class provides various methods for offline inference.
@@ -170,7 +341,7 @@ vllm serve jinaai/jina-embeddings-v3 --trust-remote-code
You can change the output dimensions of embedding models that support Matryoshka Embeddings by using the dimensions parameter.

```text
curl http://127.0.0.1:8000/v1/embeddings \
curl http://127.0.0.1:31090/v1/embeddings \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
5 changes: 4 additions & 1 deletion docs/models/supported_models.md
@@ -422,7 +422,7 @@ Specified using `--task embed`.
| `GteNewModel` | mGTE-TRM (see note) | `Alibaba-NLP/gte-multilingual-base`, etc. | | | |
| `ModernBertModel` | ModernBERT-based | `Alibaba-NLP/gte-modernbert-base`, etc. | | | |
| `NomicBertModel` | Nomic BERT | `nomic-ai/nomic-embed-text-v1`, `nomic-ai/nomic-embed-text-v2-moe`, `Snowflake/snowflake-arctic-embed-m-long`, etc. | | | |
| `LlamaModel`, `LlamaForCausalLM`, `MistralModel`, etc. | Llama-based | `intfloat/e5-mistral-7b-instruct`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `LlamaModel`, `LlamaForCausalLM`, `MistralModel`, etc. | Llama-based | `intfloat/e5-mistral-7b-instruct`, `intfloat/multilingual-e5-large` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
| `Qwen2Model`, `Qwen2ForCausalLM` | Qwen2-based | `ssmits/Qwen2-7B-Instruct-embed-base` (see note), `Alibaba-NLP/gte-Qwen2-7B-instruct` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
| `Qwen3Model`, `Qwen3ForCausalLM` | Qwen3-based | `Qwen/Qwen3-Embedding-0.6B`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `RobertaModel`, `RobertaForMaskedLM` | RoBERTa-based | `sentence-transformers/all-roberta-large-v1`, etc. | | | |
@@ -441,6 +441,9 @@ Specified using `--task embed`.
!!! note
The second-generation GTE model (mGTE-TRM) is named `NewModel`. The name `NewModel` is too generic, you should set `--hf-overrides '{"architectures": ["GteNewModel"]}'` to specify the use of the `GteNewModel` architecture.

!!! note
    `intfloat/multilingual-e5-large` supports **long text embedding** with chunked processing. When the input exceeds the model's maximum length, vLLM automatically splits it into chunks, processes them separately, and aggregates the results. Enable this feature with `--override-pooler-config '{"pooling_type": "MEAN", "normalize": true, "enable_chunked_processing": true}'`. See the [Chunked Processing section](pooling_models.md#chunked-processing-for-long-text) for more details.

If your model is not in the above list, we will try to automatically convert the model using
[as_embedding_model][vllm.model_executor.models.adapters.as_embedding_model]. By default, the embeddings
of the whole prompt are extracted from the normalized hidden state corresponding to the last token.