
Commit 5398bbd

Add a chunked processing function that supports long-text embedding, and update the relevant documentation and examples. New example scripts and service startup scripts demonstrate how to configure and use chunked processing. Update the model configuration to support long-text processing and implement the chunked processing logic in the code.
Signed-off-by: x22x22 <wadeking@qq.com>
1 parent 53fa457 commit 5398bbd

File tree

7 files changed (+966, -4 lines)


docs/models/pooling_models.md

Lines changed: 85 additions & 1 deletion
@@ -32,6 +32,90 @@ we attempt to override the default pooler based on its Sentence Transformers con
You can customize the model's pooling method via the `--override-pooler-config` option,
which takes priority over both the model's and Sentence Transformers's defaults.

## Chunked Processing for Long Text

vLLM supports **chunked processing** for embedding models to handle text inputs that exceed the model's maximum token length. This feature automatically splits long text into manageable chunks, processes them separately, and aggregates the results.

### Supported Models

- `intfloat/multilingual-e5-large`
- Other embedding models can be extended to support this feature

### How Chunked Processing Works

1. **Automatic Detection**: When input text exceeds `max_model_len`, chunked processing is triggered
2. **Smart Chunking**: Text is split at token boundaries to maintain semantic integrity (see the sketch below)
3. **Parallel Processing**: Each chunk is processed independently through the model
4. **Intelligent Aggregation**: Results are combined using weighted averaging based on chunk token counts
5. **Consistent Output**: Final embeddings maintain the same dimensionality as standard processing
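
To make steps 1–2 concrete, here is a minimal sketch of token-boundary chunking, assuming the model's Hugging Face tokenizer. It illustrates the idea only and is not vLLM's actual implementation; `split_into_chunks` and the standalone `max_model_len` variable are illustrative names.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-large")
max_model_len = 10240  # should match the value passed to --max-model-len


def split_into_chunks(text: str) -> list[list[int]]:
    """Split the tokenized input into consecutive chunks of at most max_model_len tokens."""
    token_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    if len(token_ids) <= max_model_len:
        return [token_ids]  # short input: chunked processing is not triggered
    return [token_ids[i:i + max_model_len]
            for i in range(0, len(token_ids), max_model_len)]
```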

### Configuration

Enable chunked processing by setting `enable_chunked_processing: true` in the pooler configuration:

```bash
vllm serve intfloat/multilingual-e5-large \
  --task embed \
  --override-pooler-config '{"pooling_type": "CLS", "normalize": true, "enable_chunked_processing": true}' \
  --max-model-len 10240 \
  --trust-remote-code
```

### Aggregation Algorithm

The chunked processing uses a FastChat-inspired weighted averaging algorithm:

```python
# Weighted average: sum(embedding_i * token_count_i) / total_tokens
# weights[i] is the token count of chunk i; embeddings[i] is its pooled vector
weighted_sum = sum(embeddings[i] * weights[i] for i in range(num_chunks))
final_embedding = weighted_sum / sum(weights)
```

This ensures that longer chunks contribute proportionally more to the final representation.
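
For example, an 8,000-token chunk carries four times the weight of a 2,000-token chunk. A minimal numeric sketch (the embedding values below are invented purely for illustration):

```python
import numpy as np

# Hypothetical pooled embeddings for two chunks, kept 2-dimensional for readability
chunk_embeddings = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
token_counts = [8000, 2000]  # the weights are the per-chunk token counts

weighted_sum = sum(emb * count for emb, count in zip(chunk_embeddings, token_counts))
final_embedding = weighted_sum / sum(token_counts)
print(final_embedding)  # [0.8 0.2] -- the longer chunk dominates
```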

### Performance Characteristics

| Aspect | Short Text (≤ max_len) | Long Text (> max_len) |
|--------|------------------------|-----------------------|
| **Processing Time** | Standard | Increased (multiple inference calls) |
| **Memory Usage** | Standard | Reduced (chunks processed separately) |
| **Quality** | Standard | Maintains semantic representation |
| **Compatibility** | Full | Full (backward compatible) |

### Example Usage

```python
from openai import OpenAI

client = OpenAI(
    api_key="your-api-key",
    base_url="http://localhost:31090/v1"
)

# This will automatically use chunked processing if text is too long
response = client.embeddings.create(
    input="Very long text that exceeds the model's maximum context length..." * 1000,
    model="multilingual-e5-large"
)

print(f"Embedding dimension: {len(response.data[0].embedding)}")
```

### Logging and Monitoring

When chunked processing is active, you'll see informative log messages:

```
INFO: Input length 15000 exceeds max_model_len 10240, will use chunked processing
INFO: Split input of 15000 tokens into 2 chunks
```

### Limitations

- **Increased Latency**: Processing multiple chunks takes longer than single-chunk processing
- **Model Support**: Currently limited to specific embedding models
- **Context Boundaries**: Chunking may split related content, though weighted averaging helps preserve overall semantics

## Offline Inference

The [LLM][vllm.LLM] class provides various methods for offline inference.
@@ -170,7 +254,7 @@ vllm serve jinaai/jina-embeddings-v3 --trust-remote-code
You can change the output dimensions of embedding models that support Matryoshka Embeddings by using the `dimensions` parameter.

```text
-curl http://127.0.0.1:8000/v1/embeddings \
+curl http://127.0.0.1:31090/v1/embeddings \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
```

docs/models/supported_models.md

Lines changed: 4 additions & 1 deletion
@@ -418,7 +418,7 @@ Specified using `--task embed`.
| `GteNewModel` | mGTE-TRM (see note) | `Alibaba-NLP/gte-multilingual-base`, etc. | | | |
| `ModernBertModel` | ModernBERT-based | `Alibaba-NLP/gte-modernbert-base`, etc. | | | |
| `NomicBertModel` | Nomic BERT | `nomic-ai/nomic-embed-text-v1`, `nomic-ai/nomic-embed-text-v2-moe`, `Snowflake/snowflake-arctic-embed-m-long`, etc. | | | |
-| `LlamaModel`, `LlamaForCausalLM`, `MistralModel`, etc. | Llama-based | `intfloat/e5-mistral-7b-instruct`, etc. | ✅︎ | ✅︎ | ✅︎ |
+| `LlamaModel`, `LlamaForCausalLM`, `MistralModel`, etc. | Llama-based | `intfloat/e5-mistral-7b-instruct`, `intfloat/multilingual-e5-large` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
| `Qwen2Model`, `Qwen2ForCausalLM` | Qwen2-based | `ssmits/Qwen2-7B-Instruct-embed-base` (see note), `Alibaba-NLP/gte-Qwen2-7B-instruct` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
| `Qwen3Model`, `Qwen3ForCausalLM` | Qwen3-based | `Qwen/Qwen3-Embedding-0.6B`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `RobertaModel`, `RobertaForMaskedLM` | RoBERTa-based | `sentence-transformers/all-roberta-large-v1`, etc. | | | |
@@ -437,6 +437,9 @@ Specified using `--task embed`.
!!! note
    The second-generation GTE model (mGTE-TRM) is named `NewModel`. The name `NewModel` is too generic, so you should set `--hf-overrides '{"architectures": ["GteNewModel"]}'` to specify the use of the `GteNewModel` architecture.

!!! note
    `intfloat/multilingual-e5-large` supports **long text embedding** with chunked processing. When input text exceeds the model's maximum length, the model automatically splits the input into chunks, processes them separately, and then aggregates the results. Enable this feature with `--override-pooler-config '{"pooling_type": "CLS", "normalize": true, "enable_chunked_processing": true}'`. See the [Chunked Processing section](pooling_models.md#chunked-processing-for-long-text) for more details.

If your model is not in the above list, we will try to automatically convert the model using
[as_embedding_model][vllm.model_executor.models.adapters.as_embedding_model]. By default, the embeddings
of the whole prompt are extracted from the normalized hidden state corresponding to the last token.
Lines changed: 137 additions & 0 deletions
@@ -0,0 +1,137 @@

# Long Text Embedding with Chunked Processing

This directory contains examples for using vLLM's **chunked processing** feature to handle long text embedding that exceeds the model's maximum context length.

## 🚀 Quick Start

### 1. Start the Server

Use the provided script to start a vLLM server with chunked processing enabled:

```bash
# Basic usage
./openai_embedding_long_text_service.sh

# Custom configuration
MODEL_NAME="intfloat/multilingual-e5-large" \
PORT=31090 \
MAX_MODEL_LEN=10240 \
./openai_embedding_long_text_service.sh
```

### 2. Test Long Text Embedding

Run the comprehensive test client:

```bash
python openai_embedding_long_text_client.py
```

## 📁 Files

| File | Description |
|------|-------------|
| `openai_embedding_long_text_service.sh` | Server startup script with chunked processing enabled |
| `openai_embedding_long_text_client.py` | Comprehensive test client for long text embedding |
| `openai_embedding_client.py` | Basic embedding client (updated with chunked processing info) |

## ⚙️ Configuration

### Server Configuration

The key parameters for chunked processing are passed via `--override-pooler-config`:

```json
{
  "pooling_type": "CLS",
  "normalize": true,
  "enable_chunked_processing": true
}
```

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_NAME` | `intfloat/multilingual-e5-large` | Embedding model to use |
| `PORT` | `31090` | Server port |
| `GPU_COUNT` | `1` | Number of GPUs to use |
| `MAX_MODEL_LEN` | `10240` | Maximum model context length |
| `API_KEY` | `EMPTY` | API key for authentication |

## 🔧 How It Works

1. **Automatic Detection**: When input text exceeds `max_model_len`, chunked processing is triggered
2. **Smart Chunking**: Text is split at token boundaries to maintain semantic integrity
3. **Independent Processing**: Each chunk is processed separately through the model (one inference call per chunk; see the sketch below)
4. **Weighted Aggregation**: Results are combined using token count-based weighted averaging
5. **Consistent Output**: Final embeddings maintain the same dimensionality as standard processing
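
As a rough sanity check for steps 1 and 3, the number of chunks (and therefore the number of inference calls) can be estimated from the tokenized input length. A minimal sketch, assuming the default `MAX_MODEL_LEN` above and mirroring the log example later in this README:

```python
import math

max_model_len = 10240  # server's --max-model-len
input_tokens = 15000   # tokenized length of the request

# Each chunk becomes a separate inference call, so latency grows with the chunk count.
num_chunks = max(1, math.ceil(input_tokens / max_model_len))
print(f"{input_tokens} tokens -> {num_chunks} chunks")  # 15000 tokens -> 2 chunks
```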

## 📊 Performance Characteristics

| Text Length | Processing Method | Memory Usage | Speed |
|-------------|-------------------|--------------|-------|
| ≤ max_len | Standard | Normal | Fast |
| > max_len | Chunked | Reduced per chunk | Slower (multiple inferences) |

## 🧪 Test Cases

The test client demonstrates:

- **Short text**: Normal processing (baseline)
- **Medium text**: Single chunk processing
- **Long text**: Multi-chunk processing with aggregation
- **Very long text**: Many chunks processing
- **Batch processing**: Mixed-length inputs in one request (see the sketch below)
- **Consistency**: Reproducible results across runs
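
Below is a minimal sketch of the batch test, assuming the Quick Start server is running on `localhost:31090` and, as in the documentation example, is addressed by the model name `multilingual-e5-large`; it only checks that chunked and standard outputs share the same dimensionality:

```python
from openai import OpenAI

client = OpenAI(api_key="your-api-key", base_url="http://localhost:31090/v1")

# Mixed-length batch: one short and one very long input in a single request
inputs = [
    "A short sentence.",
    "A very long passage that should trigger chunked processing. " * 2000,
]
response = client.embeddings.create(input=inputs, model="multilingual-e5-large")

# Chunked and standard processing should return embeddings of the same dimension
dims = [len(item.embedding) for item in response.data]
print(f"Embedding dimensions: {dims}")
assert len(set(dims)) == 1
```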

## 🐛 Troubleshooting

### Common Issues

1. **Chunked processing not enabled**:

    ```
    ValueError: This model's maximum context length is 512 tokens...
    ```

    **Solution**: Ensure `enable_chunked_processing: true` in the pooler config

2. **Memory errors**:

    ```
    RuntimeError: CUDA out of memory
    ```

    **Solution**: Reduce `MAX_MODEL_LEN` or use fewer GPUs

3. **Slow processing**:

    **Expected**: Long text takes more time due to multiple inference calls

### Debug Information

Server logs show chunked processing activity:

```
INFO: Input length 15000 exceeds max_model_len 10240, will use chunked processing
INFO: Split input of 15000 tokens into 2 chunks
```
## 📚 Additional Resources

- [Pooling Models Documentation](../../docs/models/pooling_models.md#chunked-processing-for-long-text)
- [Supported Models List](../../docs/models/supported_models.md#text-embedding)
- [Original Feature Documentation](../../README_CHUNKED_PROCESSING.md)

## 🤝 Contributing

To extend chunked processing support to other embedding models:

1. Check model compatibility with the pooling architecture
2. Test with various text lengths
3. Validate embedding quality compared to single-chunk processing (see the sketch below)
4. Submit a PR with test cases and documentation updates
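
For step 3, one simple (and admittedly rough) check is to compare the chunked embedding of a long text against the embedding of a truncated version that fits in a single chunk; a high cosine similarity suggests the aggregation preserves the overall representation. A sketch, assuming the Quick Start server is running; all names and the character-level truncation are illustrative:

```python
import numpy as np
from openai import OpenAI

client = OpenAI(api_key="your-api-key", base_url="http://localhost:31090/v1")


def embed(text: str) -> np.ndarray:
    response = client.embeddings.create(input=text, model="multilingual-e5-large")
    return np.array(response.data[0].embedding)


long_text = "Domain-specific content for quality validation. " * 2000
short_text = long_text[:2000]  # rough truncation that stays within one chunk

chunked_emb = embed(long_text)  # aggregated over multiple chunks
single_emb = embed(short_text)  # standard single-chunk processing

cosine = float(chunked_emb @ single_emb /
               (np.linalg.norm(chunked_emb) * np.linalg.norm(single_emb)))
print(f"Cosine similarity (chunked vs. truncated): {cosine:.4f}")
```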

---

**Note**: Chunked processing is currently supported for specific embedding models. See the [supported models documentation](../../docs/models/supported_models.md#text-embedding) for the complete list.
