# Long Text Embedding with Chunked Processing

This directory contains examples for using vLLM's **chunked processing** feature to handle long text embedding that exceeds the model's maximum context length.

## 🚀 Quick Start

### 1. Start the Server

Use the provided script to start a vLLM server with chunked processing enabled:

```bash
# Basic usage
./openai_embedding_long_text_service.sh

# Custom configuration
MODEL_NAME="intfloat/multilingual-e5-large" \
PORT=31090 \
MAX_MODEL_LEN=10240 \
./openai_embedding_long_text_service.sh
```
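
Once the server is up, you can sanity-check it before running any embeddings. The request below is a minimal sketch that assumes the defaults from the service script (port `31090`, API key `EMPTY`):

```bash
# List the served models to confirm the server is reachable
# (assumes the script's defaults: port 31090, API key "EMPTY")
curl -s http://localhost:31090/v1/models \
  -H "Authorization: Bearer EMPTY"
```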

### 2. Test Long Text Embedding

Run the comprehensive test client:

```bash
python openai_embedding_long_text_client.py
```
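
The client drives the OpenAI-compatible `/v1/embeddings` endpoint. If you prefer to exercise the endpoint directly, a minimal request looks like the sketch below; the port, API key, and model name assume the defaults from the service script, and the `input` string is a placeholder for your own long text:

```bash
# Minimal embedding request against the OpenAI-compatible endpoint
curl -s http://localhost:31090/v1/embeddings \
  -H "Authorization: Bearer EMPTY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "intfloat/multilingual-e5-large",
        "input": "Replace this with text long enough to exceed max_model_len..."
      }'
```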

## 📁 Files

| File | Description |
|------|-------------|
| `openai_embedding_long_text_service.sh` | Server startup script with chunked processing enabled |
| `openai_embedding_long_text_client.py` | Comprehensive test client for long text embedding |
| `openai_embedding_client.py` | Basic embedding client (updated with chunked processing info) |

## ⚙️ Configuration

### Server Configuration

The key setting for chunked processing is passed through `--override-pooler-config`:

```json
{
  "pooling_type": "CLS",
  "normalize": true,
  "enable_chunked_processing": true
}
```
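
For reference, a manual launch with the same override might look like the sketch below. The exact flag combination is illustrative rather than a copy of the service script; `--override-pooler-config` is the part that turns chunked processing on:

```bash
# Illustrative manual launch with chunked processing enabled
# (values mirror the script's defaults; adjust for your deployment)
vllm serve intfloat/multilingual-e5-large \
  --max-model-len 10240 \
  --override-pooler-config '{"pooling_type": "CLS", "normalize": true, "enable_chunked_processing": true}' \
  --port 31090 \
  --api-key EMPTY
```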

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_NAME` | `intfloat/multilingual-e5-large` | Embedding model to use |
| `PORT` | `31090` | Server port |
| `GPU_COUNT` | `1` | Number of GPUs to use |
| `MAX_MODEL_LEN` | `10240` | Maximum model context length |
| `API_KEY` | `EMPTY` | API key for authentication |

## 🔧 How It Works

1. **Automatic Detection**: When input text exceeds `max_model_len`, chunked processing is triggered
2. **Smart Chunking**: Text is split at token boundaries to maintain semantic integrity
3. **Independent Processing**: Each chunk is processed separately through the model
4. **Weighted Aggregation**: Results are combined using token count-based weighted averaging (see the formula after this list)
5. **Consistent Output**: Final embeddings maintain the same dimensionality as standard processing
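
As a sketch of step 4: if chunk $i$ produces embedding $e_i$ from $n_i$ tokens, the token-count-weighted aggregation described above amounts to

$$
e_{\text{final}} = \frac{\sum_i n_i \, e_i}{\sum_i n_i}
$$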

## 📊 Performance Characteristics

| Text Length | Processing Method | Memory Usage | Speed |
|-------------|------------------|--------------|-------|
| ≤ max_len | Standard | Normal | Fast |
| > max_len | Chunked | Reduced per chunk | Slower (multiple inferences) |

## 🧪 Test Cases

The test client demonstrates:

- ✅ **Short text**: Normal processing (baseline)
- ✅ **Medium text**: Single chunk processing
- ✅ **Long text**: Multi-chunk processing with aggregation
- ✅ **Very long text**: Processing with many chunks
- ✅ **Batch processing**: Mixed-length inputs in one request
- ✅ **Consistency**: Reproducible results across runs

## 🐛 Troubleshooting

### Common Issues

1. **Chunked processing not enabled**:

    ```
    ValueError: This model's maximum context length is 512 tokens...
    ```

    **Solution**: Ensure `enable_chunked_processing: true` in the pooler config

2. **Memory errors**:

    ```
    RuntimeError: CUDA out of memory
    ```

    **Solution**: Reduce `MAX_MODEL_LEN` or increase `GPU_COUNT`

3. **Slow processing**:

    **Expected**: Long text takes more time due to multiple inference calls

### Debug Information

Server logs show chunked processing activity:

```
INFO: Input length 15000 exceeds max_model_len 10240, will use chunked processing
INFO: Split input of 15000 tokens into 2 chunks
```

## 📚 Additional Resources

- [Pooling Models Documentation](../../docs/models/pooling_models.md#chunked-processing-for-long-text)
- [Supported Models List](../../docs/models/supported_models.md#text-embedding)
- [Original Feature Documentation](../../README_CHUNKED_PROCESSING.md)

## 🤝 Contributing

To extend chunked processing support to other embedding models:

1. Check model compatibility with the pooling architecture
2. Test with various text lengths
3. Validate embedding quality compared to single-chunk processing
4. Submit a PR with test cases and documentation updates

---

**Note**: Chunked processing is currently supported for specific embedding models. See the [supported models documentation](../../docs/models/supported_models.md#chunked-processing-for-long-text) for the complete list.