Multi-Arc Serving release 0.1.0
Overview
This release introduces the latest update to the Multi-Arc vLLM serving solution for Intel Xeon + Arc platforms, built on ipex-llm vLLM. It delivers low-latency, high-throughput LLM serving with improved model compatibility and resource efficiency. Major component upgrades: vLLM 0.6.6, PyTorch 2.6, oneAPI 2025.0, and oneCCL patch 0.0.6.6.
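For orientation, the sketch below queries a running Multi-Arc serving endpoint through vLLM's OpenAI-compatible API. It is a minimal client example, assuming the server listens on localhost:8000 and serves a model registered under the placeholder name Qwen2-VL-7B-Instruct; adjust both to match your deployment.

```python
# Minimal client sketch against the OpenAI-compatible API exposed by vLLM.
# Assumptions: the server listens on localhost:8000 and the served model is
# registered as "Qwen2-VL-7B-Instruct"; both are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen2-VL-7B-Instruct",
    messages=[{"role": "user", "content": "Summarize the benefits of multi-GPU serving."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```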
New Features
- Optimized vLLM serving for Intel Xeon + ARC multi-GPU platforms, enabling lower latency and higher throughput.
- Added support for a broad range of LLM models.
- Enhanced support for loading models with minimal memory requirements.
- Refined Docker image for improved ease of use and deployment.
- Improved WebUI model connectivity and stability.
- Added VLLM_LOG_OUTPUT=1 option to enable detailed input/output logging for vLLM.
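The sketch below shows one way to start the server with VLLM_LOG_OUTPUT=1 so prompts and generated outputs appear in the serving logs. It assumes the standard vllm.entrypoints.openai.api_server entrypoint (the Docker image in this release may wrap it in its own start script); the model path and tensor-parallel size are placeholders.

```python
# Sketch: launch the vLLM OpenAI-compatible server with VLLM_LOG_OUTPUT=1
# so detailed input/output logging is enabled.
# Assumptions: standard vLLM entrypoint; placeholder model path, port, and
# tensor-parallel size; adjust to your Xeon + Arc setup.
import os
import subprocess

env = {**os.environ, "VLLM_LOG_OUTPUT": "1"}  # enable detailed I/O logging

subprocess.run(
    [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", "/llm/models/DeepSeek-R1-Distill-Qwen-14B",  # placeholder path
        "--port", "8000",
        "--tensor-parallel-size", "2",  # split the model across two Arc GPUs
    ],
    env=env,
    check=True,
)
```

The same effect can be achieved by exporting VLLM_LOG_OUTPUT=1 in the shell, or by passing it to the container with a docker run -e argument, before starting the server.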
Bug Fixes
- Resolved multimodal issues including get_image failures and inference errors with models such as MiniCPM-V-2_6, Qwen2-VL, and GLM-4v-9B.
- Fixed Qwen2-VL multi-request crash by removing Qwen2VisionAttention’s attention_mask and addressing mrope_positions instability.
- Updated profile_run usage to avoid OOM (Out of Memory) crashes.
- Resolved GQA kernel issues causing errors with multiple concurrent outputs.
- Fixed a crash that occurred in specific cases when --enable-prefix-caching was set to none.
- Addressed a low-bit overflow that produced garbled !!!!!! output with DeepSeek-R1-Distill-Qwen-14B.
- Resolved GPTQ- and AWQ-related errors, improving compatibility across more models.