
Enable DeepSeek-OCR support in latest vLLM 0.11.0 (v1 Engine) with custom modifications #231

@wangwk97

Description

Environment

pip install vllm==0.11.0
pip install PyMuPDF img2pdf einops easydict addict Pillow
pip install flash_attn==2.8.1 --no-build-isolation

In vLLM 0.11.0, the legacy v0 engine entry points (AsyncLLMEngine/LLMEngine) are internally aliased to the new v1 engine:

# async_llm_engine.py & llm_engine.py
from vllm.v1.engine.async_llm import AsyncLLM
AsyncLLMEngine = AsyncLLM  # type: ignore

Additionally, per-request logits processors are no longer supported directly. Instead, vLLM now registers logits processors globally at engine construction, and the new AdapterLogitsProcessor interface (vllm.v1.sample.logits_processor.AdapterLogitsProcessor) adapts per-request logic to this global scheme.
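
For orientation, the object being adapted is a v0-style per-request logits processor: a callable that takes the tokens generated so far plus the next-step logits and returns (possibly modified) logits. A minimal sketch of that shape, assuming the two-argument form (a three-argument form that also receives prompt token ids exists as well), with hypothetical names:

import torch

class MinimalPerRequestProcessor:
    # Hypothetical illustration only; NoRepeatNGramLogitsProcessor in this
    # repo plays this role for real.
    def __call__(self, output_token_ids: list[int], logits: torch.Tensor) -> torch.Tensor:
        # A real processor would inspect output_token_ids and mask banned
        # tokens with float("-inf") before returning the logits.
        return logits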

To enable DeepSeek-OCR compatibility with vLLM 0.11.0, the following changes are required:


1. Add v1-Compatible Logits Processor Adapter

Create ngram_norepeat_v1_adapter.py in DeepSeek-OCR-vllm/process/:

from .ngram_norepeat import NoRepeatNGramLogitsProcessor
from vllm.v1.sample.logits_processor import AdapterLogitsProcessor


class NoRepeatNGramAdaptor(AdapterLogitsProcessor):
    def is_argmax_invariant(self) -> bool:
        # Note: n-gram blocking masks tokens and can therefore change the
        # argmax. True is what was validated here, but if the engine skips
        # argmax-invariant processors for greedy requests, False may be safer.
        return True

    def new_req_logits_processor(self, params):
        # Called once per request: build the per-request processor from the
        # parameters carried in SamplingParams.extra_args (see step 3c).
        return NoRepeatNGramLogitsProcessor(
            ngram_size=params.extra_args["ngram_size"],
            window_size=params.extra_args["window_size"],
            whitelist_token_ids=params.extra_args["whitelist_token_ids"],
        )
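
If some requests might reach the engine without these settings, a defensive variant of new_req_logits_processor (an assumption for robustness, not something the repo requires) can return None so the adapter simply attaches no processor to that request:

    def new_req_logits_processor(self, params):
        extra = params.extra_args or {}
        if "ngram_size" not in extra:
            # No n-gram settings on this request: attach no processor.
            return None
        return NoRepeatNGramLogitsProcessor(
            ngram_size=extra["ngram_size"],
            window_size=extra["window_size"],
            whitelist_token_ids=extra["whitelist_token_ids"],
        )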

2. Update DeepSeek-OCR Core Code

a. Handle v0/v1 SamplingMetadata import (line 14 in deepseek_ocr.py)

# Prefer the v0 location; fall back to the v1 module on vLLM 0.11.0+.
try:
    from vllm.model_executor import SamplingMetadata
except ImportError:
    from vllm.v1.sample.metadata import SamplingMetadata

b. Update _call_hf_processor signatures (lines ~154 and ~231)

Both instances should accept **kwargs to align with v1’s tokenizer kwargs handling:

def _call_hf_processor(
    self,
    prompt: str,
    mm_data: Mapping[str, object],
    mm_kwargs: Mapping[str, object],
    **kwargs,  # tokenizer kwargs in v1
) -> BatchFeature:
    ...

c. Propagate **kwargs in _cached_apply_hf_processor (lines ~231–254)

Ensure kwargs are forwarded in the overridden caching method:

def _cached_apply_hf_processor(
    self,
    prompt: Union[str, list[int]],
    mm_data_items: MultiModalDataItems,
    hf_processor_mm_kwargs: Mapping[str, object],
    **kwargs  # forward to underlying processor
) -> tuple[list[int], MultiModalKwargs, bool]:
    if mm_data_items.get_count("image", strict=False) > 2:
        return self._apply_hf_processor_main(
            prompt=prompt,
            mm_items=mm_data_items,
            hf_processor_mm_kwargs=hf_processor_mm_kwargs,
            enable_hf_prompt_update=True,
            **kwargs
        )
    return super()._cached_apply_hf_processor(
        prompt=prompt,
        mm_data_items=mm_data_items,
        hf_processor_mm_kwargs=hf_processor_mm_kwargs,
        **kwargs
    )

3. Update Inference Script for v1 Engine

In run_dpsk_ocr_image.py:

a. Pin the v1 engine explicitly (optional, since the v0 entry points already redirect to v1, but harmless):

import os
os.environ['VLLM_USE_V1'] = '1'

b. Initialize engine with v1-compatible logits processor:

engine_args = AsyncEngineArgs(
    model=MODEL_PATH,
    hf_overrides={"architectures": ["DeepseekOCRForCausalLM"]},
    block_size=256,
    max_model_len=8192,
    enforce_eager=False,
    trust_remote_code=True,
    tensor_parallel_size=1,
    gpu_memory_utilization=0.75,
    # Registered globally at engine construction; format is "module.path:ClassName".
    logits_processors=["process.ngram_norepeat_v1_adapter:NoRepeatNGramAdaptor"],
)
engine = AsyncLLMEngine.from_engine_args(engine_args)

c. Pass n-gram parameters via extra_args in SamplingParams:

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=8192,
    skip_special_tokens=False,
    extra_args={  # read per request by NoRepeatNGramAdaptor.new_req_logits_processor
        "ngram_size": 30,
        "window_size": 90,
        "whitelist_token_ids": {128821, 128822}
    }
)
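
Putting the pieces together, generation follows the usual AsyncLLM pattern. A minimal sketch; the exact prompt payload (in particular the multi_modal_data dict) depends on the repo's prompt construction, so treat it as illustrative:

import asyncio

async def generate_once(prompt_payload, request_id="deepseek-ocr-0"):
    final_output = None
    # engine.generate is an async generator yielding incremental RequestOutputs.
    async for request_output in engine.generate(prompt_payload, sampling_params, request_id):
        final_output = request_output
    return final_output.outputs[0].text

# text = asyncio.run(generate_once(
#     {"prompt": PROMPT, "multi_modal_data": {"image": image_features}}))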

Important: The OCR processor must not be invoked before the vLLM engine is fully initialized. Specifically, avoid calling DeepseekOCRProcessor().tokenize_with_images(...) before engine startup. This issue does not occur with the v0 engine.

Instead, ensure engine initialization completes first, then process the image:

engine = AsyncLLMEngine.from_engine_args(engine_args)
# After engine is created
if '<image>' in PROMPT:
    image_features = DeepseekOCRProcessor().tokenize_with_images(
        images=[image], bos=True, eos=True, cropping=CROP_MODE
    )
else:
    image_features = ''
