vLLM: make <|ref|>…<|/ref|> locate prompts work in run_dpsk_ocr_image.py #117
Summary
This PR updates the vLLM image runner to correctly handle locate/“rec” prompts that use <|ref|>…<|/ref|>, by wiring in the per-request n-gram logits processor and preserving special tokens only when needed. It also replaces remaining eval(...) usages with ast.literal_eval(...) for safer parsing.
Why
Users running locate/“rec” prompts via the script were not getting expected results because special tokens were stripped and the n-gram per-request logits processor wasn’t attached. The change keeps default OCR behavior unchanged while enabling the reference mode when requested.
What’s changed
- Add CLI flags (argparse sketch below):
  - --prompt to pass a prompt without editing config.py
  - --ref-mode to force reference/locate behavior
- In stream_generate(...) (see the sampling-params sketch after this list):
  - Detect reference mode if --ref-mode is set or the prompt contains <|ref|> and <|/ref|>
  - When in reference mode, attach NoRepeatNGramLogitsProcessor(ngram_size=30, window_size=90, whitelist_token_ids={128821, 128822}) and set skip_special_tokens=False
  - Otherwise, leave the defaults (no logits processor; skip_special_tokens=True)
- Replace unsafe eval(...) with ast.literal_eval(...) where coordinates/geometry are parsed (example below)
- Minor robustness: initialize final_output and avoid duplicate image loads
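A minimal argparse sketch of the new flags. Only --prompt and --ref-mode are introduced by this PR; the --input argument and help strings shown here are illustrative:

```python
import argparse

# Sketch of the CLI surface; defaults and help text are illustrative.
parser = argparse.ArgumentParser(description="DeepSeek-OCR vLLM image runner")
parser.add_argument("--input", type=str, help="path to the input image")
parser.add_argument("--prompt", type=str, default=None,
                    help="prompt to use instead of the one in config.py")
parser.add_argument("--ref-mode", action="store_true",
                    help="force reference/locate behavior for <|ref|>…<|/ref|> prompts")
args = parser.parse_args()
```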
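A sketch of the reference-mode branch inside stream_generate(...), assuming a vLLM version whose SamplingParams accepts logits_processors and assuming the repo's NoRepeatNGramLogitsProcessor; the import path, function name, temperature, and max_tokens below are assumptions, while the processor arguments are the ones named above:

```python
from vllm import SamplingParams
# Assumed import path for the repo-local processor; adjust to the actual module.
from process.ngram_norepeat import NoRepeatNGramLogitsProcessor


def build_sampling_params(prompt: str, ref_mode: bool) -> SamplingParams:
    # Reference mode: forced by --ref-mode or inferred from the prompt tags.
    is_ref = ref_mode or ("<|ref|>" in prompt and "<|/ref|>" in prompt)
    if is_ref:
        return SamplingParams(
            temperature=0.0,   # illustrative; not specified by this PR
            max_tokens=8192,   # illustrative; not specified by this PR
            logits_processors=[
                # Processor arguments are the values stated in this PR.
                NoRepeatNGramLogitsProcessor(
                    ngram_size=30,
                    window_size=90,
                    whitelist_token_ids={128821, 128822},
                )
            ],
            skip_special_tokens=False,  # keep special tags such as <|ref|> in the output
        )
    # Default OCR path stays as before: no logits processor, special tokens stripped.
    return SamplingParams(temperature=0.0, max_tokens=8192,
                          skip_special_tokens=True)
```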
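For the eval(...) to ast.literal_eval(...) change, literal_eval only accepts Python literals (lists, tuples, numbers, strings), so a model-produced coordinate string cannot execute arbitrary code. The coordinate string below is illustrative:

```python
import ast

coord_str = "[[120, 45, 380, 92]]"    # illustrative model output, not real data
coords = ast.literal_eval(coord_str)  # parses literals only; eval() would run arbitrary code
```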
Files touched
DeepSeek-OCR-master/DeepSeek-OCR-vllm/run_dpsk_ocr_image.py
How to use
python DeepSeek-OCR-master/DeepSeek-OCR-vllm/run_dpsk_ocr_image.py \
  --input path/to/your.png \
  --prompt "\nLocate <|ref|>title<|/ref|> in the image." \
  --ref-mode
Testing
Syntax/compile sanity check (CPU): python -m py_compile DeepSeek-OCR-master/DeepSeek-OCR-vllm/run_dpsk_ocr_image.py
Functional (GPU): run the command above; outputs include result_ori.mmd, result.mmd, images/*.jpg (crops), and result_with_boxes.jpg.
Backward compatibility
Default OCR behavior is unchanged unless --ref-mode is provided or <|ref|>…<|/ref|> is detected in the prompt.
Fixes #114