@aryanrahar

Summary
This PR updates the vLLM image runner to correctly handle locate/“rec” prompts that use <|ref|>…<|/ref|>. It wires in the per-request n-gram logits processor and preserves special tokens only when needed. It also replaces the remaining eval(...) calls with ast.literal_eval(...) for safer parsing.

Why
Users running locate/“rec” prompts via the script were not getting expected results because special tokens were stripped and the n-gram per-request logits processor wasn’t attached. The change keeps default OCR behavior unchanged while enabling the reference mode when requested.

What’s changed
- Add CLI flags:
  - --prompt to pass a prompt without editing config.py
  - --ref-mode to force reference/locate behavior
- In stream_generate(...):
  - Detect reference mode if --ref-mode is set or the prompt contains both <|ref|> and <|/ref|>
  - When in reference mode, attach NoRepeatNGramLogitsProcessor(ngram_size=30, window_size=90, whitelist_token_ids={128821,128822}) and set skip_special_tokens=False
  - Otherwise, leave defaults (no logits processor; skip_special_tokens=True)
- Replace unsafe eval(...) with ast.literal_eval(...) where coordinates/geometry are parsed
- Minor robustness: initialize final_output and avoid duplicate image loads
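
The mode selection described above can be sketched as follows. This is a minimal, standalone illustration rather than the code in the PR: the helper names is_reference_mode and decoding_options are hypothetical, the closing tag is taken as written in the "How to use" command, and the returned dictionary only stands in for the arguments the script would pass to the n-gram logits processor and to vLLM's sampling parameters.

```python
REF_OPEN = "<|ref|>"
REF_CLOSE = "<|/ref|>"  # closing tag as written in the usage command


def is_reference_mode(prompt: str, force: bool = False) -> bool:
    """Return True when the request should use reference/locate behavior.

    Reference mode is on if the caller forces it (the --ref-mode flag) or
    if the prompt contains both the opening and the closing ref tag.
    """
    return force or (REF_OPEN in prompt and REF_CLOSE in prompt)


def decoding_options(prompt: str, force: bool = False) -> dict:
    """Pick decoding options for one request (hypothetical helper).

    In reference mode the special tokens are kept in the decoded text
    (skip_special_tokens=False) and the n-gram logits processor is
    attached with the values listed in the PR description. Otherwise
    the defaults are left alone.
    """
    if is_reference_mode(prompt, force):
        return {
            "use_ngram_processor": True,
            "ngram_size": 30,
            "window_size": 90,
            "whitelist_token_ids": {128821, 128822},
            "skip_special_tokens": False,
        }
    return {"use_ngram_processor": False, "skip_special_tokens": True}


locate = decoding_options("Locate <|ref|>title<|/ref|> in the image.")
plain = decoding_options("Free OCR.")
forced = decoding_options("Free OCR.", force=True)
```

Keeping the decision in one small function makes the rule easy to test on CPU and gives a single place to document why the two settings change together.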

Files touched
DeepSeek-OCR-master/DeepSeek-OCR-vllm/run_dpsk_ocr_image.py

How to use
python DeepSeek-OCR-master/DeepSeek-OCR-vllm/run_dpsk_ocr_image.py \
  --input path/to/your.png \
  --prompt "\nLocate <|ref|>title<|/ref|> in the image." \
  --ref-mode
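
For readers who want to see how flags like these are typically declared, here is a small argparse sketch. It is an assumption about the shape of the parser, not a copy of the script: the build_parser name, the help strings, and the defaults are illustrative; only the flag names --input, --prompt, and --ref-mode come from this PR.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the three flags used in the command above."""
    parser = argparse.ArgumentParser(
        description="Illustrative CLI for an OCR image runner."
    )
    parser.add_argument("--input", help="path to the input image")
    # None means "no override": the caller falls back to its configured prompt.
    parser.add_argument("--prompt", default=None, help="prompt override")
    # store_true makes the flag False unless it is present on the command line.
    parser.add_argument(
        "--ref-mode",
        action="store_true",
        help="force reference/locate behavior",
    )
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(
    [
        "--input", "path/to/your.png",
        "--prompt", "Locate <|ref|>title<|/ref|> in the image.",
        "--ref-mode",
    ]
)
defaults = build_parser().parse_args([])
```

argparse exposes --ref-mode as args.ref_mode, so the value can be passed straight through as the "force" switch for mode detection.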

Testing
Syntax sanity (CPU): python -m py_compile DeepSeek-OCR-master/DeepSeek-OCR-vllm/run_dpsk_ocr_image.py (this only byte-compiles the file; it does not exercise argument parsing)
Functional (GPU): run the command above; outputs include result_ori.mmd, result.mmd, images/*.jpg (crops), and result_with_boxes.jpg.
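
The eval(...) to ast.literal_eval(...) change can also be checked on CPU in isolation. The sketch below is not taken from the script: parse_boxes and is_safe_literal are hypothetical helpers, and the "[[x1, y1, x2, y2]]" coordinate format is an assumption about what the model emits. It shows that literal_eval accepts plain list literals and refuses anything that would execute code.

```python
import ast


def parse_boxes(text: str) -> list:
    """Parse a string such as '[[10, 20, 300, 400]]' into a list of boxes.

    ast.literal_eval only evaluates Python literals (numbers, strings,
    lists, tuples, dicts, ...), so function calls and attribute access
    in the input raise an error instead of running.
    """
    value = ast.literal_eval(text)
    if not isinstance(value, list):
        raise ValueError("expected a list of boxes")
    return [list(box) for box in value]


def is_safe_literal(text: str) -> bool:
    """Return True if text parses as a list of boxes, False otherwise."""
    try:
        parse_boxes(text)
    except (ValueError, SyntaxError, TypeError):
        return False
    return True


boxes = parse_boxes("[[10, 20, 300, 400]]")
accepted = is_safe_literal("[[1, 2, 3, 4], [5, 6, 7, 8]]")
rejected = is_safe_literal("__import__('os').getcwd()")
```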

Backward compatibility
Default OCR behavior is unchanged unless --ref-mode is provided or <|ref|>…<|/ref|> is detected in the prompt.

Fixes #114

…ts; use literal_eval

Signed-off-by: Aryan Rahar <aryanrahar1@gmail.com>
@Eliezermga

Good improvement overall! It might help to add short docstrings or inline comments around the reference mode detection logic (the check for --ref-mode or <|ref|>) to clarify how the n-gram logits processor interacts with special tokens. That would make the intent clearer for future contributors.
