
deepseek-ai/DeepSeek-OCR

Frontier OCR model exploring optical context compression for LLMs, optimized for document parsing and markdown generation.

Dense · 3B parameters · 8,192-token context · vLLM 0.12.0+ · multimodal

Overview

DeepSeek-OCR is a frontier OCR model exploring optical context compression for LLMs. It is optimized for document parsing, free-form OCR, and markdown generation from images, and ships with a custom n-gram logits processor (NGramPerReqLogitsProcessor) that blocks repeated n-grams during decoding for best output quality.

Prerequisites

  • Hardware: Single GPU with >=8 GB VRAM is typically sufficient for BF16 inference.
  • vLLM: Current stable release (tested with uv pip install -U vllm --torch-backend auto).
  • Python: 3.10+

Install vLLM:

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
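
To confirm the environment before loading the model, a quick sanity check (assumes a CUDA-capable GPU):

import torch
import vllm

# Report the installed vLLM version and whether a CUDA GPU is visible.
print("vLLM:", vllm.__version__)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("No CUDA GPU detected")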

Client Usage

Offline OCR (Python)

from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor
from PIL import Image

llm = LLM(
    model="deepseek-ai/DeepSeek-OCR",
    # OCR inputs rarely share prefixes, so prefix caching only adds overhead.
    enable_prefix_caching=False,
    # Likewise, skip caching of preprocessed images.
    mm_processor_cache_gb=0,
    # Custom n-gram logits processor shipped with the model.
    logits_processors=[NGramPerReqLogitsProcessor],
)

image_1 = Image.open("path/to/your/image_1.png").convert("RGB")
image_2 = Image.open("path/to/your/image_2.png").convert("RGB")
prompt = "<image>\nFree OCR."

model_input = [
    {"prompt": prompt, "multi_modal_data": {"image": image_1}},
    {"prompt": prompt, "multi_modal_data": {"image": image_2}},
]

sampling_param = SamplingParams(
    temperature=0.0,  # greedy decoding for deterministic OCR output
    max_tokens=8192,
    # Per-request arguments consumed by the n-gram logits processor.
    extra_args=dict(
        ngram_size=30,
        window_size=90,
        whitelist_token_ids={128821, 128822},  # allow repeated <td>/</td> in tables
    ),
    skip_special_tokens=False,
)

for output in llm.generate(model_input, sampling_param):
    print(output.outputs[0].text)
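
For document-to-markdown conversion, the official example prompts use a grounding variant; a minimal sketch reusing the llm and sampling_param defined above:

# Grounding prompt from the official examples for markdown conversion.
md_prompt = "<image>\n<|grounding|>Convert the document to markdown."
md_input = [{"prompt": md_prompt, "multi_modal_data": {"image": image_1}}]

for output in llm.generate(md_input, sampling_param):
    print(output.outputs[0].text)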

Online OCR serving

Start the server with the same settings as the offline example:

vllm serve deepseek-ai/DeepSeek-OCR \
  --logits-processors vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor \
  --no-enable-prefix-caching \
  --mm-processor-cache-gb 0

Then query it with the OpenAI-compatible client:
import time
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1", timeout=3600)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png"}},
            {"type": "text", "text": "Free OCR."},
        ],
    }
]

start = time.time()
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-OCR",
    messages=messages,
    max_tokens=2048,
    temperature=0.0,
    extra_body={
        "skip_special_tokens": False,
        "vllm_xargs": {
            "ngram_size": 30,
            "window_size": 90,
            "whitelist_token_ids": [128821, 128822],
        },
    },
)
print(f"Response costs: {time.time() - start:.2f}s")
print(f"Generated text: {response.choices[0].message.content}")

Troubleshooting / Configuration Tips

  • Always load the custom NGramPerReqLogitsProcessor along with the model; it is needed for optimal OCR and markdown generation quality.
  • Unlike multi-turn chat, OCR tasks do not typically benefit from prefix caching or image reuse, so disable these features to avoid unnecessary hashing and caching overhead.
  • DeepSeek-OCR works better with plain prompts than instruction formats. See the official example prompts.
  • Depending on your hardware, tune max_num_batched_tokens for better throughput (see the sketch after this list).
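
A minimal offline sketch of that knob, assuming the setup from the offline example (the value 8192 is illustrative, not a recommendation):

llm = LLM(
    model="deepseek-ai/DeepSeek-OCR",
    enable_prefix_caching=False,
    mm_processor_cache_gb=0,
    logits_processors=[NGramPerReqLogitsProcessor],
    # Cap on tokens batched per engine step; raise on larger GPUs for
    # throughput, lower to reduce memory pressure (8192 is illustrative).
    max_num_batched_tokens=8192,
)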
