DeepSeek-OCR is a vision-language OCR model designed to study how long textual contexts can be compressed through optical 2D mapping. It was introduced in the paper DeepSeek-OCR: Contexts Optical Compression by Wei, Sun, and Li (2025). (arxiv.org)
Code, model weights, and inference examples are publicly available in the official GitHub repository. (GitHub)
DeepSeek-OCR is an OCR-oriented vision-language model that investigates document understanding from an LLM-centric compression perspective. Instead of relying on a heavy vision token stream, it aims to compress high-resolution, text-rich images into a relatively small number of visual tokens, which are then decoded into text or structured outputs. (arxiv.org)
The model consists of two main components: DeepEncoder, which performs the visual compression, and a DeepSeek3B-MoE decoder with roughly 570M activated parameters (DeepSeek3B-MoE-A570M). The encoder is designed to keep activation memory low under high-resolution inputs while achieving high compression ratios, enabling practical OCR over complex documents with long textual content. (arxiv.org)
Key traits of DeepSeek-OCR:
- Optical context compression: Compresses text-rich page images into a manageable number of vision tokens before decoding.
- LLM-centric design: Treats OCR as a long-context compression and reconstruction problem rather than only a perception task.
- Structured document outputs: Supports prompts such as converting documents to markdown, free OCR, figure parsing, and grounding-oriented localization.
- Multiple resolution modes: Supports native and dynamic resolution settings with different token budgets.
- Practical inference support: Can be used through both vLLM and Transformers inference pipelines. (GitHub)
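The compression framing above can be made concrete with a little arithmetic: the optical compression ratio is simply the number of text tokens a page would occupy divided by the number of vision tokens it is compressed into. The sketch below is illustrative; the helper name and the example token counts are ours, not part of the model's API (the paper reports results at compression ratios of roughly 10x and 20x).

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Optical compression ratio: how many text tokens each vision token stands in for."""
    if vision_tokens <= 0:
        raise ValueError("vision_tokens must be positive")
    return text_tokens / vision_tokens

# A page whose text would occupy ~1000 text tokens, encoded into 256 vision tokens:
ratio = compression_ratio(1000, 256)
print(f"{ratio:.1f}x compression")
```

At ratios in this range the model is operating well inside the regime the paper studies; the interesting research question is how far the ratio can be pushed before reconstruction quality degrades.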

Figure 1 (from the paper) illustrates the workflow of DeepSeek-OCR:
- A document or image is provided as input, typically using prompts such as “Convert the document to markdown” or “OCR this image.” (GitHub)
- The DeepEncoder compresses the high-resolution visual input into a compact set of visual tokens. (arxiv.org)
- A DeepSeek3B-MoE-A570M decoder reconstructs textual or structured output from the compressed visual representation. (arxiv.org)
- The open-source model supports several native resolution modes, including 512×512, 640×640, 1024×1024, and 1280×1280, as well as a dynamic resolution mode. (GitHub)
- The project also supports PDF OCR, batched evaluation, and online serving through upstream vLLM integration. (GitHub)
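As a rough sanity check on the native resolution modes listed above: the repository reports 64, 100, 256, and 400 vision tokens for the 512, 640, 1024, and 1280 modes respectively, which is consistent with a budget of (side / 64)² tokens per image. The helper below encodes that relationship; the function name is ours, and the formula is inferred from the published mode/token pairs rather than taken from an official API.

```python
def vision_token_budget(side: int) -> int:
    """Approximate vision-token count for a native side x side resolution mode.

    Inferred from the published mode/token pairs:
    512 -> 64, 640 -> 100, 1024 -> 256, 1280 -> 400, i.e. (side / 64) ** 2.
    """
    if side % 64 != 0:
        raise ValueError("native modes use sides divisible by 64")
    return (side // 64) ** 2

for side in (512, 640, 1024, 1280):
    print(side, vision_token_budget(side))
```

The dynamic ("Gundam"-style) mode tiles the page and therefore does not follow this single-image formula.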
DeepSeek-OCR is intended for:
- OCR of text-rich documents with compact visual tokenization.
- Document-to-markdown conversion and structured document parsing.
- Efficient PDF and page-level OCR pipelines where token efficiency is important.
- Research on long-context visual-text compression and LLM-oriented document understanding. (GitHub)
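The use cases above map onto different task prompts. The sketch below shows one way to assemble them; the `<image>` placeholder and `<|grounding|>` tag are modeled on the examples shown in the repository's README, but treat the exact template strings as assumptions to verify against the repo before use.

```python
# Prompt templates modeled on the DeepSeek-OCR README examples;
# verify the exact strings against the official repository.
PROMPTS = {
    "markdown": "<image>\n<|grounding|>Convert the document to markdown.",
    "free_ocr": "<image>\nFree OCR.",
    "figure":   "<image>\nParse the figure.",
}

def build_prompt(task: str) -> str:
    """Return the prompt string for a supported OCR task."""
    try:
        return PROMPTS[task]
    except KeyError:
        raise ValueError(f"unknown task: {task!r}") from None

print(build_prompt("markdown"))
```

The prompt selects the output format (markdown with layout grounding, plain text, or figure parsing); the image itself is passed separately to the inference pipeline.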
Limitations:
- DeepSeek-OCR is presented as an initial investigation; it should be viewed as a research model exploring the compression role of vision encoders rather than a production-hardened OCR system. (arxiv.org)
- The documented environment targets CUDA 11.8 and PyTorch 2.6.0, so deployment may require a compatible GPU software stack. (GitHub)
- Some use cases may require specialized setup through vLLM, Transformers, or FlashAttention-based dependencies. (GitHub)
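Given the environment constraints above, a minimal setup might look like the following. This is a hedged sketch, not the official install procedure: the CUDA 11.8 and PyTorch 2.6.0 targets come from the repository's documentation, but the Python version, wheel index URL, and package list are assumptions to check against the README.

```shell
# Hypothetical environment sketch; pin exact versions per the official README.
conda create -n deepseek-ocr python=3.12 -y
conda activate deepseek-ocr
pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu118
pip install transformers flash-attn --no-build-isolation
```

For vLLM-based serving, the repository additionally documents a specific vLLM version; mixing incompatible torch/vLLM/flash-attn builds is the most common setup failure.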
- The repository already points to DeepSeek-OCR2 as a newer release, so the original DeepSeek-OCR may not be the latest generation in this line. (GitHub)
BibTeX entry and citation info:
```bibtex
@article{wei2025deepseek,
  title={DeepSeek-OCR: Contexts Optical Compression},
  author={Wei, Haoran and Sun, Yaofeng and Li, Yukun},
  journal={arXiv preprint arXiv:2510.18234},
  year={2025}
}
```