How to Use DeepSeek-OCR: A Step-by-Step Tutorial

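DeepSeek-OCR is an optical character recognition (OCR) tool that converts images and PDF documents into structured text. This tutorial walks through installing, configuring, and using DeepSeek-OCR step by step.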

Open-source repository: https://github.com/deepseek-ai/DeepSeek-OCR/tree/main

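Step 1: Prepare the Environment

System requirements

Operating system: Linux, Windows, or macOS (the CUDA and vLLM steps below assume Linux with an NVIDIA GPU, since the vLLM wheel used is a manylinux x86_64 build)
Python version: 3.12.9
CUDA version: 11.8 or later (the install commands below target CUDA 11.8)
PyTorch version: 2.6.0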

Hardware requirements

Recommended GPU: A100-40G or a card with comparable performance
Memory: at least 16 GB of RAM
Storage: at least 10 GB of free disk space
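
Some of these requirements can be checked before installing anything. The following is a minimal sketch using only the Python standard library. The function names and the choice to measure free space in the current directory are illustrative assumptions, not part of DeepSeek-OCR.

```python
import shutil
import sys

# Thresholds taken from the requirements listed above.
REQUIRED_PYTHON = (3, 12)
REQUIRED_FREE_GB = 10


def python_ok(version_info=sys.version_info, required=REQUIRED_PYTHON):
    """Return True if the interpreter's major.minor matches the required one."""
    return tuple(version_info[:2]) == tuple(required)


def free_disk_gb(path="."):
    """Return free disk space at `path` in gigabytes (1 GB = 1024**3 bytes)."""
    return shutil.disk_usage(path).free / 1024**3


if __name__ == "__main__":
    print("Python 3.12.x:", python_ok())
    print(f"Free disk: {free_disk_gb():.1f} GB (need {REQUIRED_FREE_GB})")
```

GPU and CUDA checks are left out here because they depend on PyTorch, which is installed in a later step.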

Step 2: Download the Project

Clone the GitHub repository

git clone https://github.com/deepseek-ai/DeepSeek-OCR.git
cd DeepSeek-OCR

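Review the project structure

The project contains the following main files:

DeepSeek-OCR-master/ - main code directory
assets/ - resource files
requirements.txt - dependency list
README.md - project documentation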

Step 3: Configure the Environment

Create a Conda environment

conda create -n deepseek-ocr python=3.12.9 -y
conda activate deepseek-ocr

Install PyTorch

pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118

Install vLLM (recommended)

# Download the vLLM 0.8.5 wheel (cu118 build) first, then install it from the local file
pip install vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl

Install the remaining dependencies

pip install -r requirements.txt
pip install flash-attn==2.7.3 --no-build-isolation
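
After installation, it can help to confirm that the pinned versions are the ones actually present. The sketch below uses `importlib.metadata` from the standard library. The helper names are hypothetical, and the expected versions are simply the ones from the commands above.

```python
from importlib import metadata

# Versions pinned in the installation commands above,
# keyed by the distribution names used in those pip commands.
EXPECTED = {"torch": "2.6.0", "vllm": "0.8.5", "flash-attn": "2.7.3"}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_versions(expected=EXPECTED, lookup=installed_version):
    """Map each package to a tuple (expected, found, matches)."""
    report = {}
    for name, want in expected.items():
        found = lookup(name)
        # Local build tags such as "+cu118" are ignored when comparing.
        matches = found is not None and found.split("+")[0] == want
        report[name] = (want, found, matches)
    return report


if __name__ == "__main__":
    for name, (want, found, ok) in check_versions().items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: expected {want}, found {found}, {status}")
```

The `lookup` parameter exists only so the comparison logic can be exercised without the packages installed.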

Step 4: Download and Configure the Model

Automatic model download

The first time you run DeepSeek-OCR, the model is downloaded automatically from Hugging Face:

model_name = 'deepseek-ai/DeepSeek-OCR'

Configuration file settings

Edit the configuration file DeepSeek-OCR-master/DeepSeek-OCR-vllm/config.py:

Set the input path (INPUT_PATH)
Set the output path (OUTPUT_PATH)
Adjust any other relevant parameters
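
Before running the scripts, the two paths can be validated so that a typo fails early rather than partway through a job. This is a hedged sketch: `prepare_paths` is a hypothetical helper and not part of the repository, and it assumes INPUT_PATH points to a single file and OUTPUT_PATH to a directory.

```python
from pathlib import Path


def prepare_paths(input_path, output_path):
    """Validate the input file and create the output directory if needed."""
    src = Path(input_path).expanduser()
    out = Path(output_path).expanduser()
    if not src.is_file():
        raise FileNotFoundError(f"INPUT_PATH does not point to a file: {src}")
    # Create the output directory (and any missing parents) up front.
    out.mkdir(parents=True, exist_ok=True)
    return src, out
```

A call such as `prepare_paths("scan.pdf", "results/")` would return the two resolved `Path` objects, or raise if `scan.pdf` is missing; both names are placeholders.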

Step 5: Usage in Detail

Inference with vLLM (recommended)

Method 1: Process a single image

cd DeepSeek-OCR-master/DeepSeek-OCR-vllm
python run_dpsk_ocr_image.py

Method 2: Process a PDF document

python run_dpsk_ocr_pdf.py

Note: PDF processing can reach about 2500 tokens/s on an A100-40G.
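
As a rough worked example of what this throughput means, processing time can be estimated by dividing a token count by the rate. The workload below (pages and tokens per page) is a made-up illustrative figure, not a measurement.

```python
TOKENS_PER_SECOND = 2500  # figure quoted above for an A100-40G


def estimated_seconds(total_tokens, tokens_per_second=TOKENS_PER_SECOND):
    """Estimate processing time from a token count and a throughput."""
    return total_tokens / tokens_per_second


# Hypothetical workload: 200 pages at an assumed 1,000 tokens per page.
print(estimated_seconds(200 * 1000))  # 80.0
```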

Method 3: Batch evaluation

python run_dpsk_ocr_eval_batch.py

Inference with Transformers

Create a Python script

from transformers import AutoModel, AutoTokenizer
import torch
import os

# Make only GPU 0 visible to this process
os.environ["CUDA_VISIBLE_DEVICES"] = '0'

# Load the model and tokenizer
model_name = 'deepseek-ai/DeepSeek-OCR'
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True
)

model = AutoModel.from_pretrained(
    model_name,
    _attn_implementation='flash_attention_2',
    trust_remote_code=True,
    use_safetensors=True
)

# Switch to eval mode, move to the GPU, and cast to bfloat16
model = model.eval().cuda().to(torch.bfloat16)

# Define the prompt, input image, and output directory
prompt = "<image>\n<|grounding|>Convert the document to markdown."
image_file = 'your_image.jpg'
output_path = 'your/output/dir'

# Run inference
res = model.infer(
    tokenizer,
    prompt=prompt,
    image_file=image_file,
    output_path=output_path,
    base_size=1024,
    image_size=640,
    crop_mode=True,
    save_results=True,
    test_compress=True
)

Or use the ready-made script

cd DeepSeek-OCR-master/DeepSeek-OCR-hf
python run_dpsk_ocr.py
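
To apply the same inference call to a whole folder of images, one option is a small loop around it. The sketch below is an assumption-laden illustration: `collect_images` and `run_folder` are hypothetical helpers, the list of image extensions is a guess, and `infer` stands for any callable that takes an image path and an output directory.

```python
from pathlib import Path

# Extensions treated as images here; an assumption, adjust as needed.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def collect_images(folder):
    """Return image files directly inside `folder`, sorted by name."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def run_folder(folder, output_root, infer):
    """Call `infer(image_path, output_dir)` once per image.

    Each image gets its own output sub-directory named after the file stem,
    so results from different images do not overwrite each other.
    """
    results = {}
    for image in collect_images(folder):
        out_dir = Path(output_root) / image.stem
        out_dir.mkdir(parents=True, exist_ok=True)
        results[image.name] = infer(str(image), str(out_dir))
    return results
```

With the model and tokenizer loaded as in the script above, `infer` could be a thin wrapper such as `lambda img, out: model.infer(tokenizer, prompt=prompt, image_file=img, output_path=out, base_size=1024, image_size=640, crop_mode=True, save_results=True)`.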

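Step 6: Supported Modes and Configuration

Native resolution modes

Tiny: 512×512 (64 vision tokens)
Small: 640×640 (100 vision tokens)
Base: 1024×1024 (256 vision tokens)
Large: 1280×1280 (400 vision tokens)

Dynamic resolution mode

Gundam: n×640×640 + 1×1024×1024 (n local 640×640 tiles plus one 1024×1024 global view)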
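
The token counts listed for the native modes are all consistent with (side / 64) squared, and under the same rule a Gundam input with n tiles would use n × 100 + 256 vision tokens. The sketch below encodes that arithmetic together with a mode table. The Gundam entry mirrors the `base_size`, `image_size`, and `crop_mode` values from the Transformers example in Step 5; the native-mode entries are an assumption (both sizes set to the mode's resolution, cropping off), and the counts ignore any extra separator tokens the model may add.

```python
# Mode -> keyword arguments for model.infer.
# Only the "gundam" entry is taken from the example in this tutorial;
# the native-mode entries are assumed, not confirmed by the source.
MODES = {
    "tiny":   {"base_size": 512,  "image_size": 512,  "crop_mode": False},
    "small":  {"base_size": 640,  "image_size": 640,  "crop_mode": False},
    "base":   {"base_size": 1024, "image_size": 1024, "crop_mode": False},
    "large":  {"base_size": 1280, "image_size": 1280, "crop_mode": False},
    "gundam": {"base_size": 1024, "image_size": 640,  "crop_mode": True},
}


def vision_tokens(side):
    """Tokens for one square view: (side / 64) squared, matching the list above."""
    return (side // 64) ** 2


def gundam_tokens(n_tiles):
    """n local 640x640 tiles plus one 1024x1024 global view."""
    return n_tiles * vision_tokens(640) + vision_tokens(1024)


print(vision_tokens(512))  # 64
print(gundam_tokens(4))    # 656
```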

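Step 7: Prompt Templates

Common prompts

# Convert a document to Markdown
prompt = "<image>\n<|grounding|>Convert the document to markdown."

# General OCR
prompt = "<image>\n<|grounding|>OCR this image."

# OCR without layout information
prompt = "<image>\nFree OCR."

# Parse a figure or chart
prompt = "<image>\nParse the figure."

# Detailed description
prompt = "<image>\nDescribe this image in detail."

# Locate specific content (replace "target text" with the text to find)
prompt = "<image>\nLocate <|ref|>target text<|/ref|> in the image."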
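
When several of these prompts are used in one script, a small lookup keeps them in one place. The following is a sketch only: the task keys and the `build_prompt` helper are hypothetical names, while the prompt strings are copied from the list above.

```python
# Prompt templates copied from the list above; the dictionary keys are made up.
PROMPTS = {
    "markdown": "<image>\n<|grounding|>Convert the document to markdown.",
    "ocr": "<image>\n<|grounding|>OCR this image.",
    "free_ocr": "<image>\nFree OCR.",
    "figure": "<image>\nParse the figure.",
    "describe": "<image>\nDescribe this image in detail.",
}


def build_prompt(task, target=None):
    """Return the prompt for `task`; the 'locate' task needs the text to find."""
    if task == "locate":
        if not target:
            raise ValueError("the 'locate' task requires a target string")
        return f"<image>\nLocate <|ref|>{target}<|/ref|> in the image."
    try:
        return PROMPTS[task]
    except KeyError:
        raise ValueError(f"unknown task: {task!r}") from None
```

For example, `build_prompt("locate", "Invoice No.")` fills the `<|ref|>` slot with the given text, and an unknown task name raises `ValueError` instead of silently sending a malformed prompt.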

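Step 8: Troubleshooting

Installation issues

If you see the following error while installing vLLM:

vllm 0.8.5+cu118 requires transformers>=4.51.1

pip reports this as a dependency conflict. It is expected and does not affect usage.

Out of system memory

Reduce batch_size
Use a smaller resolution mode
Close unnecessary programs

Out of GPU memory

Use a quantized model
Reduce the number of concurrent jobs
Reduce the image size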

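Step 9: Performance Tips

Hardware

Use a high-performance GPU (such as an A100 or H100)
Make sure enough GPU memory is available
Use SSD storage to improve I/O speed

Software

Use vLLM rather than Transformers for better performance
Enable Flash Attention 2
Choose the resolution mode that suits the task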

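Summary

DeepSeek-OCR is a capable OCR tool. After following the steps in this tutorial, you should be able to:

Install and configure the environment
Use the basic workflows
Understand the different inference modes
Resolve common problems
Tune performance

If you run into problems, check the Issues section of the GitHub repository or the official documentation.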
