MinerU: An Open-Source PDF Parsing Tool from China

Preface

Parsing data out of a PDF is difficult; almost every vendor offers PDF-to-Word conversion only as a paid product.

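PDF rendering is based on a subset of PostScript, and PostScript is a Turing-complete language. The rendering that Word needs is essentially a subset of what PDF can do. In the large-model field, the target format is usually Markdown, which is simpler than Word and is, in effect, a subset of Word.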

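Converting from a subset to its superset is easy, because everything the subset can express, the superset can express too. Converting from a superset to a subset is hard, because the subset lacks many of the superset's features.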

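Parsing a PDF by mapping its elements one-to-one onto the target format is therefore unrealistic. Instead, researchers at Shanghai AI Laboratory proposed using several deep learning models to directly detect and recognize the text, images, formulas, and tables on each page, and then merge the results back into a final Markdown file.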

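In short, PaddleOCR handles text detection and recognition, while TableMaster handles table structure parsing and content integration; together they provide complete recognition and understanding of tables in document images.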

Models used by MinerU

Model	Function	Details
LayoutLMv3	Layout detection	unilm/layoutlmv3 at master · microsoft/unilm (github.com)
UniMERNet	Formula recognition	opendatalab/UniMERNet: UniMERNet: A Universal Network for Real-World Mathematical Expression Recognition (github.com)
StructEqTable	Table recognition	Alpha-Innovator/StructEqTable-Deploy: A High-efficiency Open-source Toolkit for Table-to-Latex Task (github.com)
YOLO	Formula detection	ultralytics/ultralytics: Ultralytics YOLO11 🚀 (github.com)
PaddleOCR	OCR	PaddlePaddle/PaddleOCR: Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices) (github.com)
DocLayout-YOLO	Layout detection	opendatalab/DocLayout-YOLO: DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception (github.com)

Feeding the DeepSeek-V2 paper into MinerU produces the following outputs:

1. images directory
Images extracted from the PDF
2. DeepSeek-AI 等 - 2024 - DeepSeek-V2 A Strong, Economical, and Efficient M.md
The final Markdown output
3. DeepSeek-AI 等 - 2024 - DeepSeek-V2 A Strong, Economical, and Efficient M_content_list.json
Not documented here; presumably the JSON sorted by reading order that the feature list below mentions
4. DeepSeek-AI 等 - 2024 - DeepSeek-V2 A Strong, Economical, and Efficient M_layout.pdf
Layout analysis result
5. DeepSeek-AI 等 - 2024 - DeepSeek-V2 A Strong, Economical, and Efficient M_middle.json
Contains the following fields:

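Field	Description
pdf_info	list; each element is a dict holding the parse result for one PDF page
_parse_type	ocr | txt; identifies the mode used for the intermediate state of this parse
_version_name	string; the magic-pdf version used for this parse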

6. DeepSeek-AI 等 - 2024 - DeepSeek-V2 A Strong, Economical, and Efficient M_model.json
Detection box coordinates for all elements (excerpt):

[
    {
        "layout_dets": [
            {
                "category_id": 1,
                "poly": [193, 793, 1462, 793, 1462, 1354, 193, 1354],
                "score": 0.983
            },
            {
                "category_id": 0,
                "poly": [319, 314, 1340, 314, 1340, 424, 319, 424],
                "score": 0.968
            },
            {
                "category_id": 3,
                "poly": [207, 1410, 1444, 1410, 1444, 1976, 207, 1976],
                "score": 0.966
            },
            ...

7. DeepSeek-AI 等 - 2024 - DeepSeek-V2 A Strong, Economical, and Efficient M_origin.pdf
The original PDF file
8. DeepSeek-AI 等 - 2024 - DeepSeek-V2 A Strong, Economical, and Efficient M_spans.pdf
Visualization of the detection boxes for the different elements
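
The two JSON outputs described above can be inspected with a few lines of Python. The following is a minimal sketch, not part of MinerU: the helper names `summarize_middle`, `poly_to_bbox`, and `confident_boxes`, as well as the 0.9 default threshold, are illustrative assumptions. It also assumes that `poly` lists the four corners as x, y pairs (which fits the excerpt, where the values form an axis-aligned rectangle) and that the top-level list in the model file has one entry per page. The sample data is trimmed from the fields and excerpt shown above, with placeholder values where the source gives none.

```python
import json

def summarize_middle(middle):
    """Return page count, parse mode, and version from a *_middle.json dict."""
    return {
        "pages": len(middle["pdf_info"]),
        "parse_type": middle["_parse_type"],
        "version": middle["_version_name"],
    }

def poly_to_bbox(poly):
    """Collapse an 8-number corner list into (x_min, y_min, x_max, y_max)."""
    xs, ys = poly[0::2], poly[1::2]  # even indices are x, odd indices are y (assumed)
    return (min(xs), min(ys), max(xs), max(ys))

def confident_boxes(model_pages, min_score=0.9):
    """List (page_index, category_id, bbox) for detections scoring >= min_score."""
    return [
        (page_index, det["category_id"], poly_to_bbox(det["poly"]))
        for page_index, page in enumerate(model_pages)  # one entry per page (assumed)
        for det in page["layout_dets"]
        if det["score"] >= min_score
    ]

# Placeholder stand-in for *_middle.json: two empty pages, made-up version string.
sample_middle = {"pdf_info": [{}, {}], "_parse_type": "txt", "_version_name": "0.0.0"}

# First two detections from the *_model.json excerpt above.
sample_model = json.loads("""
[{"layout_dets": [
    {"category_id": 1, "poly": [193, 793, 1462, 793, 1462, 1354, 193, 1354], "score": 0.983},
    {"category_id": 0, "poly": [319, 314, 1340, 314, 1340, 424, 319, 424], "score": 0.968}
]}]
""")

print(summarize_middle(sample_middle))
print(confident_boxes(sample_model, min_score=0.97))
```

To run this against real output, replace the sample data with `json.load(open(path, encoding="utf-8"))` on the corresponding file.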

MinerU Features

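Removes headers, footers, footnotes, page numbers, and similar elements to keep the text semantically coherent
Outputs text in human reading order; works for single-column, multi-column, and complex layouts
Preserves the structure of the original document, including headings, paragraphs, and lists
Extracts images, image captions, tables, table titles, and footnotes
Automatically recognizes formulas in the document and converts them to LaTeX
Automatically recognizes tables in the document and converts them to HTML
Automatically detects scanned PDFs and garbled PDFs and enables OCR for them
OCR supports detection and recognition in 84 languages
Supports multiple output formats, such as multimodal and NLP Markdown, JSON sorted by reading order, and an information-rich intermediate format
Supports multiple visualizations, including layout and span visualization, for efficient checking and quality inspection of the output
Runs in CPU-only environments and supports GPU (CUDA) / NPU (CANN) / MPS acceleration
Compatible with Windows, Linux, and Mac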
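
The output conventions in the list above (structure preserved, formulas as LaTeX, tables as HTML, images extracted to files) can be illustrated with a toy renderer. This is a hypothetical sketch and not MinerU's code: the `Block` type, its `kind` labels, and `render_markdown` are invented for illustration, and the sketch assumes the blocks are already in reading order with headers and footers removed.

```python
from dataclasses import dataclass

@dataclass
class Block:
    kind: str     # hypothetical labels: "title", "text", "formula", "table", "image"
    content: str  # plain text, LaTeX source, HTML table markup, or an image path

def render_block(block):
    """Turn one recognized block into a Markdown fragment."""
    if block.kind == "title":
        return f"# {block.content}"
    if block.kind == "formula":
        return f"$$\n{block.content}\n$$"   # LaTeX wrapped as display math
    if block.kind == "image":
        return f"![]({block.content})"      # link to the extracted image file
    if block.kind in ("text", "table"):
        return block.content                # text and HTML tables pass through
    raise ValueError(f"unknown block kind: {block.kind}")

def render_markdown(blocks):
    """Join fragments with blank lines, keeping the given (reading) order."""
    return "\n\n".join(render_block(b) for b in blocks)

blocks = [
    Block("title", "Example"),
    Block("text", "A paragraph of body text."),
    Block("formula", r"\mathbb{R}^{d_h n_h\times d}"),
    Block("table", "<table><tr><td>1</td></tr></table>"),
    Block("image", "images/figure1.jpg"),
]
print(render_markdown(blocks))
```

Embedding tables as raw HTML works because Markdown allows inline HTML, which lets merged cells survive where a pipe table could not represent them.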

MinerU in Practice

The most impressive part is formula recognition. For example, the input PDF looks like this:
![[Pasted image 20250221100703.png]]

The Markdown output looks like this:
![[Pasted image 20250221100901.png]]
The result is largely correct, but small glitches are fairly common. For example, $\mathbb{R}^{d_h n_h\times d}$ was recognized as $\ × d \mathbb{R}^{d_h n_h\backslash\ \times d}$.
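
Glitches of this kind can be caught with a cheap sanity check before the Markdown is used downstream. The following is a minimal heuristic sketch; the function name `suspicious_math` and its two rules are assumptions made for illustration, not something MinerU provides. It flags inline math spans that contain a raw non-ASCII character such as × or a `\backslash` command, both of which appear in the misrecognized output above.

```python
import re

# Matches inline math delimited by single dollar signs.
# Display math ($$...$$) and literal currency signs are not handled.
INLINE_MATH = re.compile(r"\$(.+?)\$")

def suspicious_math(markdown):
    """Return the bodies of inline math spans that look misrecognized."""
    flagged = []
    for match in INLINE_MATH.finditer(markdown):
        body = match.group(1)
        # Rule 1: LaTeX source is normally ASCII, so a raw '×' is a warning sign.
        # Rule 2: \backslash rarely belongs inside a dimension expression.
        if not body.isascii() or "\\backslash" in body:
            flagged.append(body)
    return flagged

good = r"$\mathbb{R}^{d_h n_h\times d}$"
bad = r"$\ × d \mathbb{R}^{d_h n_h\backslash\ \times d}$"

print(suspicious_math(good))
print(suspicious_math(bad))
```

Flagged spans still need a human look; the check only narrows down where to look.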