MinerU is an all-in-one, open-source intelligent document parsing tool developed by the OpenDataLab team at Shanghai AI Laboratory. It is dedicated to solving the challenges of high-quality structured data extraction for large model training, RAG systems, and knowledge base construction.
I. Core Features (User-Friendly)
- Deep PDF Parsing: Automatically extracts text, tables, images, and formulas (converted to LaTeX), accurately identifies headings, paragraphs, and lists while preserving original layout; supports OCR for scanned PDFs and automatically filters redundant content like headers, footers, and footnotes.
- Multi-Format Compatibility: Supports PDFs, PNG/JPEG images, EPUB/MOBI e-books, and DOCX documents, and extracts clean main content from web pages.
- Multilingual Support: OCR for 109+ languages, ideal for cross-border document processing.
- Structured Output: One-click conversion to Markdown (with multimodal elements), JSON, and HTML. Output follows human reading order, so it can be passed directly to large models.
- Lightweight & Efficient: The 0.9B-parameter model runs smoothly on consumer-grade GPUs, with fast inference and low deployment costs.
- Scientific Data Capability: High-precision extraction of mathematical formulas, chemical molecular structures, and chemical reaction equations for scientific document parsing.
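
To make the "Structured Output" feature concrete, here is a minimal sketch of how parsed elements carrying a reading-order index could be rendered to Markdown. The `Block` schema and the `blocks_to_markdown` helper are hypothetical names invented for this illustration; they are not MinerU's actual output format or API.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One parsed element. Hypothetical schema, for illustration only."""
    kind: str       # "heading", "text", "formula", or "table"
    content: str    # plain text, LaTeX source, or a ready-made Markdown table
    order: int      # position in human reading order
    level: int = 1  # heading depth, used only when kind == "heading"


def blocks_to_markdown(blocks: list[Block]) -> str:
    """Render blocks to Markdown, following reading order."""
    parts = []
    for block in sorted(blocks, key=lambda b: b.order):
        if block.kind == "heading":
            parts.append("#" * block.level + " " + block.content)
        elif block.kind == "formula":
            # Display formulas are wrapped in $$ so Markdown renderers show LaTeX.
            parts.append("$$\n" + block.content + "\n$$")
        else:
            # Plain text and tables are emitted unchanged.
            parts.append(block.content)
    return "\n\n".join(parts)


# Blocks arrive out of order; the renderer restores reading order.
blocks = [
    Block("formula", r"E = mc^2", order=2),
    Block("heading", "Mass-energy equivalence", order=0, level=2),
    Block("text", "Energy and mass are related by:", order=1),
]
markdown = blocks_to_markdown(blocks)
print(markdown)
```

Sorting on an explicit order field keeps the rendering step independent of how the layout model discovered the blocks, which is one plausible way to guarantee reading-order output.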
II. Use Cases
- Large model training data cleaning and structuring
- RAG systems and enterprise knowledge base construction
- Parsing academic papers, research reports, and financial statements
- Batch e-book conversion and web content extraction
- Scanned document digitization and information extraction
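
For the RAG and knowledge base use cases, Markdown output is typically split into chunks before indexing. The sketch below shows one simple approach, splitting on Markdown headings. The `chunk_markdown` function is a hypothetical helper written for this example, not part of MinerU, and it assumes the input contains no `#` lines inside fenced code blocks.

```python
import re

# Matches ATX-style Markdown headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def _has_content(title: str, lines: list[str]) -> bool:
    """A chunk is worth keeping if it has a title or any non-blank line."""
    return bool(title) or any(line.strip() for line in lines)


def chunk_markdown(markdown: str) -> list[dict]:
    """Split Markdown into one chunk per heading-delimited section."""
    chunks = []
    title, lines = "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # A new heading closes the section collected so far.
            if _has_content(title, lines):
                chunks.append({"title": title, "text": "\n".join(lines).strip()})
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    # Flush the final section.
    if _has_content(title, lines):
        chunks.append({"title": title, "text": "\n".join(lines).strip()})
    return chunks


sample = "# Intro\nMinerU parses PDFs.\n\n## Tables\nTables become Markdown.\n"
chunks = chunk_markdown(sample)
for chunk in chunks:
    print(chunk["title"], "->", chunk["text"])
```

Heading-based chunks keep each passage attached to its section title, which tends to help retrieval; a production pipeline would likely also cap chunk length.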
III. Underlying Technologies
- Vision-Language Models (VLM), LayoutLMv3 (layout analysis)
- Custom YOLOv8 (formula detection) + UniMERNet (formula to LaTeX)
- PaddleOCR (multilingual text recognition)
- SGLang (inference optimization), Native-Res ViT (native high-resolution vision encoding)
- Multi-module parsing architecture based on PDF-Extract-Kit
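
The multi-module idea listed above can be sketched as a dispatcher: a layout stage labels page regions, and each region is routed to a recognizer specialized for its type. Everything in this example is a hypothetical stand-in (the region dictionaries, the stub recognizers, and `parse_page`); it does not reproduce the PDF-Extract-Kit or MinerU code, and real recognizers would wrap models such as an OCR engine or a formula-to-LaTeX model.

```python
from typing import Callable


# Stub recognizers standing in for real models (assumption for illustration).
def recognize_text(region: dict) -> str:
    return region["payload"]


def recognize_formula(region: dict) -> str:
    # A real module would convert a formula image to LaTeX.
    return "$" + region["payload"] + "$"


def recognize_table(region: dict) -> str:
    # A real module would reconstruct rows and columns from an image.
    return "| " + " | ".join(region["payload"]) + " |"


RECOGNIZERS: dict[str, Callable[[dict], str]] = {
    "text": recognize_text,
    "formula": recognize_formula,
    "table": recognize_table,
}


def parse_page(regions: list[dict]) -> list[str]:
    """Route each detected region to the recognizer for its type."""
    outputs = []
    for region in regions:
        recognizer = RECOGNIZERS.get(region["type"])
        if recognizer is None:
            # Region types without a recognizer (e.g. page headers) are dropped.
            continue
        outputs.append(recognizer(region))
    return outputs


# Regions as a layout detector might label them, already in reading order.
regions = [
    {"type": "text", "payload": "The loss is"},
    {"type": "formula", "payload": r"\mathcal{L} = -\log p"},
    {"type": "header", "payload": "Page 3"},
    {"type": "table", "payload": ["a", "b"]},
]
result = parse_page(regions)
print(result)
```

Keeping recognizers behind a lookup table means a module can be swapped or added without touching the routing logic, which is the main appeal of a multi-module design.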