# Model Catalog

This document provides a comprehensive overview of all models and inference engines available in Docling, organized by processing stage.

## Overview

Docling's document processing pipeline consists of multiple stages, each using specialized models and inference engines. This catalog helps you understand:

- What stages are available for document processing
- Which model families power each stage
- What specific models you can use
- Which inference engines support each model

## Stages and Models Overview

The following table shows all processing stages in Docling, their model families, and available models.


| Stage | Model Family | Models | Inference Engine | Purpose and Output |
|-------|--------------|--------|------------------|--------------------|
| **Layout**<br>Document structure detection | Object Detection (RT-DETR based) | docling-layout-heron<br>docling-layout-heron-101<br>docling-layout-egret-medium<br>docling-layout-egret-large<br>docling-layout-egret-xlarge<br>docling-layout-v2 (legacy) | Transformers, ONNXRuntime (in progress) | Detects document elements (paragraphs, tables, figures, headers, etc.).<br>Output: Bounding boxes with element labels (TEXT, TABLE, PICTURE, SECTION_HEADER, etc.) |
| **OCR**<br>Text recognition | Multiple OCR Engines | Auto<br>Tesseract (CLI or Python bindings)<br>EasyOCR<br>RapidOCR (ONNX, OpenVINO, PaddlePaddle)<br>macOS Vision (native macOS)<br>SuryaOCR | Engine-specific | Extracts text from images and scanned documents |
| **Table Structure**<br>Table cell recognition | TableFormer | TableFormer (accurate mode)<br>TableFormer (fast mode) | docling-ibm-models | Recognizes table structure (rows, columns, cells) and relationships |
| **Table Structure**<br>Table cell recognition | Vision-Language Model (Granite Vision) | granite-vision-4.1-4b | Transformers | VLM-based table structure recognition using OTSL (Open Table Structure Language) output |
| **Table Structure**<br>Table cell recognition | Object Detection | Work in progress | TBD | Alternative approach for table structure recognition using object detection |
| **Picture Classifier**<br>Image type classification | Image Classifier (Vision Transformer) | DocumentFigureClassifier-v2.5 | Transformers | Classifies pictures into categories (Chart, Diagram, Natural Image, etc.) |
| **VLM Convert**<br>Full page conversion | Vision-Language Models | Granite-Docling-258M ⭐ (DocTags)<br>SmolDocling-256M (DocTags)<br>DeepSeek-OCR-3B (Markdown, API-only)<br>Granite-Vision-3.3-2B (Markdown)<br>Pixtral-12B (Markdown)<br>GOT-OCR-2.0 (Markdown)<br>Phi-4-Multimodal (Markdown)<br>Qwen2.5-VL-3B (Markdown)<br>Nanonets-OCR2-3B (Markdown)<br>Gemma-3-12B/27B (Markdown, MLX-only)<br>Dolphin (Markdown) | Transformers, MLX, API (Ollama, LM Studio, OpenAI), vLLM, AUTO_INLINE | Converts entire document pages to structured formats (DocTags or Markdown).<br>Output Formats: DocTags (structured), Markdown (human-readable) |
| **Picture Description**<br>Image captioning | Vision-Language Models | SmolVLM-256M<br>Granite-Vision-3.3-2B<br>Pixtral-12B<br>Qwen2.5-VL-3B | Transformers, MLX, API (Ollama, LM Studio), vLLM, AUTO_INLINE | Generates natural language descriptions of images and figures |
| **Code & Formula**<br>Code/math extraction | Vision-Language Models | CodeFormulaV2<br>Granite-Docling-258M | Transformers, MLX, AUTO_INLINE | Extracts and recognizes code blocks and mathematical formulas |
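
When choosing a VLM Convert model, the output format is often the first filter. The following is a minimal sketch that copies the VLM Convert entries from the table above into a plain dictionary and filters them by format. The dictionary `VLM_CONVERT_OUTPUT` and the helper `models_with_format` are illustrative names introduced here; they are not part of Docling's API.

```python
# Illustrative data only: copied from the VLM Convert entries in this catalog.
# This is not a Docling data structure.
VLM_CONVERT_OUTPUT = {
    "Granite-Docling-258M": "DocTags",
    "SmolDocling-256M": "DocTags",
    "DeepSeek-OCR-3B": "Markdown",
    "Granite-Vision-3.3-2B": "Markdown",
    "Pixtral-12B": "Markdown",
    "GOT-OCR-2.0": "Markdown",
    "Phi-4-Multimodal": "Markdown",
    "Qwen2.5-VL-3B": "Markdown",
    "Nanonets-OCR2-3B": "Markdown",
    "Gemma-3-12B/27B": "Markdown",
    "Dolphin": "Markdown",
}


def models_with_format(output_format: str) -> list[str]:
    """Return the catalog's VLM Convert models that emit the given format.

    Matching is case-insensitive; order follows the catalog table.
    """
    wanted = output_format.lower()
    return [
        model
        for model, fmt in VLM_CONVERT_OUTPUT.items()
        if fmt.lower() == wanted
    ]


if __name__ == "__main__":
    print(models_with_format("DocTags"))
```

Keeping the catalog as data rather than hard-coding model names in application logic makes it easier to update when new models are added.
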

## Inference Engines by Model Family

### Object Detection Models (Layout)

| Model | Inference Engine | Supported Devices |
|-------|------------------|-------------------|
| All Layout models | docling-ibm-models | CPU, CUDA, MPS, XPU |

**Note:** Layout models use a specialized RT-DETR-based object detection framework from `docling-ibm-models`.

### TableFormer Models (Table Structure)

| Model | Inference Engine | Supported Devices |
|-------|------------------|-------------------|
| TableFormer (fast) | docling-ibm-models | CPU, CUDA, XPU |
| TableFormer (accurate) | docling-ibm-models | CPU, CUDA, XPU |

**Note:** MPS is currently disabled for TableFormer due to performance issues.

### Image Classifier (Picture Classifier)

| Model | Inference Engine | Supported Devices |
|-------|------------------|-------------------|
| DocumentFigureClassifier-v2.5 | Transformers (ViT) | CPU, CUDA, MPS, XPU |

### OCR Engines

| OCR Engine | Backend | Language Support | Notes |
|------------|---------|------------------|-------|
| Tesseract | CLI or tesserocr | 100+ languages | Most widely used, good accuracy |
| EasyOCR | PyTorch | 80+ languages | GPU-accelerated, good for Asian languages |
| RapidOCR | ONNX/OpenVINO/Paddle | Multiple | Fast, multiple backend options |
| macOS Vision | Native macOS | 20+ languages | macOS only, excellent quality |
| SuryaOCR | PyTorch | 90+ languages | Modern, good for complex layouts |
| Auto | Automatic | Varies | Automatically selects best available engine |

### Vision-Language Models (VLM)

#### VLM Convert Stage

| Preset ID | Model | Parameters | Transformers | MLX | API (OpenAI-compatible) | vLLM | Output Format |
|-----------|-------|------------|--------------|-----|-------------------------|------|---------------|
| `granite_docling` | Granite-Docling-258M | 258M | ✅ | ✅ | Ollama | ❌ | DocTags |
| `smoldocling` | SmolDocling-256M | 256M | ✅ | ✅ | ❌ | ❌ | DocTags |
| `deepseek_ocr` | DeepSeek-OCR-3B | 3B | ❌ | ❌ | Ollama<br>LM Studio | ❌ | Markdown |
| `granite_vision` | Granite-Vision-3.3-2B | 2B | ✅ | ❌ | Ollama<br>LM Studio | ✅ | Markdown |
| `pixtral` | Pixtral-12B | 12B | ✅ | ✅ | ❌ | ❌ | Markdown |
| `got_ocr` | GOT-OCR-2.0 | - | ✅ | ❌ | ❌ | ❌ | Markdown |
| `phi4` | Phi-4-Multimodal | - | ✅ | ❌ | ❌ | ✅ | Markdown |
| `qwen` | Qwen2.5-VL-3B | 3B | ✅ | ✅ | ❌ | ❌ | Markdown |
| `nanonets_ocr2` | Nanonets-OCR2-3B | 3B | ✅ | ✅ | OpenAI-compatible<br>LM Studio | ✅ | Markdown |
| `gemma_12b` | Gemma-3-12B | 12B | ❌ | ✅ | ❌ | ❌ | Markdown |
| `gemma_27b` | Gemma-3-27B | 27B | ❌ | ✅ | ❌ | ❌ | Markdown |
| `dolphin` | Dolphin | - | ✅ | ❌ | ❌ | ❌ | Markdown |

`nanonets_ocr2` includes preset API overrides for OpenAI-compatible runtimes and LM Studio, and can also be used with vLLM runtimes.

#### Picture Description Stage

| Preset ID | Model | Parameters | Transformers | MLX | API (OpenAI-compatible) | vLLM |
|-----------|-------|------------|--------------|-----|-------------------------|------|
| `smolvlm` | SmolVLM-256M | 256M | ✅ | ✅ | LM Studio | ❌ |
| `granite_vision` | Granite-Vision-3.3-2B | 2B | ✅ | ❌ | Ollama<br>LM Studio | ✅ |
| `pixtral` | Pixtral-12B | 12B | ✅ | ✅ | ❌ | ❌ |
| `qwen` | Qwen2.5-VL-3B | 3B | ✅ | ✅ | ❌ | ❌ |

#### Code & Formula Stage

| Preset ID | Model | Parameters | Transformers | MLX |
|-----------|-------|------------|--------------|-----|
| `codeformulav2` | CodeFormulaV2 | - | ✅ | ❌ |
| `granite_docling` | Granite-Docling-258M | 258M | ✅ | ✅ |

## Usage Examples

### Layout Detection

```python
from docling.datamodel.pipeline_options import LayoutOptions
from docling.datamodel.layout_model_specs import DOCLING_LAYOUT_HERON

# Use Heron layout model (default)
layout_options = LayoutOptions(model_spec=DOCLING_LAYOUT_HERON)
```

### Table Structure Recognition

```python
from docling.datamodel.pipeline_options import TableStructureOptions, TableFormerMode

# Use accurate mode for best quality
table_options = TableStructureOptions(
    mode=TableFormerMode.ACCURATE,
    do_cell_matching=True
)
```

### Picture Classification

```python
from docling.models.stages.picture_classifier.document_picture_classifier import (
    DocumentPictureClassifierOptions
)

# Use default picture classifier
classifier_options = DocumentPictureClassifierOptions.from_preset("document_figure_classifier_v2")
```

### OCR

```python
from docling.datamodel.pipeline_options import TesseractOcrOptions

# Use Tesseract with English and German
ocr_options = TesseractOcrOptions(lang=["eng", "deu"])
```

### VLM Convert (Full Page)

```python
from docling.datamodel.pipeline_options import VlmConvertOptions

# Use SmolDocling with auto-selected engine
options = VlmConvertOptions.from_preset("smoldocling")

# Or force specific engine
from docling.datamodel.vlm_engine_options import MlxVlmEngineOptions

options = VlmConvertOptions.from_preset(
    "smoldocling",
    engine_options=MlxVlmEngineOptions()
)
```

### Picture Description

```python
from docling.datamodel.pipeline_options import PictureDescriptionVlmOptions

# Use Granite Vision for detailed descriptions
options = PictureDescriptionVlmOptions.from_preset("granite_vision")
```

### Code & Formula Extraction

```python
from docling.datamodel.pipeline_options import CodeFormulaVlmOptions

# Use specialized CodeFormulaV2 model
options = CodeFormulaVlmOptions.from_preset("codeformulav2")
```

## Additional Resources

- [Vision Models Usage Guide](vision_models.md) - VLM-specific documentation
- [Advanced Options](advanced_options.md) - Advanced configuration
- [GPU Support](gpu.md) - GPU acceleration setup
- [Supported Formats](supported_formats.md) - Input format support

## Notes

- **DocTags Format:** Structured XML-like format optimized for document understanding
- **Markdown Format:** Human-readable format for general-purpose conversion
- **Model Updates:** New models are added regularly. Check the codebase for latest additions
- **Engine Compatibility:** Not all engines work on all platforms. AUTO_INLINE handles this automatically
- **Performance:** Actual performance varies by hardware, document complexity, and model size
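
To make the engine-compatibility note concrete, here is a minimal sketch of how a caller might check which local engine a VLM Convert preset can use before building options. The support matrix is copied from the VLM Convert table in this document (Transformers, MLX, and vLLM columns only). The `pick_engine` helper and its preference order are hypothetical; this is not how Docling's AUTO_INLINE selects an engine, and the data may lag behind the codebase.

```python
from typing import Optional

# Local engine support per VLM Convert preset, copied from this catalog.
# Illustrative data only; API-based runtimes (Ollama, LM Studio) are omitted.
VLM_CONVERT_ENGINES = {
    "granite_docling": {"transformers", "mlx"},
    "smoldocling": {"transformers", "mlx"},
    "deepseek_ocr": set(),  # listed as API-only in the catalog
    "granite_vision": {"transformers", "vllm"},
    "pixtral": {"transformers", "mlx"},
    "got_ocr": {"transformers"},
    "phi4": {"transformers", "vllm"},
    "qwen": {"transformers", "mlx"},
    "nanonets_ocr2": {"transformers", "mlx", "vllm"},
    "gemma_12b": {"mlx"},
    "gemma_27b": {"mlx"},
    "dolphin": {"transformers"},
}


def pick_engine(
    preset: str,
    installed: set[str],
    preference: tuple[str, ...] = ("mlx", "vllm", "transformers"),
) -> Optional[str]:
    """Return the first preferred engine that the preset supports and that
    is installed locally, or None if there is no match.

    Hypothetical helper: the preference order is an assumption of this sketch.
    """
    supported = VLM_CONVERT_ENGINES[preset]
    for engine in preference:
        if engine in supported and engine in installed:
            return engine
    return None


if __name__ == "__main__":
    # Example: a machine with only Transformers and MLX available.
    for preset in ("smoldocling", "phi4", "deepseek_ocr"):
        print(preset, pick_engine(preset, {"transformers", "mlx"}))
```

A `None` result signals that the preset needs either a different local engine or an API-based runtime, which is a useful check to run before downloading model weights.
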