# Model Catalog

This document provides a comprehensive overview of all models and inference engines available in Docling, organized by processing stage.

## Overview

Docling's document processing pipeline consists of multiple stages, each using specialized models and inference engines. This catalog helps you understand:

- What stages are available for document processing
- Which model families power each stage
- What specific models you can use
- Which inference engines support each model

## Stages and Models Overview

The following table shows all processing stages in Docling, their model families, and available models.

| Stage | Model Family | Models | Inference Engine(s) | Purpose and Output |
|-------|--------------|--------|---------------------|--------------------|
| **Layout**<br>Document structure detection | Object Detection (RT-DETR based) | `docling-layout-heron` ⭐<br>`docling-layout-heron-101`<br>`docling-layout-egret-medium`<br>`docling-layout-egret-large`<br>`docling-layout-egret-xlarge`<br>`docling-layout-v2` (legacy) | Transformers, ONNXRuntime (in progress) | Detects document elements (paragraphs, tables, figures, headers, etc.).<br>Output: Bounding boxes with element labels (TEXT, TABLE, PICTURE, SECTION_HEADER, etc.) |
| **OCR**<br>Text recognition | Multiple OCR Engines | Auto ⭐<br>Tesseract (CLI or Python bindings)<br>EasyOCR<br>RapidOCR (ONNX, OpenVINO, PaddlePaddle)<br>macOS Vision (native macOS)<br>SuryaOCR | Engine-specific | Extracts text from images and scanned documents |
| **Table Structure**<br>Table cell recognition | TableFormer | `TableFormer (accurate mode)` ⭐<br>`TableFormer (fast mode)` | docling-ibm-models | Recognizes table structure (rows, columns, cells) and relationships |
| **Table Structure**<br>Table cell recognition | Vision-Language Model (Granite Vision) | `granite-vision-4.1-4b` | Transformers | VLM-based table structure recognition using OTSL (Open Table Structure Language) output |
| **Table Structure**<br>Table cell recognition | Object Detection | Work in progress | TBD | Alternative approach for table structure recognition using object detection |
| **Picture Classifier**<br>Image type classification | Image Classifier (Vision Transformer) | `DocumentFigureClassifier-v2.5` ⭐ | Transformers | Classifies pictures into categories (Chart, Diagram, Natural Image, etc.) |
| **VLM Convert**<br>Full page conversion | Vision-Language Models | Granite-Docling-258M ⭐ (DocTags)<br>SmolDocling-256M (DocTags)<br>DeepSeek-OCR-3B (Markdown, API-only)<br>Granite-Vision-3.3-2B (Markdown)<br>Pixtral-12B (Markdown)<br>GOT-OCR-2.0 (Markdown)<br>Phi-4-Multimodal (Markdown)<br>Qwen2.5-VL-3B (Markdown)<br>Nanonets-OCR2-3B (Markdown)<br>Gemma-3-12B/27B (Markdown, MLX-only)<br>Dolphin (Markdown) | Transformers, MLX, API (Ollama, LM Studio, OpenAI), vLLM, AUTO_INLINE | Converts entire document pages to structured formats (DocTags or Markdown).<br>Output Formats: DocTags (structured), Markdown (human-readable) |
| **Picture Description**<br>Image captioning | Vision-Language Models | SmolVLM-256M ⭐<br>Granite-Vision-3.3-2B<br>Pixtral-12B<br>Qwen2.5-VL-3B | Transformers, MLX, API (Ollama, LM Studio), vLLM, AUTO_INLINE | Generates natural language descriptions of images and figures |
| **Code & Formula**<br>Code/math extraction | Vision-Language Models | CodeFormulaV2 ⭐<br>Granite-Docling-258M | Transformers, MLX, AUTO_INLINE | Extracts and recognizes code blocks and mathematical formulas |
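
As a quick reference, the ⭐ entries in the table above can be captured in a plain lookup. This is only an illustrative sketch: it assumes ⭐ marks each stage's default or recommended model, and `DEFAULT_MODELS` and `default_model_for` are hypothetical names that are not part of Docling's API. The values are copied from the table, not read from the library.

```python
# Hypothetical helper, not part of Docling.
# Assumption: the ⭐ marker in the stage table denotes the default model.
DEFAULT_MODELS = {
    "layout": "docling-layout-heron",
    "ocr": "Auto",
    "table_structure": "TableFormer (accurate mode)",
    "picture_classifier": "DocumentFigureClassifier-v2.5",
    "vlm_convert": "Granite-Docling-258M",
    "picture_description": "SmolVLM-256M",
    "code_formula": "CodeFormulaV2",
}


def default_model_for(stage: str) -> str:
    """Return the starred model for a stage, e.g. 'Table Structure'."""
    # Normalize "Table Structure" -> "table_structure" before the lookup.
    key = stage.strip().lower().replace(" ", "_")
    if key not in DEFAULT_MODELS:
        raise KeyError(
            f"Unknown stage {stage!r}; expected one of {sorted(DEFAULT_MODELS)}"
        )
    return DEFAULT_MODELS[key]


print(default_model_for("Table Structure"))
```

A lookup like this has to be updated by hand whenever the catalog changes, so treat it as a reading aid and not as a source of truth.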

## Inference Engines by Model Family

### Object Detection Models (Layout)

| Model | Inference Engine | Supported Devices |
|-------|------------------|-------------------|
| All Layout models | docling-ibm-models | CPU, CUDA, MPS, XPU |

**Note:** Layout models use a specialized RT-DETR-based object detection framework from `docling-ibm-models`.

### TableFormer Models (Table Structure)

| Model | Inference Engine | Supported Devices |
|-------|------------------|-------------------|
| TableFormer (fast) | docling-ibm-models | CPU, CUDA, XPU |
| TableFormer (accurate) | docling-ibm-models | CPU, CUDA, XPU |

**Note:** MPS is currently disabled for TableFormer due to performance issues.

### Image Classifier (Picture Classifier)

| Model | Inference Engine | Supported Devices |
|-------|------------------|-------------------|
| DocumentFigureClassifier-v2.5 | Transformers (ViT) | CPU, CUDA, MPS, XPU |

### OCR Engines

| OCR Engine | Backend | Language Support | Notes |
|------------|---------|------------------|-------|
| Tesseract | CLI or tesserocr | 100+ languages | Most widely used, good accuracy |
| EasyOCR | PyTorch | 80+ languages | GPU-accelerated, good for Asian languages |
| RapidOCR | ONNX/OpenVINO/Paddle | Multiple | Fast, multiple backend options |
| macOS Vision | Native macOS | 20+ languages | macOS only, excellent quality |
| SuryaOCR | PyTorch | 90+ languages | Modern, good for complex layouts |
| Auto | Automatic | Varies | Automatically selects best available engine |

### Vision-Language Models (VLM)

#### VLM Convert Stage

| Preset ID | Model | Parameters | Transformers | MLX | API (OpenAI-compatible) | vLLM | Output Format |
|-----------|-------|------------|--------------|-----|-------------------------|------|---------------|
| `granite_docling` | Granite-Docling-258M | 258M | ✅ | ✅ | Ollama | ❌ | DocTags |
| `smoldocling` | SmolDocling-256M | 256M | ✅ | ✅ | ❌ | ❌ | DocTags |
| `deepseek_ocr` | DeepSeek-OCR-3B | 3B | ❌ | ❌ | Ollama<br>LM Studio | ❌ | Markdown |
| `granite_vision` | Granite-Vision-3.3-2B | 2B | ✅ | ❌ | Ollama<br>LM Studio | ✅ | Markdown |
| `pixtral` | Pixtral-12B | 12B | ✅ | ✅ | ❌ | ❌ | Markdown |
| `got_ocr` | GOT-OCR-2.0 | - | ✅ | ❌ | ❌ | ❌ | Markdown |
| `phi4` | Phi-4-Multimodal | - | ✅ | ❌ | ❌ | ✅ | Markdown |
| `qwen` | Qwen2.5-VL-3B | 3B | ✅ | ✅ | ❌ | ❌ | Markdown |
| `nanonets_ocr2` | Nanonets-OCR2-3B | 3B | ✅ | ✅ | OpenAI-compatible<br>LM Studio | ✅ | Markdown |
| `gemma_12b` | Gemma-3-12B | 12B | ❌ | ✅ | ❌ | ❌ | Markdown |
| `gemma_27b` | Gemma-3-27B | 27B | ❌ | ✅ | ❌ | ❌ | Markdown |
| `dolphin` | Dolphin | - | ✅ | ❌ | ❌ | ❌ | Markdown |

`nanonets_ocr2` includes preset API overrides for OpenAI-compatible runtimes and LM Studio, and can also be used with vLLM runtimes.

#### Picture Description Stage

| Preset ID | Model | Parameters | Transformers | MLX | API (OpenAI-compatible) | vLLM |
|-----------|-------|------------|--------------|-----|-------------------------|------|
| `smolvlm` | SmolVLM-256M | 256M | ✅ | ✅ | LM Studio | ❌ |
| `granite_vision` | Granite-Vision-3.3-2B | 2B | ✅ | ❌ | Ollama<br>LM Studio | ✅ |
| `pixtral` | Pixtral-12B | 12B | ✅ | ✅ | ❌ | ❌ |
| `qwen` | Qwen2.5-VL-3B | 3B | ✅ | ✅ | ❌ | ❌ |

#### Code & Formula Stage

| Preset ID | Model | Parameters | Transformers | MLX |
|-----------|-------|------------|--------------|-----|
| `codeformulav2` | CodeFormulaV2 | - | ✅ | ❌ |
| `granite_docling` | Granite-Docling-258M | 258M | ✅ | ✅ |

## Usage Examples

### Layout Detection

```python
from docling.datamodel.pipeline_options import LayoutOptions
from docling.datamodel.layout_model_specs import DOCLING_LAYOUT_HERON

# Use Heron layout model (default)
layout_options = LayoutOptions(model_spec=DOCLING_LAYOUT_HERON)
```

### Table Structure Recognition

```python
from docling.datamodel.pipeline_options import TableStructureOptions, TableFormerMode

# Use accurate mode for best quality
table_options = TableStructureOptions(
    mode=TableFormerMode.ACCURATE,
    do_cell_matching=True
)
```

### Picture Classification

```python
from docling.models.stages.picture_classifier.document_picture_classifier import (
    DocumentPictureClassifierOptions
)

# Use default picture classifier
classifier_options = DocumentPictureClassifierOptions.from_preset("document_figure_classifier_v2")
```

### OCR

```python
from docling.datamodel.pipeline_options import TesseractOcrOptions

# Use Tesseract with English and German
ocr_options = TesseractOcrOptions(lang=["eng", "deu"])
```

### VLM Convert (Full Page)

```python
from docling.datamodel.pipeline_options import VlmConvertOptions

# Use SmolDocling with auto-selected engine
options = VlmConvertOptions.from_preset("smoldocling")

# Or force specific engine
from docling.datamodel.vlm_engine_options import MlxVlmEngineOptions

options = VlmConvertOptions.from_preset(
    "smoldocling",
    engine_options=MlxVlmEngineOptions()
)
```

### Picture Description

```python
from docling.datamodel.pipeline_options import PictureDescriptionVlmOptions

# Use Granite Vision for detailed descriptions
options = PictureDescriptionVlmOptions.from_preset("granite_vision")
```

### Code & Formula Extraction

```python
from docling.datamodel.pipeline_options import CodeFormulaVlmOptions

# Use specialized CodeFormulaV2 model
options = CodeFormulaVlmOptions.from_preset("codeformulav2")
```

## Additional Resources

- [Vision Models Usage Guide](vision_models.md) - VLM-specific documentation
- [Advanced Options](advanced_options.md) - Advanced configuration
- [GPU Support](gpu.md) - GPU acceleration setup
- [Supported Formats](supported_formats.md) - Input format support

## Notes

- **DocTags Format:** Structured XML-like format optimized for document understanding
- **Markdown Format:** Human-readable format for general-purpose conversion
- **Model Updates:** New models are added regularly. Check the codebase for latest additions
- **Engine Compatibility:** Not all engines work on all platforms. AUTO_INLINE handles this automatically
- **Performance:** Actual performance varies by hardware, document complexity, and model size
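
To make the engine compatibility note concrete, the VLM Convert Stage table from this catalog can be transcribed into a small lookup that answers "which presets can run on engine X?". This is a hedged sketch and not Docling code: `VLM_CONVERT_SUPPORT` and `presets_supporting` are hypothetical names, the data is copied by hand from the table, and any API entry (Ollama, LM Studio, OpenAI-compatible) is collapsed into a single `"api"` label.

```python
# Hypothetical lookup, not part of Docling. Data transcribed from the
# VLM Convert Stage table; every API runtime is collapsed into "api".
VLM_CONVERT_SUPPORT = {
    "granite_docling": {"transformers", "mlx", "api"},
    "smoldocling": {"transformers", "mlx"},
    "deepseek_ocr": {"api"},
    "granite_vision": {"transformers", "api", "vllm"},
    "pixtral": {"transformers", "mlx"},
    "got_ocr": {"transformers"},
    "phi4": {"transformers", "vllm"},
    "qwen": {"transformers", "mlx"},
    "nanonets_ocr2": {"transformers", "mlx", "api", "vllm"},
    "gemma_12b": {"mlx"},
    "gemma_27b": {"mlx"},
    "dolphin": {"transformers"},
}


def presets_supporting(engine: str) -> list[str]:
    """Return preset IDs, sorted alphabetically, that list the given engine."""
    wanted = engine.strip().lower()
    return sorted(
        preset
        for preset, engines in VLM_CONVERT_SUPPORT.items()
        if wanted in engines
    )


print(presets_supporting("vllm"))  # ['granite_vision', 'nanonets_ocr2', 'phi4']
```

A static table like this only reflects the catalog at the time it was copied. For actual runs, letting AUTO_INLINE pick the engine avoids hard-coding platform assumptions.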