# Model Catalog

This document provides a comprehensive overview of all models and inference engines available in Docling, organized by processing stage.

## Overview

Docling's document processing pipeline consists of multiple stages, each using specialized models and inference engines. This catalog helps you understand:

- What stages are available for document processing
- Which model families power each stage
- What specific models you can use
- Which inference engines support each model

## Stages and Models Overview

The following table shows all processing stages in Docling, their model families, and available models.

| Stage | Model Family | Models |
|---|---|---|
| Layout: Document structure detection | Object Detection (RT-DETR based) | |
| Inference Engine: Transformers, ONNXRuntime (in progress) | | |
| Purpose: Detects document elements (paragraphs, tables, figures, headers, etc.) | | |
| Output: Bounding boxes with element labels (TEXT, TABLE, PICTURE, SECTION_HEADER, etc.) | | |
| OCR: Text recognition | Multiple OCR Engines | |
| Inference Engines: Engine-specific | | |
| Purpose: Extracts text from images and scanned documents | | |
| Table Structure: Table cell recognition | TableFormer | |
| Inference Engine: docling-ibm-models | | |
| Purpose: Recognizes table structure (rows, columns, cells) and relationships | | |
| Table Structure: Table cell recognition | Vision-Language Model (Granite Vision) | |
| Inference Engine: Transformers | | |
| Purpose: VLM-based table structure recognition using OTSL (Open Table Structure Language) output | | |
| Table Structure: Table cell recognition | Object Detection | |
| Inference Engine: TBD | | |
| Purpose: Alternative approach for table structure recognition using object detection | | |
| Picture Classifier: Image type classification | Image Classifier (Vision Transformer) | |
| Inference Engine: Transformers | | |
| Purpose: Classifies pictures into categories (Chart, Diagram, Natural Image, etc.) | | |
| VLM Convert: Full page conversion | Vision-Language Models | |
| Inference Engines: Transformers, MLX, API (Ollama, LM Studio, OpenAI), vLLM, AUTO_INLINE | | |
| Purpose: Converts entire document pages to structured formats (DocTags or Markdown) | | |
| Output Formats: DocTags (structured), Markdown (human-readable) | | |
| Picture Description: Image captioning | Vision-Language Models | |
| Inference Engines: Transformers, MLX, API (Ollama, LM Studio), vLLM, AUTO_INLINE | | |
| Purpose: Generates natural language descriptions of images and figures | | |
| Code & Formula: Code/math extraction | Vision-Language Models | |
| Inference Engines: Transformers, MLX, AUTO_INLINE | | |
| Purpose: Extracts and recognizes code blocks and mathematical formulas | | |
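
To answer the question "which stages can run on a given inference engine?", the stage-to-engine mapping in the table above can be transcribed into a small lookup structure. The sketch below is only an illustration of that idea: `CatalogEntry`, `CATALOG`, and `stages_supporting` are hypothetical names introduced here and are not part of Docling's API. It assumes the engine lists exactly as they appear in the table, with the API backends (Ollama, LM Studio, OpenAI) collapsed into a single `"API"` label.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """One stage/model-family row of the table (hypothetical helper, not a Docling type)."""

    stage: str
    model_family: str
    engines: tuple[str, ...]  # empty when the table says "Engine-specific" or "TBD"


# Transcribed by hand from the table above. ONNXRuntime is marked "in progress"
# for Layout in the table; it is kept here only to mirror that row.
CATALOG: tuple[CatalogEntry, ...] = (
    CatalogEntry("Layout", "Object Detection (RT-DETR based)",
                 ("Transformers", "ONNXRuntime")),
    CatalogEntry("OCR", "Multiple OCR Engines", ()),  # engine-specific
    CatalogEntry("Table Structure", "TableFormer", ("docling-ibm-models",)),
    CatalogEntry("Table Structure", "Vision-Language Model (Granite Vision)",
                 ("Transformers",)),
    CatalogEntry("Table Structure", "Object Detection", ()),  # engine TBD
    CatalogEntry("Picture Classifier", "Image Classifier (Vision Transformer)",
                 ("Transformers",)),
    CatalogEntry("VLM Convert", "Vision-Language Models",
                 ("Transformers", "MLX", "API", "vLLM", "AUTO_INLINE")),
    CatalogEntry("Picture Description", "Vision-Language Models",
                 ("Transformers", "MLX", "API", "vLLM", "AUTO_INLINE")),
    CatalogEntry("Code & Formula", "Vision-Language Models",
                 ("Transformers", "MLX", "AUTO_INLINE")),
)


def stages_supporting(engine: str) -> list[str]:
    """Return the stages that list `engine`, in table order and without duplicates."""
    stages: list[str] = []
    for entry in CATALOG:
        if engine in entry.engines and entry.stage not in stages:
            stages.append(entry.stage)
    return stages


if __name__ == "__main__":
    print(stages_supporting("MLX"))
    # ['VLM Convert', 'Picture Description', 'Code & Formula']
```

A flat tuple of rows is used rather than a nested dictionary because one stage (Table Structure) appears with several model families, and a per-row record keeps that one-to-many relationship explicit.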