feat: add threaded docling-parse (v6) PDF backend (#3377)

* feat: add threaded PDF backend consuming DoclingThreadedPdfParser

Introduces ThreadedDoclingParseDocumentBackend and
ThreadedDoclingParsePageBackend as a new PDF backend that drives
docling-parse's threaded API directly.  The matching StandardPdfPipeline
runs page parsing, OCR, layout, table, and assembly in concurrent
pipeline stages with a dedicated producer thread.

Backend contract additions:
- PdfPageBackend.page_no abstract property (all existing backends
  updated: pypdfium2, image, mets-gbs)
- PdfDocumentBackend.iter_pages() default via load_page(); threaded
  backend overrides to yield in completion order
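
The contract additions above can be sketched as follows. This is a minimal
illustration, not the actual docling classes: PageBackendSketch,
DocumentBackendSketch, SequentialDocument, and CompletionOrderDocument are
hypothetical names, and the 0-based page index is an assumption made only
for this example.

```python
from abc import ABC, abstractmethod
from typing import Iterator, List


class PageBackendSketch(ABC):
    """Hypothetical stand-in for a page backend exposing page_no."""

    @property
    @abstractmethod
    def page_no(self) -> int:
        """Page index of this backend (0-based in this sketch)."""


class SimplePage(PageBackendSketch):
    def __init__(self, page_no: int) -> None:
        self._page_no = page_no

    @property
    def page_no(self) -> int:
        return self._page_no


class DocumentBackendSketch(ABC):
    """Hypothetical stand-in for a document backend."""

    @abstractmethod
    def page_count(self) -> int: ...

    @abstractmethod
    def load_page(self, page_no: int) -> PageBackendSketch: ...

    def iter_pages(self) -> Iterator[PageBackendSketch]:
        # Default: walk the pages in document order via load_page().
        for page_no in range(self.page_count()):
            yield self.load_page(page_no)


class SequentialDocument(DocumentBackendSketch):
    def __init__(self, n_pages: int) -> None:
        self._n_pages = n_pages

    def page_count(self) -> int:
        return self._n_pages

    def load_page(self, page_no: int) -> PageBackendSketch:
        return SimplePage(page_no)


class CompletionOrderDocument(SequentialDocument):
    """Overrides iter_pages() to yield pages as they finish parsing."""

    def __init__(self, completion_order: List[int]) -> None:
        super().__init__(len(completion_order))
        # Assumed to be reported by a parser; fixed here for illustration.
        self._completion_order = list(completion_order)

    def iter_pages(self) -> Iterator[PageBackendSketch]:
        for page_no in self._completion_order:
            yield SimplePage(page_no)
```

Because consumers can no longer rely on iteration order, they read page_no
from each yielded backend instead of counting.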

ThreadedDoclingParseDocumentBackend specifics:
- Constructs one DoclingThreadedPdfParser per instance with fixed
  decode and render config; no pypdfium2 dependency
- Passes page_numbers=None for the default (all-pages) case; explicit
  range list only when the caller specifies a finite page_range
- page_count() delegates to parser.page_count(doc_key)
- Coordinate conversion (bottom-left → top-left) applied once in
  get_segmented_page() and cached; all downstream consumers (layout
  postprocessor, OCR merge, assembly) receive top-left cells
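
A rough sketch of the convert-once-and-cache idea for the coordinate
conversion, under stated assumptions: Box, to_top_left, and
CachedPageSketch are hypothetical names for illustration, and the real
backend works on docling's own cell and page types rather than these.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Box:
    left: float
    top: float
    right: float
    bottom: float


def to_top_left(box: Box, page_height: float) -> Box:
    # With a bottom-left origin, y grows upward, so each y value is
    # mirrored against the page height to get a top-left origin.
    return Box(
        left=box.left,
        top=page_height - box.top,
        right=box.right,
        bottom=page_height - box.bottom,
    )


class CachedPageSketch:
    """Hypothetical page that converts its cells once and reuses the result."""

    def __init__(self, page_height: float, raw_cells: List[Box]) -> None:
        self._page_height = page_height
        self._raw_cells = list(raw_cells)  # bottom-left origin
        self._converted: Optional[List[Box]] = None
        self.conversions = 0  # counts how often the conversion ran

    def get_segmented_page(self) -> List[Box]:
        if self._converted is None:
            self._converted = [
                to_top_left(cell, self._page_height) for cell in self._raw_cells
            ]
            self.conversions += 1
        return self._converted
```

Every later call returns the cached top-left cells, so downstream
consumers never see a mix of origins.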

Pipeline (StandardPdfPipeline / PreprocessThreadedStage):
- Producer thread attaches page backends from iter_pages() to ordered
  page stubs by page_no, then enqueues ThreadedItems
- Invalid page backends are separated before model calls so they cannot
  be double-emitted if the preprocessing model raises
- Timeout and early-termination accounting tracks by page-number sets
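
The producer and invalid-page handling described above might look roughly
like this. It is a simplified sketch, not the StandardPdfPipeline code:
BackendSketch, PageStub, produce, drain, and preprocess_batch are
hypothetical names, and the real pipeline enqueues ThreadedItems rather
than bare stubs.

```python
import queue
import threading
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class BackendSketch:
    page_no: int
    valid: bool = True


@dataclass
class PageStub:
    page_no: int
    backend: Optional[BackendSketch] = None


_DONE = object()  # sentinel marking the end of the stream


def produce(stubs: List[PageStub], backends: Iterable[BackendSketch],
            out: "queue.Queue") -> None:
    # Backends may arrive in any order; match them to stubs by page_no.
    by_page_no = {stub.page_no: stub for stub in stubs}
    for backend in backends:
        stub = by_page_no[backend.page_no]
        stub.backend = backend
        out.put(stub)
    out.put(_DONE)


def drain(out: "queue.Queue") -> List[PageStub]:
    items: List[PageStub] = []
    while True:
        item = out.get()
        if item is _DONE:
            return items
        items.append(item)


def preprocess_batch(stubs: List[PageStub],
                     model: Callable[[List[PageStub]], None]
                     ) -> Tuple[List[int], List[int]]:
    # Split off invalid pages first so each page is reported exactly once,
    # even if the model call below raises.
    invalid = [s.page_no for s in stubs if s.backend is None or not s.backend.valid]
    valid = [s for s in stubs if s.backend is not None and s.backend.valid]
    try:
        model(valid)
    except Exception:
        return [], invalid + [s.page_no for s in valid]
    return [s.page_no for s in valid], invalid


def run_demo(model: Callable[[List[PageStub]], None]) -> Tuple[List[int], List[int]]:
    stubs = [PageStub(n) for n in (1, 2, 3)]
    # Completion order differs from document order; page 1 is invalid.
    backends = [BackendSketch(3), BackendSketch(1, valid=False), BackendSketch(2)]
    out: "queue.Queue" = queue.Queue()
    producer = threading.Thread(target=produce, args=(stubs, backends, out))
    producer.start()
    batch = drain(out)
    producer.join()
    return preprocess_batch(batch, model)


def failing_model(pages: List[PageStub]) -> None:
    raise RuntimeError("model failure")
```

run_demo returns a pair of page-number lists (succeeded, failed); with
failing_model every page lands in the failed list exactly once.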

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Adjust tests and thread count source

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add threaded parse to CLI, make RGB images

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Improve threaded docling-parse backend integration

- add dedicated threaded docling-parse backend options and wire CLI
  num_threads into parser_threads
- make the threaded backend honor parser_threads, falling back to
  AcceleratorOptions only when unset
- resolve threaded page ranges explicitly and clip open-ended requests
  against the actual document length
- cache page sizes in StandardPdfPipeline so failed-page recovery does
  not call load_page() on iterator-only threaded backends
- reject threaded docling-parse in VLM pipelines that still require
  ordered/random load_page() access
- extend backend, CLI, and compatibility tests for the new threaded
  backend behavior
- update the editable docling-parse lock entry
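
The thread-count fallback and page-range clipping above can be illustrated
as below. This is a sketch under assumptions: resolve_threads and
resolve_page_numbers are hypothetical helpers, and the 1-based inclusive
range with sys.maxsize as the open end is assumed for the example.

```python
import sys
from typing import List, Optional, Tuple


def resolve_threads(parser_threads: Optional[int], accelerator_threads: int) -> int:
    # An explicit parser_threads wins; otherwise fall back to the
    # accelerator-level thread count.
    return parser_threads if parser_threads is not None else accelerator_threads


def resolve_page_numbers(page_range: Tuple[int, int], page_count: int) -> List[int]:
    # Clip a 1-based inclusive range against the real document length so
    # that an open-ended request never names pages that do not exist.
    start, end = page_range
    first = max(start, 1)
    last = min(end, page_count)
    return list(range(first, last + 1))
```

For a five-page document, resolve_page_numbers((4, sys.maxsize), 5) gives
[4, 5], and a range starting past the last page gives an empty list.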

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* allow more lines

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Address concerns with non-threaded PDF backend behaviour changes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix nonsense test

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* adding the  feature

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* adding performance measuring scripts

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* adding evaluation performance scripts

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* upgrading to docling-parse 6.1.0

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* upgraded to docling-parse >=6.1

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* backend: disable bitmap byte materialization in docling-parse backends

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Updates to iterate_pdf_pages script, lock update

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Correct decode config

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* pinned to docling-parse v6.2.0

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* clean up pyproject.toml

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
3c26f5a3a8a5904e45848bc1a9e43105fdeba3e3
Parent: 3a51932
Committed by GitHub <noreply@github.com> on 5/28/2026, 10:38:49 AM