feat: add threaded docling-parse (v6) PDF backend (#3377)
* feat: add threaded PDF backend consuming DoclingThreadedPdfParser

  Introduces ThreadedDoclingParseDocumentBackend and ThreadedDoclingParsePageBackend as a new PDF backend that drives docling-parse's threaded API directly. The matching StandardPdfPipeline runs page parsing, OCR, layout, table, and assembly in concurrent pipeline stages with a dedicated producer thread.

  Backend contract additions:
  - PdfPageBackend.page_no abstract property (all existing backends updated: pypdfium2, image, mets-gbs)
  - PdfDocumentBackend.iter_pages() default via load_page(); threaded backend overrides to yield in completion order

  ThreadedDoclingParseDocumentBackend specifics:
  - Constructs one DoclingThreadedPdfParser per instance with fixed decode and render config; no pypdfium2 dependency
  - Passes page_numbers=None for the default (all-pages) case; explicit range list only when the caller specifies a finite page_range
  - page_count() delegates to parser.page_count(doc_key)
  - Coordinate conversion (bottom-left → top-left) applied once in get_segmented_page() and cached; all downstream consumers (layout postprocessor, OCR merge, assembly) receive top-left cells

  Pipeline (StandardPdfPipeline / PreprocessThreadedStage):
  - Producer thread attaches page backends from iter_pages() to ordered page stubs by page_no, then enqueues ThreadedItems
  - Invalid page backends are separated before model calls so they cannot be double-emitted if the preprocessing model raises
  - Timeout and early-termination accounting tracks by page-number sets

  Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Adjust tests and thread count source

  Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add threaded parse to CLI, make RGB images

  Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Improve threaded docling-parse backend integration

  - add dedicated threaded docling-parse backend options and wire CLI num_threads into parser_threads
  - make the threaded backend honor parser_threads, falling back to AcceleratorOptions only when unset
  - resolve threaded page ranges explicitly and clip open-ended requests against the actual document length
  - cache page sizes in StandardPdfPipeline so failed-page recovery does not call load_page() on iterator-only threaded backends
  - reject threaded docling-parse in VLM pipelines that still require ordered/random load_page() access
  - extend backend, CLI, and compatibility tests for the new threaded backend behavior
  - update the editable docling-parse lock entry

  Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* allow more lines

  Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Address concerns with non-threaded PDF backend behaviour changes

  Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix nonsense test

  Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* adding the feature

  Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* adding performance measuring scripts

  Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* adding evaluation performance scripts

  Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* upgrading to docling-parse of 6.1.0

  Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* upgraded to docling-parse >=6.1

  Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* backend: disable bitmap byte materialization in docling-parse backends

  Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Updates to iterate_pdf_pages script, lock update

  Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Correct decode config

  Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* pinned to docling-parse v6.2.0

  Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* clean up pyproject.toml

  Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
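
Illustrative sketch (not part of the commit): the commit message above describes three mechanics that are easy to get wrong, namely clipping a requested page range against the actual document length, flipping cell coordinates from a bottom-left to a top-left origin, and attaching page backends that arrive in completion order to ordered page stubs by page_no. The Python below is a minimal sketch of those ideas only. All function names, the 1-based page numbering, and the (left, bottom, right, top) box layout are assumptions made for illustration; none of this is taken from the docling or docling-parse code.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical box layout: (left, bottom, right, top) with a bottom-left origin.
Box = Tuple[float, float, float, float]


def resolve_page_range(page_range: Tuple[int, int], page_count: int) -> List[int]:
    """Clip a 1-based inclusive (start, end) request to the document length.

    An open-ended request such as (1, 10**9) becomes the explicit list of
    pages that actually exist. A request entirely past the end yields [].
    """
    start, end = page_range
    start = max(1, start)
    end = min(end, page_count)
    return list(range(start, end + 1))


def to_top_left(box: Box, page_height: float) -> Box:
    """Convert a bottom-left-origin box to top-left origin.

    The x coordinates are unchanged; each y is mirrored against the page
    height, so the old top edge becomes the smaller y value.
    Returned layout: (left, top, right, bottom) measured from the top.
    """
    left, bottom, right, top = box
    return (left, page_height - top, right, page_height - bottom)


def attach_by_page_no(
    stub_page_nos: Iterable[int],
    completed: Iterable[Tuple[int, str]],
) -> Dict[int, Optional[str]]:
    """Attach backends that finish in arbitrary order to ordered page stubs.

    `completed` yields (page_no, backend) pairs in completion order; the
    result keeps the stub order and leaves unmatched stubs as None.
    Backends are plain strings here purely as stand-ins.
    """
    slots: Dict[int, Optional[str]] = {n: None for n in stub_page_nos}
    for page_no, backend in completed:
        if page_no in slots:  # ignore pages nobody asked for
            slots[page_no] = backend
    return slots
```

Keying by page number rather than by arrival position is what lets a consumer tolerate out-of-order completion: the stub order fixes the output order, and a page that never arrives stays visibly empty instead of shifting its neighbours.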
Christoph Auer committed
3c26f5a3a8a5904e45848bc1a9e43105fdeba3e3
Parent: 3a51932
Committed by GitHub <noreply@github.com>
on 5/28/2026, 10:38:49 AM