fix(pdf): propagate hyperlinks to DoclingDocument text items (#3131)
* fix(pdf): propagate hyperlinks to DoclingDocument text items

docling-parse already extracts PdfHyperlink objects with bounding rectangles and URIs into SegmentedPdfPage.hyperlinks, and TextItem already has a hyperlink field. However, the PDF pipeline never matched hyperlink annotations to text clusters: the data was available but never propagated.

Add spatial matching of PDF hyperlinks to text clusters during page assembly, then pass the resolved hyperlink through the reading order model to the final DoclingDocument.

Changes:
- Add hyperlink field to TextElement (base_models.py)
- Add _match_hyperlink() to PageAssembleModel that spatially matches cluster bboxes against hyperlink annotation rects, aggregating coverage per URI to handle wrapped links with multiple rects
- Thread hyperlink= through add_text(), add_heading(), add_list_item() calls in ReadingOrderModel
- Drop hyperlink on text merge when constituent clusters disagree
- Fall back to Path when AnyUrl validation fails (matches HTML backend)
- Regenerate affected ground truth files
- Add unit tests for _match_hyperlink() edge cases

Closes #3096

Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com>

* fix(pdf): recover unmatched hyperlinks as REFERENCE items

Track consumed hyperlink indices during cluster matching so that hyperlinks which don't meet the overlap threshold are not silently dropped. Unmatched hyperlinks that overlap text clusters are materialized as synthetic REFERENCE TextElements.

Also propagate hyperlinks through FORMULA items in reading-order assembly.

Signed-off-by: macbook <macbook@users.noreply.github.com>
Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com>

* DCO Remediation Commit for hussainarslan <m.hussain.arslan@gmail.com>

I, hussainarslan <m.hussain.arslan@gmail.com>, hereby add my Signed-off-by to this commit: 71a8d900bd22f0d3c377215cad99b40794d49d59

Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com>

* test: regenerate reference data for hyperlink propagation

Update groundtruth files for 2206.01062, 2305.03393v1, and textbox.docx to reflect hyperlink fields on text items and new REFERENCE items for unmatched hyperlinks.

Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com>

* Revert "test: regenerate reference data for hyperlink propagation"

This reverts commit 374f478ebf71e7e43b1b98d7106375c7f3d77101.

Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com>

* Revert "fix(pdf): recover unmatched hyperlinks as REFERENCE items"

This reverts commit e0e9b9225fa5caa0a7b2578a29600a9531edc624.

Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com>

* test: regenerate groundtruth for hyperlink propagation

Regenerate the affected docling_v2 PDF and DOCX fixtures after rerunning the hyperlink propagation groundtruth suite and switch the hyperlink coverage selection helper to the explicit items() form to avoid a type ignore.

Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com>

* test: regenerate groundtruth for docling_core 1.10.0

Regenerate the affected docling_v2 PDF and DOCX fixtures with the current docling_core schema version so committed groundtruth stays compatible with CI and example loading.

Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com>

---------

Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com>
Signed-off-by: macbook <macbook@users.noreply.github.com>
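
The matching described in the first commit (aggregate per-URI coverage of a cluster bbox, then drop the link on merge when clusters disagree) could look roughly like the sketch below. This is an illustration only, not the code from this change: `Rect`, `Link`, `match_hyperlink`, `merge_hyperlinks`, the top-left coordinate origin, and the 0.5 coverage threshold are all assumptions made for the example, and it does not use docling's actual types.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Rect:
    """Hypothetical axis-aligned box; assumes x0 <= x1 and y0 <= y1."""
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)

    def intersection_area(self, other: "Rect") -> float:
        w = min(self.x1, other.x1) - max(self.x0, other.x0)
        h = min(self.y1, other.y1) - max(self.y0, other.y0)
        return max(0.0, w) * max(0.0, h)


@dataclass(frozen=True)
class Link:
    """Hypothetical stand-in for one hyperlink annotation rect and its URI."""
    rect: Rect
    uri: str


def match_hyperlink(
    cluster: Rect, links: Iterable[Link], min_coverage: float = 0.5
) -> Optional[str]:
    """Return the URI covering the largest share of the cluster, if any.

    Coverage is summed per URI so a link wrapped over several lines
    (several rects, one URI) counts as a single candidate.
    The 0.5 threshold is an assumed value for illustration.
    """
    area = cluster.area()
    if area <= 0:
        return None
    coverage: dict[str, float] = defaultdict(float)
    for link in links:
        coverage[link.uri] += cluster.intersection_area(link.rect) / area
    if not coverage:
        return None
    best_uri, best_cov = max(coverage.items(), key=lambda kv: kv[1])
    return best_uri if best_cov >= min_coverage else None


def merge_hyperlinks(hyperlinks: Iterable[Optional[str]]) -> Optional[str]:
    """Keep a hyperlink on merged text only if all parts agree on it."""
    unique = set(hyperlinks)
    return unique.pop() if len(unique) == 1 else None


# A link wrapped across two lines: neither rect alone reaches the
# threshold, but together they cover 55% of the cluster.
cluster = Rect(0, 0, 100, 20)
wrapped = [
    Link(Rect(50, 0, 100, 10), "https://example.com/a"),
    Link(Rect(0, 10, 60, 20), "https://example.com/a"),
]
matched = match_hyperlink(cluster, wrapped)      # "https://example.com/a"
partial = match_hyperlink(cluster, wrapped[:1])  # None (25% coverage)
```

Summing coverage per URI rather than per rect is what lets the wrapped link above match even though each individual rect covers only a quarter to a third of the cluster.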
Hussain Arslan committed
524edcce73869a87b6ccf73bc16324742bd36648
Parent: 3a64f41
Committed by GitHub <noreply@github.com>
on 3/31/2026, 6:58:21 AM