fix(pdf): propagate hyperlinks to DoclingDocument text items (#3131)

* fix(pdf): propagate hyperlinks to DoclingDocument text items

docling-parse already extracts PdfHyperlink objects with bounding
rectangles and URIs into SegmentedPdfPage.hyperlinks, and TextItem
already has a hyperlink field. However, the PDF pipeline never matched
hyperlink annotations to text clusters, so the data was available but
never propagated.

Add spatial matching of PDF hyperlinks to text clusters during page
assembly, then pass the resolved hyperlink through the reading order
model to the final DoclingDocument.
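
A minimal sketch of the spatial matching idea, not the actual
_match_hyperlink() code. Rect, Link, match_hyperlink and the 0.5
threshold are hypothetical names and values chosen for illustration,
and a top-left coordinate origin is assumed. Coverage is summed per
URI, so a link that wraps across lines (several rects, one URI) can
still pass the threshold even when no single rect does:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rect:
    """Axis-aligned box with l <= r and t <= b (top-left origin assumed)."""
    l: float
    t: float
    r: float
    b: float

    def area(self) -> float:
        return max(0.0, self.r - self.l) * max(0.0, self.b - self.t)

    def intersection_area(self, other: "Rect") -> float:
        w = min(self.r, other.r) - max(self.l, other.l)
        h = min(self.b, other.b) - max(self.t, other.t)
        return w * h if w > 0 and h > 0 else 0.0


@dataclass(frozen=True)
class Link:
    """One hyperlink annotation rect and its target URI."""
    rect: Rect
    uri: str


def match_hyperlink(
    cluster: Rect, links: list[Link], threshold: float = 0.5
) -> Optional[str]:
    """Return the URI covering the largest share of the cluster, if any.

    Coverage is the fraction of the cluster area overlapped by link rects,
    summed per URI. Summing assumes rects of one URI do not overlap each other.
    """
    cluster_area = cluster.area()
    if cluster_area == 0:
        return None

    coverage: dict[str, float] = defaultdict(float)
    for link in links:
        coverage[link.uri] += cluster.intersection_area(link.rect) / cluster_area

    if not coverage:
        return None

    best_uri, best_cov = max(coverage.items(), key=lambda kv: kv[1])
    return best_uri if best_cov >= threshold else None


# A two-line cluster with a wrapped link: each rect alone covers 30%,
# together they cover 60% of the cluster.
cluster = Rect(0, 0, 100, 20)
links = [
    Link(Rect(40, 0, 100, 10), "https://example.com/a"),
    Link(Rect(0, 10, 60, 20), "https://example.com/a"),
    Link(Rect(0, 0, 10, 10), "https://example.com/b"),
]
result = match_hyperlink(cluster, links)
```

Normalizing by the cluster area rather than the link area is one
possible choice; it favors links that span most of a text cluster
over small links embedded in a long paragraph.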

Changes:
- Add hyperlink field to TextElement (base_models.py)
- Add _match_hyperlink() to PageAssembleModel that spatially matches
  cluster bboxes against hyperlink annotation rects, aggregating
  coverage per URI to handle wrapped links with multiple rects
- Thread hyperlink= through add_text(), add_heading(), add_list_item()
  calls in ReadingOrderModel
- Drop hyperlink on text merge when constituent clusters disagree
- Fall back to Path when AnyUrl validation fails (matches HTML backend)
- Regenerate affected ground truth files
- Add unit tests for _match_hyperlink() edge cases
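
The merge rule and the Path fallback from the list above can be
sketched as follows. Both helpers are hypothetical. The URL check
uses urllib.parse as a stand-in for AnyUrl validation and is not
equivalent to it; it only illustrates the "URL if valid, else Path"
shape of the fallback:

```python
from pathlib import Path
from typing import Optional, Union
from urllib.parse import urlsplit


def merged_hyperlink(links: list[Optional[str]]) -> Optional[str]:
    """Keep a hyperlink on merged text only if all constituents agree.

    A missing link (None) counts as disagreement with a present one.
    """
    unique = set(links)
    if len(unique) == 1:
        return next(iter(unique))
    return None


def to_url_or_path(target: str) -> Union[str, Path]:
    """Return the target as a URL string if it looks like one, else a Path.

    Stand-in check (assumption): a scheme plus a network location.
    """
    parts = urlsplit(target)
    if parts.scheme and parts.netloc:
        return target
    return Path(target)


same = merged_hyperlink(["https://example.com/a", "https://example.com/a"])
mixed = merged_hyperlink(["https://example.com/a", "https://example.com/b"])
partial = merged_hyperlink(["https://example.com/a", None])
url = to_url_or_path("https://example.com/x")
local = to_url_or_path("docs/readme.md")
```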

Closes #3096

Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com>

* fix(pdf): recover unmatched hyperlinks as REFERENCE items

Track consumed hyperlink indices during cluster matching so that
hyperlinks which don't meet the overlap threshold are not silently
dropped. Unmatched hyperlinks that overlap text clusters are
materialized as synthetic REFERENCE TextElements. Also propagate
hyperlinks through FORMULA items in reading-order assembly.

Signed-off-by: macbook <macbook@users.noreply.github.com>
Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com>

* DCO Remediation Commit for hussainarslan <m.hussain.arslan@gmail.com>

I, hussainarslan <m.hussain.arslan@gmail.com>, hereby add my Signed-off-by to this commit: 71a8d900bd22f0d3c377215cad99b40794d49d59

Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com>

* test: regenerate reference data for hyperlink propagation

Update groundtruth files for 2206.01062, 2305.03393v1, and
textbox.docx to reflect hyperlink fields on text items and
new REFERENCE items for unmatched hyperlinks.

Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com>

* Revert "test: regenerate reference data for hyperlink propagation"

This reverts commit 374f478ebf71e7e43b1b98d7106375c7f3d77101.

Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com>

* Revert "fix(pdf): recover unmatched hyperlinks as REFERENCE items"

This reverts commit e0e9b9225fa5caa0a7b2578a29600a9531edc624.

Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com>

* test: regenerate groundtruth for hyperlink propagation

Regenerate the affected docling_v2 PDF and DOCX fixtures after rerunning the hyperlink propagation groundtruth suite. Also switch the hyperlink coverage selection helper to the explicit items() form to avoid a type ignore.

Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com>

* test: regenerate groundtruth for docling_core 1.10.0

Regenerate the affected docling_v2 PDF and DOCX fixtures with the current docling_core schema version, so that the committed groundtruth stays compatible with CI and example loading.

Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com>

---------

Signed-off-by: hussainarslan <m.hussain.arslan@gmail.com>
Signed-off-by: macbook <macbook@users.noreply.github.com>
Hussain Arslan committed
524edcce73869a87b6ccf73bc16324742bd36648
Parent: 3a64f41
Committed by GitHub <noreply@github.com> on 3/31/2026, 6:58:21 AM