feat: add support for Word document comments extraction (#2834)

* feat: add support for Word document comments extraction (fixes #485)

Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>

* fix: address PR review feedback for comments extraction

- Change DocItemLabel.PARAGRAPH to TEXT (deprecating PARAGRAPH)
- Change initials format from '(initials)' to 'author: initials'
- Change timestamp format to include 'time:' prefix
- Update test assertions and regenerate ground truth files

Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>

* chore: update comment format and move format documentation from inline comment to function docstring

Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>

* Use docling-core v2.58.0 add_comment() API to properly link Word document
comments to their annotated text items via FineRef references.

- Import FineRef from docling_core.types.doc.document
- Refactor _add_comments to use doc.add_comment(targets=[...]) API
- Parse DOCX XML for commentRangeStart/End markers in _extract_comment_ranges
- Track paragraph-to-items mapping for comment linking
- Fallback to unlinked comments in COMMENT_SECTION group when no targets found
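
A minimal sketch of the range-tracking idea described above, using only the Python standard library. It is not the code from this commit: the function name `map_comments_to_paragraphs` and the sample XML are hypothetical, and it assumes the range markers sit inside `w:p` elements. It records which paragraphs each comment ID covers, including ranges that start in one paragraph and end in a later one.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"


def map_comments_to_paragraphs(body_xml: str) -> dict[str, list[int]]:
    """Map each comment ID to the indices of the paragraphs its range covers.

    Hypothetical helper for illustration only. Markers placed outside a
    paragraph are ignored in this sketch.
    """
    root = ET.fromstring(body_xml.strip())
    open_ids: set[str] = set()  # ranges started but not yet closed
    mapping: dict[str, list[int]] = {}

    for index, para in enumerate(root.iter(f"{W}p")):
        # Ranges still open from earlier paragraphs also cover this one.
        touched = set(open_ids)
        for node in para.iter():
            cid = node.get(f"{W}id")
            if cid is None:
                continue
            if node.tag == f"{W}commentRangeStart":
                open_ids.add(cid)
                touched.add(cid)
            elif node.tag == f"{W}commentRangeEnd":
                open_ids.discard(cid)
        for cid in touched:
            mapping.setdefault(cid, []).append(index)
    return mapping


# Hand-written sample: comment 0 spans two paragraphs, comment 1 spans one.
SAMPLE = f"""
<w:body xmlns:w="{W_NS}">
  <w:p><w:commentRangeStart w:id="0"/><w:r><w:t>First</w:t></w:r></w:p>
  <w:p><w:r><w:t>Second</w:t></w:r><w:commentRangeEnd w:id="0"/></w:p>
  <w:p><w:commentRangeStart w:id="1"/><w:r><w:t>Third</w:t></w:r><w:commentRangeEnd w:id="1"/></w:p>
</w:body>
"""

if __name__ == "__main__":
    print(map_comments_to_paragraphs(SAMPLE))
```

A paragraph-index mapping like this is one way to decide which converted items a comment could be linked to. A comment whose ID never appears in the mapping would take the unlinked fallback path.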

Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>

* - Extract comment IDs directly during paragraph element processing to match element IDs
- Clear paragraph mappings at start of each conversion for consistent behavior
- Always create comment groups and use add_comment() API with targets
- Add _get_comment_ids_for_element() helper to extract comment markers from XML
- Regenerate ground-truth files (JSON/MD/itxt) with comments field properly linked

Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>

* fix: remove incorrect ground-truth files, keep versions with comments field

Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>

* fix: reference comment groups instead of text items in comments field

Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>

---------

Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Siva committed
b6ca09451963c606b5d280b74e559278717bb911
Parent: e413e68
Committed by GitHub <noreply@github.com> on 1/26/2026, 8:58:46 AM