feat: add support for Word document comments extraction (#2834)
* feat: add support for Word document comments extraction (fixes #485)

Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>

* fix: address PR review feedback for comments extraction

- Change DocItemLabel.PARAGRAPH to TEXT (deprecating PARAGRAPH)
- Change initials format from '(initials)' to 'author: initials'
- Change timestamp format to include 'time:' prefix
- Update test assertions and regenerate ground truth files

Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>

* chore: update comment format and move format documentation from inline comment to function docstring

Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>

* Use docling-core v2.58.0 add_comment() API to properly link Word document comments to their annotated text items via FineRef references.

- Import FineRef from docling_core.types.doc.document
- Refactor _add_comments to use doc.add_comment(targets=[...]) API
- Parse DOCX XML for commentRangeStart/End markers in _extract_comment_ranges
- Track paragraph-to-items mapping for comment linking
- Fallback to unlinked comments in COMMENT_SECTION group when no targets found

Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>

* - Extract comment IDs directly during paragraph element processing to match element IDs
  - Clear paragraph mappings at start of each conversion for consistent behavior
  - Always create comment groups and use add_comment() API with targets
  - Add _get_comment_ids_for_element() helper to extract comment markers from XML
  - Regenerate ground-truth files (JSON/MD/itxt) with comments field properly linked

Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>

* fix: remove incorrect ground-truth files, keep versions with comments field

Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>

* fix: reference comment groups instead of text items in comments field

Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>

---------

Signed-off-by: s1v4-d <leelasaisivasubrahmanyamdurga@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
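
For readers unfamiliar with how Word stores comment anchors, the following is a minimal sketch of the idea behind reading commentRangeStart markers per paragraph. It is not the code from this commit: the function names `comment_ids_for_paragraph` and `map_comments_to_paragraphs` and the sample XML are hypothetical, and it uses only the Python standard library rather than the docling or docling-core APIs. It also assumes each comment range starts and ends within a single paragraph; ranges spanning several paragraphs would need extra tracking of the matching commentRangeEnd.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml in a DOCX package.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"


def comment_ids_for_paragraph(paragraph: ET.Element) -> list[str]:
    """Return the ids of comments whose range starts inside this paragraph.

    Hypothetical helper for illustration only.
    """
    ids = []
    for marker in paragraph.iter(f"{W}commentRangeStart"):
        comment_id = marker.get(f"{W}id")
        if comment_id is not None:
            ids.append(comment_id)
    return ids


def map_comments_to_paragraphs(body_xml: str) -> dict[str, list[int]]:
    """Map each comment id to the indices of the paragraphs it annotates."""
    root = ET.fromstring(body_xml.strip())
    mapping: dict[str, list[int]] = {}
    for index, paragraph in enumerate(root.iter(f"{W}p")):
        for comment_id in comment_ids_for_paragraph(paragraph):
            mapping.setdefault(comment_id, []).append(index)
    return mapping


# Hand-written sample resembling a document body with two commented paragraphs.
SAMPLE = f"""
<w:body xmlns:w="{W_NS}">
  <w:p>
    <w:commentRangeStart w:id="0"/>
    <w:r><w:t>Annotated text</w:t></w:r>
    <w:commentRangeEnd w:id="0"/>
  </w:p>
  <w:p><w:r><w:t>Plain text</w:t></w:r></w:p>
  <w:p>
    <w:commentRangeStart w:id="1"/>
    <w:r><w:t>More annotated text</w:t></w:r>
    <w:commentRangeEnd w:id="1"/>
  </w:p>
</w:body>
"""

if __name__ == "__main__":
    print(map_comments_to_paragraphs(SAMPLE))
```

A mapping of this shape (comment id to paragraph index) is one plausible way to decide which converted items a comment should target; comments with no entry in the mapping would correspond to the unlinked fallback case mentioned in the message.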
Siva committed
b6ca09451963c606b5d280b74e559278717bb911
Parent: e413e68
Committed by GitHub <noreply@github.com>
on 1/26/2026, 8:58:46 AM