feat: added support for parsing LaTeX (.tex) documents (#2890)
* feat: added support for parsing LaTeX (.tex) documents
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
* feat: implement PR #2890 feedback for LaTeX backend
- Add text formatting options (bold, italic, underline) for LaTeX macros
- Enhance image embedding with PIL and ImageRef.from_pil()
- Refactor list processing to use GroupItem structure
- Refactor bibliography to use GroupItem structure
- Add nested list test coverage
- All tests passing (39/39), all linters passing
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
* DCO Remediation Commit for Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
I, Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>, hereby add my Signed-off-by to this commit: f19f135b431d489cd8bf3982524505a0bbd8696d
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
* feat: enhance latex backend with robustness fixes and ground truth
- Add custom macro expansion for improved text quality
- Fix preamble filtering to remove metadata garbage
- Support recursive \input{} and \include{} file loading
- Organize test data into subdirectories for complex papers
- Add full end-to-end ground truth for 4 major arXiv papers (Attention, Mistral, DeepSeek, OTSL)
- Pass all 41 unit tests and pre-commit checks
Addresses @cau-git feedback for ground-truth data.
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
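The recursive \input{} / \include{} loading mentioned above can be pictured roughly as follows. This is a minimal sketch, not the backend's actual code: `load_tex`, `INPUT_RE`, and the choice to resolve every path against the top-level document's directory are assumptions made for illustration, and commented-out \input lines are not skipped here.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical pattern: matches \input{name} and \include{name}.
INPUT_RE = re.compile(r"\\(?:input|include)\{([^}]+)\}")


def load_tex(path, root=None, _seen=frozenset()):
    """Return the text of `path` with nested \\input/\\include files inlined."""
    path = Path(path).resolve()
    # Assumption: included paths are relative to the main document's folder.
    root = path.parent if root is None else root
    # Guard against missing files and include cycles.
    if path in _seen or not path.is_file():
        return ""
    seen = _seen | {path}
    text = path.read_text(encoding="utf-8", errors="replace")

    def inline(match):
        target = root / match.group(1).strip()
        if target.suffix != ".tex":
            target = target.with_name(target.name + ".tex")
        return load_tex(target, root, seen)

    # A callable replacement keeps backslashes in the loaded text literal.
    return INPUT_RE.sub(inline, text)


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "sec").mkdir()
    (base / "main.tex").write_text(r"Title \input{sec/intro} End", encoding="utf-8")
    (base / "sec" / "intro.tex").write_text(r"Intro \input{sec/details}", encoding="utf-8")
    (base / "sec" / "details.tex").write_text("Details", encoding="utf-8")
    merged = load_tex(base / "main.tex")
```

Tracking visited files in `_seen` means a file that includes itself, directly or indirectly, contributes an empty string instead of recursing forever.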
* fix: minor formatting in test file
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
* feat: enhance LaTeX backend with robust math and figure support
- Fixed re.error: bad escape in macro expansion by using lambda in re.sub
- Fixed sentences breaking at inline math ($) by preserving it within paragraphs
- Improved figure environment with proper grouping and structured representation
- Fixed crashes on documents starting with % comments
- Added comprehensive unit tests and updated all ground truth data
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
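For context on the re.error fix above: when the replacement passed to re.sub is a string, Python scans it for escape sequences, so a LaTeX expansion such as \mathbb{R} fails on the unknown escape \m. A callable replacement is inserted verbatim. The sketch below illustrates the idea only; `expand_macro` and its lookahead pattern are hypothetical names and choices, not the backend's code.

```python
import re


def expand_macro(text: str, name: str, replacement: str) -> str:
    # Match the macro name only when it is not followed by another letter,
    # so expanding "R" leaves a longer macro such as "Rx" untouched.
    pattern = re.compile(r"\\" + re.escape(name) + r"(?![A-Za-z])")
    # The lambda returns the replacement as-is; a plain string here would be
    # parsed for escapes and could raise "re.error: bad escape".
    return pattern.sub(lambda _match: replacement, text)


source = r"Let \R denote the reals."
expanded = expand_macro(source, "R", r"\mathbb{R}")

# The string form of the same substitution fails on the \m escape.
try:
    re.sub(r"\\R(?![A-Za-z])", r"\mathbb{R}", source)
    string_form_failed = False
except re.error:
    string_form_failed = True
```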
* WIP: saving work for laptop migration
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
* fix: resolve most of the line-breaking issues (some still remain)
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
* fix: generalized LaTeX macro parsing and robustness improvements
This commit addresses several issues with LaTeX parsing:
- Correctly handle unknown macros (like \ion{N}{2}) inline to avoid line breaks.
- Fix extraction of structural macros (section, caption, etc.) vs text-only groups.
- Address PR feedback regarding inline math spacing and splitting.
- Regenerate ground truth files reflecting these improvements.
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
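One way to picture the inline handling of unknown macros is a fallback that keeps a macro's brace arguments as plain text within the surrounding sentence instead of starting a new text item. The sketch below is an assumption-laden illustration: `KNOWN_STRUCTURAL`, `MACRO_RE`, `flatten_unknown`, and the choice to join arguments with a space are all hypothetical, and nested braces are not handled.

```python
import re

# Hypothetical set of macros that a real parser would treat structurally.
KNOWN_STRUCTURAL = {"section", "caption"}

# A macro name followed by one or more flat {...} arguments.
MACRO_RE = re.compile(r"\\([A-Za-z]+)((?:\{[^{}]*\})+)")


def flatten_unknown(text: str) -> str:
    """Replace unknown macros with their argument text, kept inline."""

    def repl(match):
        if match.group(1) in KNOWN_STRUCTURAL:
            return match.group(0)  # leave structural macros for later stages
        args = re.findall(r"\{([^{}]*)\}", match.group(2))
        return " ".join(args)

    return MACRO_RE.sub(repl, text)


flattened = flatten_unknown(r"The \ion{N}{2} line is strong.")
```

Because the substitution happens inside the sentence, the text before and after the macro stays in a single paragraph.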
* style: apply automatic formatting fixes
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
* style: fix ruff linter and formatter errors
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
* fix: typing issues identified by mypy
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
* style: apply formatting fixes to tests
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
* fix: update groundtruth files for latex backend
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
* fix: resolve the awkward line-breaking issue, caused by not accounting for the text buffer
* fix: add the ground truth files missing from the previous commit
* DCO Remediation Commit for Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
I, Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>, hereby add my Signed-off-by to this commit: 7e032635ef3220da43175c3069eee6ca44f091f3
I, Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>, hereby add my Signed-off-by to this commit: aeba6883843dd589bc4c93bcf91c24dfdd6ed036
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
* chore: run pre-commit as requested
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
---------
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
Aditya Sasidhar committed
e6ccb8b2c1d99fa6e2660d7c4bb866af7960bc2d
Parent: 9721321
Committed by GitHub <noreply@github.com>
on 2/10/2026, 2:13:09 PM