feat: added support for parsing LaTeX (.tex) documents (#2890)

* feat: added support for parsing LaTeX (.tex) documents

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* feat: implement PR #2890 feedback for LaTeX backend

- Add text formatting options (bold, italic, underline) for LaTeX macros
- Enhance image embedding with PIL and ImageRef.from_pil()
- Refactor list processing to use GroupItem structure
- Refactor bibliography to use GroupItem structure
- Add nested list test coverage
- All tests passing (39/39), all linters passing

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* DCO Remediation Commit for Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

I, Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>, hereby add my Signed-off-by to this commit: f19f135b431d489cd8bf3982524505a0bbd8696d

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* feat: enhance latex backend with robustness fixes and ground truth

- Add custom macro expansion for improved text quality
- Fix preamble filtering to remove metadata garbage
- Support recursive \input{} and \include{} file loading
- Organize test data into subdirectories for complex papers
- Add full end-to-end ground truth for 4 major arXiv papers (Attention, Mistral, DeepSeek, OTSL)
- Pass all 41 unit tests and pre-commit checks
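
The recursive `\input{}` / `\include{}` loading mentioned above can be sketched roughly as follows. This is an illustration only, not the backend's actual code: `load_tex`, `INCLUDE_RE`, and the choice to resolve paths against the top-level file's directory are assumptions, and commented-out `\input` lines are not handled.

```python
import re
from pathlib import Path
from typing import Optional

# Hypothetical pattern: matches \input{...} and \include{...}, but not \includegraphics{...}.
INCLUDE_RE = re.compile(r"\\(?:input|include)\{([^}]+)\}")


def load_tex(path: Path, root: Optional[Path] = None, seen: frozenset = frozenset()) -> str:
    """Return the source of `path` with nested \\input/\\include files inlined."""
    path = path.resolve()
    # Assumption: included paths are relative to the top-level file's directory.
    root = path.parent if root is None else root
    if path in seen:
        return ""  # break include cycles instead of recursing forever
    source = path.read_text(encoding="utf-8")

    def inline(match: re.Match) -> str:
        target = root / match.group(1).strip()
        if target.suffix != ".tex":
            # LaTeX lets the .tex extension be omitted.
            target = target.with_name(target.name + ".tex")
        if not target.is_file():
            return match.group(0)  # leave unresolved includes untouched
        return load_tex(target, root, seen | {path})

    # A callable replacement inserts the loaded text literally (no escape processing).
    return INCLUDE_RE.sub(inline, source)
```

Calling `load_tex(Path("main.tex"))` would return a single string with every resolvable include expanded in place; missing files are left as their original `\input{...}` commands.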

Addresses @cau-git feedback for ground-truth data.

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* fix: minor formatting in test file

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* feat: enhance LaTeX backend with robust math and figure support

- Fixed re.error: bad escape in macro expansion by using lambda in re.sub
- Fixed sentences breaking at inline math ($) by preserving it within paragraphs
- Improved figure environment with proper grouping and structured representation
- Fixed crashes on documents starting with % comments
- Added comprehensive unit tests and updated all ground truth data
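
For context on the `re.error: bad escape` fix: when the replacement passed to `re.sub` is a string, Python parses it as a template, so a LaTeX body such as `\mathbb{R}` fails on `\m`. A callable replacement is inserted literally. The sketch below shows the idea under stated assumptions; `expand_macros` and the `macros` mapping are hypothetical names, not the backend's API.

```python
import re


def expand_macros(text: str, macros: dict) -> str:
    """Expand argument-less custom macros, e.g. {"R": r"\\mathbb{R}"}."""
    for name, body in macros.items():
        # Match \name only when no letter follows, so \R does not hit \Rightarrow.
        pattern = re.compile(r"\\" + re.escape(name) + r"(?![A-Za-z])")
        # Using a lambda means `body` is inserted as-is; passing `body` directly
        # would have its backslashes interpreted as escapes in a template.
        text = pattern.sub(lambda _match, body=body: body, text)
    return text
```

For example, `expand_macros(r"x \in \R^n", {"R": r"\mathbb{R}"})` yields `x \in \mathbb{R}^n`, whereas `re.sub(r"\\R", r"\mathbb{R}", r"\R")` raises `re.error`.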

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* WIP: saving work for laptop migration

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* got rid of most of the line-breaking issues; some still exist

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* fix: generalized LaTeX macro parsing and robustness improvements

This commit addresses several issues with LaTeX parsing:
- Correctly handle unknown macros (like \ion{N}{2}) inline to avoid line breaks.
- Fix extraction of structural macros (section, caption, etc.) vs text-only groups.
- Address PR feedback regarding inline math spacing and splitting.
- Regenerate ground truth files reflecting these improvements.

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* style: apply automatic formatting fixes

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* style: fix ruff linter and formatter errors

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* fix: typing issues identified by mypy

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* style: apply formatting fixes to tests

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* fix: update groundtruth files for latex backend

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* fixed the awkward line-breaking issue; turns out I was not accounting for the text buffer

* I forgot to add the ground truth, so here it is

* DCO Remediation Commit for Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

I, Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>, hereby add my Signed-off-by to this commit: 7e032635ef3220da43175c3069eee6ca44f091f3
I, Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>, hereby add my Signed-off-by to this commit: aeba6883843dd589bc4c93bcf91c24dfdd6ed036

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* Ran pre-commit as requested

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

---------

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
Aditya Sasidhar committed
e6ccb8b2c1d99fa6e2660d7c4bb866af7960bc2d
Parent: 9721321
Committed by GitHub <noreply@github.com> on 2/10/2026, 2:13:09 PM