SIGN IN SIGN UP
microsoft / markitdown UNCLAIMED

Python tool for converting files and office documents to Markdown.

0 0 73 Python

fix: handle deeply nested HTML that triggers RecursionError (#1644)

* fix: handle deeply nested HTML that triggers RecursionError (#1636)

Large HTML files with deep DOM nesting (e.g., SEC EDGAR filings) cause
markdownify's recursive DOM traversal to exceed Python's default
recursion limit (1000). Previously this RecursionError was caught by
the top-level _convert() dispatcher, which then fell through to
PlainTextConverter — silently returning the raw HTML as 'markdown'
with no warning.

This fix catches RecursionError in HtmlConverter.convert() and falls
back to BeautifulSoup's iterative get_text() method, which handles
arbitrary nesting depths. A warning is emitted so callers know the
output is plain text rather than full markdown.

Root cause chain:
1. HtmlConverter.convert() calls markdownify.convert_soup() (recursive)
2. Deeply nested HTML (>~400 levels) triggers RecursionError
3. _convert() catches all Exceptions, stores in failed_attempts
4. PlainTextConverter.accepts() matches text/html via 'text/' prefix
5. PlainTextConverter.convert() returns raw HTML bytes as text
6. Caller receives 'markdown' that is actually unconverted HTML

* refactor: address review feedback on RecursionError fallback

- Move 'import warnings' to module top level (was inside except block)
- Make test environment-independent by temporarily lowering
  sys.setrecursionlimit(200) instead of relying on depth=500 being
  sufficient on all platforms; original limit restored in finally block
- Add strict=True keyword argument to opt out of the plain-text
  fallback and let RecursionError propagate to the caller

* test: use result.markdown instead of deprecated result.text_content

---------

Co-authored-by: jigangz <jigangz@github.com>
J
jigangz committed
604bba13da2f43b756b49122cb65bdedb85b1dff
Parent: 63cbbd9
Committed by GitHub <noreply@github.com> on 4/15/2026, 10:26:44 PM