fix: handle deeply nested HTML that triggers RecursionError (#1644)

* fix: handle deeply nested HTML that triggers RecursionError (#1636)

Large HTML files with deep DOM nesting (e.g., SEC EDGAR filings) cause
markdownify's recursive DOM traversal to exceed Python's default
recursion limit (1000). Previously, the resulting RecursionError was
caught by the top-level _convert() dispatcher, which then fell through
to PlainTextConverter and returned the raw HTML as 'markdown' with no
warning.

This fix catches RecursionError in HtmlConverter.convert() and falls
back to BeautifulSoup's iterative get_text() method, which handles
arbitrary nesting depths. A warning is emitted so callers know the
output is plain text rather than full markdown.
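
A minimal sketch of the fallback pattern, for illustration only. It
uses stdlib stand-ins rather than the real libraries: render_recursive()
plays the role of markdownify's recursive walk, extract_text_iterative()
plays the role of get_text(), and build_nested() and convert() are
hypothetical helpers, not the actual HtmlConverter code.

```python
import warnings


def build_nested(depth, text="hello"):
    """Build a `depth`-level chain of ("div", [child]) nodes without recursion."""
    node = text
    for _ in range(depth):
        node = ("div", [node])
    return node


def render_recursive(node):
    """Hypothetical stand-in for a recursive DOM-to-markdown walk."""
    if isinstance(node, str):
        return node
    _tag, children = node
    return "".join([render_recursive(child) for child in children])


def extract_text_iterative(node):
    """Collect text with an explicit stack, so the recursion limit never applies."""
    parts, stack = [], [node]
    while stack:
        current = stack.pop()
        if isinstance(current, str):
            parts.append(current)
        else:
            # Push children reversed so they pop in document order.
            stack.extend(reversed(current[1]))
    return "".join(parts)


def convert(node):
    """Try the recursive path first; on RecursionError, warn and degrade."""
    try:
        return render_recursive(node)
    except RecursionError:
        warnings.warn(
            "Nesting too deep for recursive conversion; returning plain text.",
            RuntimeWarning,
            stacklevel=2,
        )
        return extract_text_iterative(node)
```

The warning is what separates this from the old behavior: the caller
still gets degraded output, but is told that it is degraded.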

Root cause chain:
1. HtmlConverter.convert() calls markdownify.convert_soup() (recursive)
2. Deeply nested HTML (>~400 levels) triggers RecursionError
3. _convert() catches all Exceptions and stores them in failed_attempts
4. PlainTextConverter.accepts() matches text/html via 'text/' prefix
5. PlainTextConverter.convert() returns raw HTML bytes as text
6. Caller receives 'markdown' that is actually unconverted HTML
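
The chain above can be reproduced with a small sketch. The stub classes
and dispatch() below are simplified, hypothetical stand-ins for the real
converters and _convert(); only the control flow is meant to match.

```python
class HtmlConverterStub:
    """Simplified stand-in: accepts HTML, fails as a deep document would."""

    def accepts(self, mimetype):
        return mimetype == "text/html"

    def convert(self, data):
        # Simulates steps 1-2: the recursive walk exhausts the stack.
        raise RecursionError("maximum recursion depth exceeded")


class PlainTextConverterStub:
    """Simplified stand-in: accepts anything with a 'text/' prefix."""

    def accepts(self, mimetype):
        return mimetype.startswith("text/")  # step 4: also matches text/html

    def convert(self, data):
        return data.decode("utf-8")  # step 5: raw HTML passed through


def dispatch(data, mimetype, converters):
    """Try converters in order, swallowing failures as in step 3."""
    failed_attempts = []
    for converter in converters:
        if not converter.accepts(mimetype):
            continue
        try:
            return converter.convert(data), failed_attempts
        except Exception as exc:  # RecursionError subclasses Exception
            failed_attempts.append((type(converter).__name__, exc))
    raise RuntimeError(f"no converter succeeded: {failed_attempts!r}")


result, failures = dispatch(
    b"<div><p>hi</p></div>",
    "text/html",
    [HtmlConverterStub(), PlainTextConverterStub()],
)
print(result)  # step 6: the caller gets unconverted HTML back
```

Because RecursionError derives from RuntimeError, a broad
`except Exception` is enough to hide it from the caller.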

* refactor: address review feedback on RecursionError fallback

- Move 'import warnings' to module top level (was inside except block)
- Make test environment-independent by temporarily lowering the
  recursion limit with sys.setrecursionlimit(200) instead of relying
  on depth=500 being sufficient on all platforms; the original limit
  is restored in a finally block
- Add strict=True keyword argument to opt out of the plain-text
  fallback and let RecursionError propagate to the caller
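
A rough sketch of the test pattern and the strict flag described above.
render(), convert(), and build_nested() are hypothetical stand-ins for
the converter under test; only the limit handling and the strict=True
behavior mirror the change.

```python
import sys
import warnings


def build_nested(depth, text="hello"):
    """Wrap `text` in `depth` one-element tuples, built without recursion."""
    node = text
    for _ in range(depth):
        node = (node,)
    return node


def render(node):
    """Hypothetical recursive renderer standing in for the markdown path."""
    return node if isinstance(node, str) else render(node[0])


def convert(node, strict=False):
    try:
        return render(node)
    except RecursionError:
        if strict:
            raise  # caller opted out of the plain-text fallback
        warnings.warn("falling back to plain text", RuntimeWarning, stacklevel=2)
        while not isinstance(node, str):  # iterative descent, no recursion
            node = node[0]
        return node


def test_deep_nesting_fallback():
    doc = build_nested(500)
    original_limit = sys.getrecursionlimit()
    sys.setrecursionlimit(200)  # force the failure regardless of platform
    try:
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            assert convert(doc) == "hello"
        assert any(w.category is RuntimeWarning for w in caught)

        try:
            convert(doc, strict=True)
        except RecursionError:
            pass
        else:
            raise AssertionError("strict=True should propagate RecursionError")
    finally:
        sys.setrecursionlimit(original_limit)  # never leak the lowered limit
```

Restoring the limit in finally matters because the setting is
process-wide; a leaked limit of 200 could break unrelated tests.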

* test: use result.markdown instead of deprecated result.text_content

---------

Co-authored-by: jigangz <jigangz@github.com>

jigangz committed 1mo ago

604bba13da2f43b756b49122cb65bdedb85b1dff

Parent: 63cbbd9

Committed by GitHub <noreply@github.com> on 4/15/2026, 10:26:44 PM