fix(documents): accept single-letter item suffixes so company-specific items (CAT Item 1D) validate (edgartools-4agg)

The canonical-key recognizers capped item suffixes at [a-c], so a legitimate
company-specific item like Caterpillar's "Item 1D" (Information about our
Executive Officers) was flagged non-canonical even though it was extracted with
correct boundaries. SEC standard items only go to 1C, but filers voluntarily
add others; the structure item_<number><letter> is unambiguous, so widen the
suffix to any single letter [a-z]:

- _canonical_item_count (body-header fallback gating)
- _CANONICAL_ITEM_KEY / _BARE_ITEM_KEY (_is_valid_section_key validator)
- _BODY_ITEM_HEADER (body-header detection regex — previously failed to match
  an "Item 1D. Title" heading at all, silently dropping it)
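
A minimal sketch of the suffix widening, for illustration only. The pattern
names OLD_ITEM_KEY / NEW_ITEM_KEY and the helper is_valid_key are hypothetical;
the actual regexes in edgartools may be structured differently.

```python
import re

# Hypothetical patterns (not the real edgartools regexes): the old form caps
# the optional letter suffix at a-c, the new form allows any single letter.
OLD_ITEM_KEY = re.compile(r"^item_\d+[a-c]?$")
NEW_ITEM_KEY = re.compile(r"^item_\d+[a-z]?$")


def is_valid_key(key: str, pattern: re.Pattern = NEW_ITEM_KEY) -> bool:
    """Return True when key has the shape item_<number>[<single letter>]."""
    return pattern.match(key) is not None


# "item_1d" passes only with the widened suffix; a descriptive free-text key
# fails under both patterns because of the trailing anchor.
for key in ("item_1c", "item_1d", "item_1_business_overview"):
    print(key, is_valid_key(key, OLD_ITEM_KEY), is_valid_key(key, NEW_ITEM_KEY))
```

Because the suffix stays optional and limited to one character, multi-letter
or free-text tails still fail the anchored match.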

Descriptive free-text keys are still rejected (they don't match item_<num><letter>$).
CAT now scans fully canonical (24/24); corpus non-canonical keys drop from 2 to 1
(only wfc remains — the label-less/link-less Citi-class case, tracked in 4agg).
Tests: test_is_valid_section_key extended with Item 1D; TestCaterpillarItem1D
ground-truth (Executive Officers content, full Part-I item-1 family canonical).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Dwight Gunning committed
c9cd6f1b44db4e88ae1c56136c1cec16bda78ca8
Parent: d5df418