fix(documents): accept single-letter item suffixes so company-specific items (CAT Item 1D) validate (edgartools-4agg)
The canonical-key recognizers capped item suffixes at [a-c], so a legitimate company-specific item like Caterpillar's "Item 1D" (Information about our Executive Officers) was flagged non-canonical even though it was extracted with correct boundaries.

SEC standard items only go to 1C, but filers voluntarily add others; the structure item_<number><letter> is unambiguous, so widen the suffix to any single letter [a-z]:

- _canonical_item_count (body-header fallback gating)
- _CANONICAL_ITEM_KEY / _BARE_ITEM_KEY (_is_valid_section_key validator)
- _BODY_ITEM_HEADER (body-header detection regex; previously failed to match an "Item 1D. Title" heading at all, silently dropping it)

Descriptive free-text keys still rejected (they don't match item_<num><letter>$).

CAT now scans fully canonical (24/24); corpus non-canonical keys drop from 2 to 1 (only wfc remains, the label-less/link-less Citi-class case, tracked in 4agg).

Tests: test_is_valid_section_key extended with Item 1D; TestCaterpillarItem1D ground-truth (Executive Officers content, full Part-I item-1 family canonical).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
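For illustration, the effect of widening the suffix class can be sketched as below. This is a minimal sketch, not the project's code: the patterns and the names OLD_ITEM_KEY, NEW_ITEM_KEY, BODY_ITEM_HEADER, is_valid_section_key and header_to_key are hypothetical stand-ins, and the real regexes in the repository may be shaped differently.

```python
import re

# Hypothetical stand-ins for the key validators; the real patterns may differ.
OLD_ITEM_KEY = re.compile(r"^item_\d+[a-c]?$")  # suffix capped at a-c
NEW_ITEM_KEY = re.compile(r"^item_\d+[a-z]?$")  # any single-letter suffix

# Hypothetical body-header pattern: "Item <number><optional letter>. <title>"
BODY_ITEM_HEADER = re.compile(r"^\s*Item\s+(\d+)([A-Z])?\.\s+(.+)$", re.IGNORECASE)


def is_valid_section_key(key, pattern=NEW_ITEM_KEY):
    """Return True when the key has the item_<number><letter> shape."""
    return pattern.match(key) is not None


def header_to_key(line):
    """Map a heading such as 'Item 1D. Title' to a key such as 'item_1d'."""
    m = BODY_ITEM_HEADER.match(line)
    if m is None:
        return None
    number, letter = m.group(1), m.group(2) or ""
    return f"item_{number}{letter.lower()}"


if __name__ == "__main__":
    heading = "Item 1D. Information about our Executive Officers"
    key = header_to_key(heading)
    print(key, is_valid_section_key(key), is_valid_section_key(key, OLD_ITEM_KEY))
```

Under these assumed patterns, the Caterpillar heading maps to item_1d, which the widened validator accepts and the [a-c] version rejects. A descriptive free-text key still fails because it lacks the item_<number> prefix.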
Dwight Gunning committed
c9cd6f1b44db4e88ae1c56136c1cec16bda78ca8
Parent: d5df418