fix(discovery): root-cause the smartextract page failures
Three independent root causes for the 'PowerToFly/FlexJobs/Remote.co
all timing out' pattern in tonight's run:
(1) networkidle never settles on tracker-heavy sites. PowerToFly
loads its DOM in 0.7s but ad/analytics pings keep the network
busy for >60s, so the wait_for_load_state('networkidle')
timeout would always blow up. domcontentloaded is the right
wait condition for snapshotting the rendered DOM; the
networkidle wait is now opportunistic with a 10s ceiling and
swallowed on timeout.
(2) Vanilla playwright trips Cloudflare's TLS fingerprint, causing
HTTP/2 protocol errors on bot-protected sites. Switched the
smartextract and enrichment scrapers to patchright, which is
a drop-in playwright fork with TLS-fingerprint and JS-stealth
patches. Falls back to playwright on import error so unit-test
environments without patchright still work.
(3) Phase 1/2 LLM calls returned just '```json' with no body when
Gemini's thinking budget overran the (now 8192-token) cap. The
extract_json parser threw on the empty stripped body. Now both
phases detect a truncated/empty response *before* parsing and
transparently retry once with the quality fallback chain
(Pro/GPT-4/Sonnet) — empirically these models actually return
structured output instead of just opening the fence.
FlexJobs and Remote.co remain unreachable from this network — even
plain `curl https://www.flexjobs.com/` returns nothing, while
PowerToFly responds normally. That's an IP-level Akamai block, not
something patchright can fix; documenting it here so we don't keep
re-investigating. Their failures now fall through to the existing
ERROR path in collect_page_intelligence and the run continues.
306 unit tests pass; live PowerToFly probe finishes in 10.9s
(was: 60s timeout) and returns 11 captured API responses.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> A
Alex Ibarra committed
e9bebd5e5b5ae67ddbb18f0efa295ad68fd06d12
Parent: b311cbb