fix(discovery): root-cause the smartextract page failures

Three independent root causes for the 'PowerToFly/FlexJobs/Remote.co
all timing out' pattern in tonight's run:

(1) networkidle never settles on tracker-heavy sites. PowerToFly
    loads its DOM in 0.7s but ad/analytics pings keep the network
    busy for >60s, so wait_for_load_state('networkidle') always
    ran into its timeout. domcontentloaded is the right wait
    condition for snapshotting the rendered DOM; the networkidle
    wait is now opportunistic, capped at 10s, and a timeout is
    swallowed.

(2) Vanilla playwright trips Cloudflare's TLS fingerprint, causing
    HTTP/2 protocol errors on bot-protected sites. Switched the
    smartextract and enrichment scrapers to patchright, which is
    a drop-in playwright fork with TLS-fingerprint and JS-stealth
    patches. Falls back to playwright on import error so unit-test
    environments without patchright still work.

(3) Phase 1/2 LLM calls returned just '```json' with no body when
    Gemini's thinking budget overran the (now 8192-token) cap. The
    extract_json parser threw on the empty stripped body. Now both
    phases detect a truncated/empty response *before* parsing and
    transparently retry once with the quality fallback chain
    (Pro/GPT-4/Sonnet). Empirically, these models return structured
    output instead of just opening the fence.
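
A minimal sketch of the pre-parse check in (3), for reference. The
names strip_fence, is_truncated and call_with_retry are hypothetical,
the two callables stand in for the primary model and the fallback
chain, and json.loads stands in for the real extract_json parser; the
actual code may differ.

```python
import json
import re

# Markdown fence marker, built indirectly so this example stays fence-safe.
FENCE = "`" * 3


def strip_fence(raw):
    """Drop a leading fence (with optional language tag) and a trailing fence."""
    text = raw.strip()
    text = re.sub(r"^" + FENCE + r"[A-Za-z0-9_-]*", "", text)
    text = re.sub(FENCE + r"$", "", text)
    return text.strip()


def is_truncated(raw):
    """True if the response is empty, or opens a fence it never closes."""
    text = raw.strip()
    if not strip_fence(text):
        return True  # nothing at all, or only the opening fence
    return text.startswith(FENCE) and not text.endswith(FENCE)


def call_with_retry(prompt, primary, fallback):
    """Call primary; on an empty/truncated reply, retry once with fallback.

    Both callables are assumed to take a prompt and return the raw text.
    """
    raw = primary(prompt)
    if is_truncated(raw):
        raw = fallback(prompt)  # single retry, as described in (3)
    if is_truncated(raw):
        raise ValueError("LLM response empty or truncated after fallback")
    return json.loads(strip_fence(raw))
```

Checking for truncation before parsing keeps the parser's own errors
reserved for responses that are complete but malformed.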

FlexJobs and Remote.co remain unreachable from this network: even
plain `curl https://www.flexjobs.com/` returns nothing, while
PowerToFly responds normally. That's an IP-level Akamai block, not
something patchright can fix; documenting it here so we don't keep
re-investigating. Their failures now fall through to the existing
ERROR path in collect_page_intelligence and the run continues.

306 unit tests pass; live PowerToFly probe finishes in 10.9s
(was: 60s timeout) and returns 11 captured API responses.
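
For reference, a rough sketch of the wait strategy from (1) combined
with the import fallback from (2). The function name
load_and_snapshot and its return shape are hypothetical, and the
sketch assumes patchright mirrors playwright's sync_api module
(including TimeoutError); the final fallback to the builtin
TimeoutError exists only so the snippet imports where neither
package is installed.

```python
try:
    # Preferred: patchright, described above as a drop-in playwright fork.
    from patchright.sync_api import TimeoutError as PlaywrightTimeout
except ImportError:
    try:
        from playwright.sync_api import TimeoutError as PlaywrightTimeout
    except ImportError:
        # Assumption for this sketch only: neither package is available.
        PlaywrightTimeout = TimeoutError


def load_and_snapshot(page, url, idle_ceiling_ms=10_000):
    """Navigate, then wait for networkidle only opportunistically.

    `page` is expected to behave like a sync-API Page object.
    Returns (html, settled), where settled tells whether the network
    went idle within the ceiling.
    """
    # The DOM being parsed is enough to snapshot the rendered page.
    page.goto(url, wait_until="domcontentloaded")
    try:
        page.wait_for_load_state("networkidle", timeout=idle_ceiling_ms)
        settled = True
    except PlaywrightTimeout:
        # Tracker-heavy sites may never go idle; carry on regardless.
        settled = False
    return page.content(), settled
```

Returning the settled flag lets callers log which sites never went
idle without treating that as a failure.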

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Alex Ibarra committed
e9bebd5e5b5ae67ddbb18f0efa295ad68fd06d12
Parent: b311cbb