feat(discovery): builtin.com HTML scraper (config-driven cities/categories)

builtin.com is fully server-side rendered (Drupal stack): no Algolia,
no JSON API, no GraphQL. Verified via Patchright network probe: only
analytics, auth, fontawesome, and a CF challenge fire on page load.
The rendered HTML is rich, though — each job-card carries job_id,
title, company slug, work mode, location, salary range, seniority,
and a short summary, so we can skip enrichment and seed scoring
straight from the listing.

Configuration lives in searches.yaml under a new builtin: block:

    builtin:
      cities: ["Seattle"]              # empty = global, no city filter
      categories: ["dev-engineering"]
      remote_only: false               # if true, /jobs/remote/{cat} prefix

Defaults: categories=["dev-engineering"], cities=[] (global),
remote_only=False. Other users in NYC/SF/etc. just edit searches.yaml.
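
For illustration, a minimal sketch of how the builtin: block and its
defaults could be turned into listing URLs. Only the /jobs/remote/{cat}
prefix and the ?page=N parameter come from this commit; the names
BuiltinConfig, load_builtin_config and listing_urls, the non-remote
/jobs/{cat} path, and the "city" query parameter are assumptions made for
the example, not the scraper's actual code.

```python
from dataclasses import dataclass, field
from urllib.parse import urlencode

BASE_URL = "https://builtin.com"


@dataclass
class BuiltinConfig:
    # Defaults mirror the ones stated in this commit message.
    cities: list[str] = field(default_factory=list)  # empty = global
    categories: list[str] = field(default_factory=lambda: ["dev-engineering"])
    remote_only: bool = False


def load_builtin_config(searches: dict) -> BuiltinConfig:
    """Read the builtin: block from an already-parsed searches.yaml dict."""
    block = searches.get("builtin") or {}
    return BuiltinConfig(
        cities=list(block.get("cities") or []),
        categories=list(block.get("categories") or ["dev-engineering"]),
        remote_only=bool(block.get("remote_only", False)),
    )


def listing_urls(cfg: BuiltinConfig, page: int = 1) -> list[str]:
    """Build one listing URL per (category, city) pair for a given page."""
    prefix = "/jobs/remote" if cfg.remote_only else "/jobs"  # "/jobs" is assumed
    urls = []
    for cat in cfg.categories:
        # No cities configured means a single global search per category.
        for city in cfg.cities or [None]:
            params = {}
            if city is not None:
                params["city"] = city  # hypothetical parameter name
            params["page"] = page
            urls.append(f"{BASE_URL}{prefix}/{cat}?{urlencode(params)}")
    return urls
```

With an empty searches.yaml this yields a single global dev-engineering
URL; adding cities multiplies the list by the number of cities.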

Pagination walks ?page=N until it hits an empty or 404 page, capped at
MAX_PAGES=50 as a safety net (Seattle/dev-engineering tops out around
page 22). A 0.5s delay between page fetches keeps us under the CF rate
limit.
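
The pagination rule above can be sketched roughly as follows. MAX_PAGES
and the 0.5s delay are taken from this commit; the paginate helper, the
PAGE_DELAY_S name, and the fetch_cards callback (assumed to return None on
a 404 and an empty list on an empty page) are hypothetical and only there
to keep the example self-contained.

```python
import time
from typing import Callable, Iterator, Optional

MAX_PAGES = 50       # safety cap from the commit message
PAGE_DELAY_S = 0.5   # delay between page fetches


def paginate(
    fetch_cards: Callable[[int], Optional[list]],
    max_pages: int = MAX_PAGES,
    delay_s: float = PAGE_DELAY_S,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[tuple[int, list]]:
    """Yield (page, cards) until an empty/404 page or the page cap."""
    for page in range(1, max_pages + 1):
        if page > 1:
            sleep(delay_s)  # throttle every fetch after the first
        cards = fetch_cards(page)
        if not cards:       # None (404) or [] (empty page) ends the walk
            return
        yield page, cards
```

Passing sleep in as a parameter lets unit tests run without real delays.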

INSERTs go through write_with_retry like the other scrapers tonight.

Smoke run on Seattle: 249 new jobs across 28 pages in 27.2s. 306
unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Alex Ibarra committed
5e2dbae3460a58ef3bf170dcd4a078a2ab5d99bf
Parent: f5faae1