fix(apply): runtime guards + tests for null application_url and dedup
Caught during today's live score-10 apply run: worker-0 picked up a
LinkedIn job whose enrichment never extracted an application_url. The
agent then opened the LinkedIn listing page, tried to apply against
it, got confused, and marked the job in_progress with no real target.
Separately, Temporal Technologies jobs bypassed the per-company cap
entirely because resolve_company_key() returned None for jobs whose
`company` column is NULL and whose `strategy` is an aggregator
(jobspy/linkedin).
apply/launcher.py — acquire_job:
- Add `application_url IS NOT NULL AND application_url != ''` to the
candidate WHERE clause. Aggregator-discovered rows that never got a
real apply URL no longer leak into the apply queue.
- Defensive post-SELECT check: if a row somehow makes it through with
no application_url anyway, mark it `manual_only` and emit the state
transition rather than fall through to row["url"] (which is the
listing page on aggregators). A one-time DB cleanup moved 358 such
rows to `manual_only`.
- Drop `apply_url = row["application_url"] or row["url"]` fallback —
if application_url is missing we shouldn't apply at all.
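
A minimal sketch of the acquire_job guard described above, not the actual launcher code. The table layout, the `apply_status` column name, and the function name `acquire_job_sketch` are assumptions made for illustration; only the WHERE-clause condition, the `manual_only` marking, and the removal of the `row["url"]` fallback come from this change.

```python
import sqlite3

# Hypothetical minimal schema; the real jobs table has more columns.
SCHEMA = """
CREATE TABLE jobs (
    id INTEGER PRIMARY KEY,
    url TEXT NOT NULL,
    application_url TEXT,
    apply_status TEXT
)
"""

# Candidate query with the new guard: rows without a real apply URL
# never enter the apply queue.
CANDIDATE_SQL = """
SELECT id, url, application_url
FROM jobs
WHERE apply_status IS NULL
  AND application_url IS NOT NULL AND application_url != ''
ORDER BY id
LIMIT 1
"""


def acquire_job_sketch(conn):
    """Return (job_id, apply_url) for the next applicable job, or None."""
    conn.row_factory = sqlite3.Row
    row = conn.execute(CANDIDATE_SQL).fetchone()
    if row is None:
        return None
    apply_url = row["application_url"]
    if not apply_url:
        # Defensive post-SELECT check: never fall back to row["url"],
        # which is the listing page on aggregators.
        conn.execute(
            "UPDATE jobs SET apply_status = 'manual_only' WHERE id = ?",
            (row["id"],),
        )
        conn.commit()
        return None
    conn.execute(
        "UPDATE jobs SET apply_status = 'in_progress' WHERE id = ?",
        (row["id"],),
    )
    conn.commit()
    return row["id"], apply_url
```

The post-SELECT branch is unreachable while the WHERE clause is intact; it exists so that a later edit to the query cannot silently reintroduce the listing-page fallback.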
scoring/tailor.py — resolve_company_key:
- Add ATS-tenant slug extraction from `application_url` for Greenhouse
(incl. boards.greenhouse.io, job-boards.greenhouse.io, and the .eu
subdomain), Lever, Ashby, and Workday. This fixes the Temporal case:
a LinkedIn-discovered job whose Apply button resolves to
job-boards.greenhouse.io/temporaltechnologies/... now buckets to
`temporaltechnologies` instead of None, so the per-company cap
actually fires.
- Resolution priority documented: explicit `company` column wins,
then ATS-tenant slug from `application_url`, then `site` for direct
employer strategies (existing behavior).
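
The resolution order above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in scoring/tailor.py: the exact .eu hostnames, the `DIRECT_STRATEGIES` set, the lowercasing of keys, and the function signatures are all hypothetical. It also assumes the usual URL shapes, where Greenhouse, Lever, and Ashby put the tenant in the first path segment and Workday puts it in the leftmost subdomain.

```python
from urllib.parse import urlparse

# Hosts where the tenant slug is the first path segment.
# The two .eu hostnames are assumed, not taken from the change itself.
GREENHOUSE_HOSTS = {
    "boards.greenhouse.io",
    "job-boards.greenhouse.io",
    "boards.eu.greenhouse.io",
    "job-boards.eu.greenhouse.io",
}
PATH_TENANT_HOSTS = GREENHOUSE_HOSTS | {"jobs.lever.co", "jobs.ashbyhq.com"}

# Hypothetical name for the set of direct-employer strategies.
DIRECT_STRATEGIES = {"direct"}


def ats_tenant_slug(application_url):
    """Extract the ATS tenant slug from an apply URL, or return None."""
    if not application_url:
        return None
    parsed = urlparse(application_url)
    host = (parsed.hostname or "").lower()
    segments = [s for s in parsed.path.split("/") if s]
    if host in PATH_TENANT_HOSTS:
        return segments[0].lower() if segments else None
    if host.endswith(".myworkdayjobs.com"):
        # Workday: <tenant>.wdN.myworkdayjobs.com
        return host.split(".")[0]
    return None


def resolve_company_key_sketch(company, application_url, site, strategy):
    """Priority: explicit company, then ATS tenant slug, then site."""
    if company and company.strip():
        return company.strip().lower()
    slug = ats_tenant_slug(application_url)
    if slug:
        return slug
    if strategy in DIRECT_STRATEGIES and site:
        return site.strip().lower()
    return None
```

With this ordering, an aggregator row with a NULL `company` and an apply URL under job-boards.greenhouse.io/temporaltechnologies/ resolves to `temporaltechnologies` rather than None.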
pipeline.py — _run_stage_streaming:
- Back off when the runner reports zero progress despite a non-zero
pending count. Caps + filters can drop every candidate; without
this the streaming loop hot-spins, logging "No untailored jobs ..."
every millisecond. It now waits one poll interval before retrying.
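
A simplified sketch of the backoff, assuming a loop driven by two callables. The names `run_batch`, `count_pending`, `poll_interval`, and `max_iterations`, and the exit on an empty queue, are hypothetical and only stand in for whatever _run_stage_streaming actually uses; the point shown is the sleep on a zero-progress pass.

```python
import time


def run_stage_streaming_sketch(
    run_batch,
    count_pending,
    poll_interval=5.0,
    sleep=time.sleep,
    max_iterations=None,
):
    """Process batches until nothing is pending; return the total processed.

    run_batch() returns how many items it processed on this pass.
    count_pending() returns how many items are still waiting.
    """
    iterations = 0
    total = 0
    while max_iterations is None or iterations < max_iterations:
        iterations += 1
        if count_pending() == 0:
            break
        processed = run_batch()
        total += processed
        if processed == 0:
            # Work is pending but caps/filters dropped every candidate:
            # wait one poll interval instead of retrying immediately.
            sleep(poll_interval)
    return total
```

Passing `sleep` as a parameter keeps the loop testable without real delays.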
tests/test_acquire_job_dedup.py (new): 5 tests covering
- duplicate application_url with sibling already applied / in_progress
- a row never blocks itself
- NULL / empty application_url skipped
- mixed candidate set: only the one with a valid apply URL fires
tests/test_resolve_company_key.py (new): 10 tests covering Greenhouse
(.com/.eu/legacy hosts), Lever, Ashby, Workday slug extraction;
resolution priority; and the empty-input edge case.
DB cleanup applied to ~/.applypilot/applypilot.db: 358 rows with no
application_url moved from queue → `manual_only`. The 3 in_progress
rows from the killed apply run reset back to NULL.
company_limits.yaml: Temporal Technologies capped at 1 in-flight
(matching Netflix). With the new application_url-based resolver this
cap now actually applies to LinkedIn-discovered Temporal jobs.
321 unit tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Alex Ibarra committed
ea783ace539ff0397199c09fd6bebc11bb321c30
Parent: 7540206