fix(apply): runtime guards + tests for null application_url and dedup

Caught during today's live score-10 apply run: worker-0 picked up a
LinkedIn job whose enrichment never extracted an application_url, so
the agent went to the LinkedIn listing page and tried to apply against
that. It got confused and left the job marked in_progress with no real
target. Separately, Temporal Technologies bypassed the per-company cap
entirely because resolve_company_key() returned None for jobs whose
`company` column is NULL and whose `strategy` is an aggregator
(jobspy/linkedin).

apply/launcher.py — acquire_job:
  - Add `application_url IS NOT NULL AND application_url != ''` to the
    candidate WHERE clause. Aggregator-discovered rows that never got a
    real apply URL no longer leak into the apply queue.
  - Defensive post-SELECT check: if a row somehow makes it through with
    no application_url anyway, mark it `manual_only` and emit the state
    transition rather than fall through to row["url"] (which is the
    listing page on aggregators). A one-time DB cleanup moved 358 such
    rows to `manual_only`.
  - Drop `apply_url = row["application_url"] or row["url"]` fallback —
    if application_url is missing we shouldn't apply at all.
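
A minimal sketch of the acquire_job guard described above, not the
actual launcher code. The `jobs` table, its `apply_status` column, and
the ordering are assumptions made for illustration; only the
`application_url` filter and the `manual_only` fallback come from this
change.

```python
import sqlite3


def acquire_job(conn):
    """Return (job_id, apply_url) for the next applicable job, or None.

    Hypothetical schema: jobs(id, url, application_url, apply_status).
    """
    while True:
        row = conn.execute(
            """
            SELECT id, application_url
            FROM jobs
            WHERE apply_status IS NULL
              AND application_url IS NOT NULL AND application_url != ''
            ORDER BY id
            LIMIT 1
            """
        ).fetchone()
        if row is None:
            return None  # nothing left in the queue

        job_id, application_url = row
        apply_url = (application_url or "").strip()
        if not apply_url:
            # Defensive check: the row passed the SQL filter but still has no
            # usable URL (e.g. whitespace only). Park it instead of falling
            # back to the listing URL, then look for the next candidate.
            conn.execute(
                "UPDATE jobs SET apply_status = 'manual_only' WHERE id = ?",
                (job_id,),
            )
            conn.commit()
            continue

        conn.execute(
            "UPDATE jobs SET apply_status = 'in_progress' WHERE id = ?",
            (job_id,),
        )
        conn.commit()
        return job_id, apply_url
```

Note there is no `or row["url"]` fallback anywhere: a job without an
application_url is either filtered out or parked, never applied to.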

scoring/tailor.py — resolve_company_key:
  - Add ATS-tenant slug extraction from `application_url` for Greenhouse
    (incl. boards.greenhouse.io, job-boards.greenhouse.io, and the .eu
    subdomain), Lever, Ashby, and Workday. This fixes the Temporal case:
    a LinkedIn-discovered job whose Apply button resolves to
    job-boards.greenhouse.io/temporaltechnologies/... now buckets to
    `temporaltechnologies` instead of None, so the per-company cap
    actually fires.
  - Resolution priority documented: explicit `company` column wins,
    then ATS-tenant slug from `application_url`, then `site` for direct
    employer strategies (existing behavior).
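
The resolution order above can be sketched roughly as follows. This is
an illustration under assumptions, not the code in scoring/tailor.py:
the helper name `ats_tenant_slug`, the function signature, the
lower-casing of keys, and the exact host/path patterns for each ATS are
all hypothetical.

```python
from urllib.parse import urlparse

# Aggregator strategies named in this commit; `site` is not a company for these.
AGGREGATOR_STRATEGIES = {"jobspy", "linkedin"}


def ats_tenant_slug(application_url):
    """Best-effort tenant slug from an ATS apply URL, or None."""
    if not application_url:
        return None
    parsed = urlparse(application_url)
    host = (parsed.hostname or "").lower()
    segments = [s for s in parsed.path.split("/") if s]

    # Workday (assumed shape): <tenant>.wdN.myworkdayjobs.com/...
    if host.endswith(".myworkdayjobs.com"):
        return host.split(".")[0]

    # Greenhouse, Lever, Ashby (assumed shape): tenant is the first path segment.
    path_tenant_hosts = (".greenhouse.io", ".lever.co", ".ashbyhq.com")
    if host.endswith(path_tenant_hosts) and segments:
        slug = segments[0].lower()
        # Embedded Greenhouse forms carry the tenant in a query param; not handled.
        return None if slug == "embed" else slug
    return None


def resolve_company_key(company, application_url, site=None, strategy=None):
    # 1. An explicit company column wins.
    if company and company.strip():
        return company.strip().lower()
    # 2. Otherwise fall back to the ATS tenant slug from the apply URL.
    slug = ats_tenant_slug(application_url)
    if slug:
        return slug
    # 3. Finally, `site` for direct employer strategies only.
    if site and strategy not in AGGREGATOR_STRATEGIES:
        return site.strip().lower()
    return None
```

With this ordering, a LinkedIn row with a NULL company and a
job-boards.greenhouse.io/temporaltechnologies/... apply URL resolves to
`temporaltechnologies` rather than None.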

pipeline.py — _run_stage_streaming:
  - Backoff when the runner reports zero progress despite a non-zero
    pending count. Caps + filters can drop every candidate; without
    this the streaming loop hot-spins logging "No untailored jobs ..."
    every millisecond. Now waits one poll interval before retry.
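
A rough sketch of the backoff, assuming a loop driven by two callables.
`run_batch`, `count_pending`, `poll_interval`, and the injectable
`sleep` are hypothetical names; the real _run_stage_streaming also has
its own exit conditions, which are not modeled here.

```python
import time


def run_stage_streaming(run_batch, count_pending, poll_interval=5.0, sleep=time.sleep):
    """Drain pending work, pausing whenever a pass makes no progress."""
    total = 0
    while count_pending() > 0:
        processed = run_batch()
        total += processed
        if processed == 0:
            # Pending rows exist but caps/filters rejected all of them.
            # Wait one poll interval instead of retrying immediately.
            sleep(poll_interval)
    return total
```

Passing `sleep` as a parameter keeps the loop testable without real
waits.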

tests/test_acquire_job_dedup.py (new): 5 tests covering
  - duplicate application_url with sibling already applied / in_progress
  - a row never blocks itself
  - NULL / empty application_url skipped
  - mixed candidate set: only the one with a valid apply URL fires

tests/test_resolve_company_key.py (new): 10 tests covering Greenhouse
  (.com/.eu/legacy hosts), Lever, Ashby, Workday slug extraction;
  resolution priority; and the empty-input edge case.

DB cleanup applied to ~/.applypilot/applypilot.db: 358 rows with no
application_url moved from queue → `manual_only`. The 3 in_progress
rows left by the killed apply run were reset to NULL.

company_limits.yaml: Temporal Technologies capped at 1 in-flight
(matching Netflix). With the new application_url-based resolver this
cap now actually applies to LinkedIn-discovered Temporal jobs.

321 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Alex Ibarra committed
ea783ace539ff0397199c09fd6bebc11bb321c30
Parent: 7540206