build(scripts): add stress-test data generator for migration timing
Add ``scripts/seed_junction_load.py``, a backend-agnostic script that
bulk-inserts synthetic parent rows (dashboards, slices, users, roles,
tables, dbs) and many-to-many junction rows for the four largest
association tables targeted by the composite-PK migration:
``dashboard_slices``, ``slice_user``, ``dashboard_user``,
``dashboard_roles``.
Designed for measuring migration runtime at varying scales: run with
a series of size flags (100K / 1M / 5M / 10M for the target table)
and time the migration at each scale to check the predicted
``O(N log N)`` extrapolation against measured runtimes.
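A minimal sketch of how the extrapolation could be checked once timings exist. It assumes the model ``t = c * N * log2(N)``. The helper names ``fit_nlogn_constant`` and ``predict_seconds`` are hypothetical and not part of the script, and the sample timings are synthetic placeholders, not measurements:

```python
import math


def fit_nlogn_constant(samples):
    """Least-squares fit (through the origin) of c in t = c * N * log2(N).

    `samples` is an iterable of (row_count, elapsed_seconds) pairs.
    """
    samples = list(samples)
    num = sum(t * n * math.log2(n) for n, t in samples)
    den = sum((n * math.log2(n)) ** 2 for n, _ in samples)
    return num / den


def predict_seconds(c, n):
    """Predicted runtime for n rows under the N log N model."""
    return c * n * math.log2(n)


# Synthetic timings generated from a made-up constant; replace these with
# wall-clock times measured at the 100K and 1M scales.
ASSUMED_C = 1e-6
samples = [(n, ASSUMED_C * n * math.log2(n)) for n in (100_000, 1_000_000)]

c = fit_nlogn_constant(samples)
for n in (5_000_000, 10_000_000):
    print(f"{n:>10,} rows: ~{predict_seconds(c, n):.1f}s predicted")
```

Comparing the predicted values against the measured 5M and 10M runs then shows whether the ``O(N log N)`` assumption holds at scale.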
Properties:
- **Reproducible**: a deterministic cross-product walk through parent IDs
produces a stable pair sequence, so every run generates the same rows.
- **Idempotent**: re-running with the same target is a no-op; with a
higher target, only new rows are added.
- **Backend-agnostic**: connects via Superset's standard ``DATABASE_*``
env vars (or ``SUPERSET__SQLALCHEMY_DATABASE_URI``). Branches on
dialect for ``BINARY(16)`` vs ``UUID`` vs TEXT/BLOB UUID columns.
- **Batched**: bulk INSERT 10K rows per statement.
- **Per-phase timing**: logs elapsed wall time for the parents phase,
the junctions phase as a whole, and per junction-table.
- **Avoidance set**: loads existing junction pairs into a Python set
so re-runs on top of pre-existing data don't collide on the
uniqueness constraint.
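The reproducible, idempotent, batched, and avoidance-set properties above can be sketched together as follows. This is an illustration under assumptions, not the script's actual code: ``new_pairs`` and ``batched`` are hypothetical names, and the real script may walk the cross product in a different order.

```python
from itertools import islice, product


def new_pairs(left_ids, right_ids, existing, target):
    """Yield junction pairs in a fixed order until the table holds `target` rows.

    `existing` is the avoidance set of pairs already present in the table.
    """
    needed = target - len(existing)
    if needed <= 0:
        return  # already at or above target: re-running is a no-op
    # Sorting the parent IDs makes the cross-product walk deterministic.
    walk = product(sorted(left_ids), sorted(right_ids))
    fresh = (pair for pair in walk if pair not in existing)
    yield from islice(fresh, needed)


def batched(rows, size=10_000):
    """Group rows into lists of at most `size`, one list per INSERT statement."""
    it = iter(rows)
    while batch := list(islice(it, size)):
        yield batch


existing = set()
first_run = list(new_pairs([3, 1, 2], [20, 10], existing, target=4))
existing.update(first_run)
print(first_run)
print(list(new_pairs([3, 1, 2], [20, 10], existing, target=6)))
```

Because the walk order is fixed, a second run with a higher target skips the pairs already inserted and emits only the remainder.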
Usage (inside the Superset container):
docker exec superset-superset-1 \
/app/.venv/bin/python /app/scripts/seed_junction_load.py \
--dashboard-slices 1000000
Defaults target a "large multi-team install" shape: 1M
``dashboard_slices``, 100K each ``slice_user`` / ``dashboard_user``,
10K ``dashboard_roles``. Override per-table via flags.
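For illustration, the per-table overrides might be wired up roughly like this. Only ``--dashboard-slices`` and ``--dry-run`` appear in this message; the other three flag names are assumptions derived from the table names, and ``build_parser`` is a hypothetical helper:

```python
import argparse

# Default targets taken from the "large multi-team install" shape.
DEFAULT_TARGETS = {
    "dashboard_slices": 1_000_000,
    "slice_user": 100_000,
    "dashboard_user": 100_000,
    "dashboard_roles": 10_000,
}


def build_parser():
    """Build a parser with one row-count flag per junction table."""
    parser = argparse.ArgumentParser(description="Seed junction-table load data")
    for table, default in DEFAULT_TARGETS.items():
        # Flag names other than --dashboard-slices are assumed, not confirmed.
        flag = "--" + table.replace("_", "-")
        parser.add_argument(flag, type=int, default=default,
                            help=f"target row count for {table}")
    parser.add_argument("--dry-run", action="store_true",
                        help="plan without writing")
    return parser


args = build_parser().parse_args(["--dashboard-slices", "5000000", "--dry-run"])
print(args.dashboard_slices, args.slice_user, args.dry_run)
```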
Tested locally on MySQL (the current eval stack):
- 200/100/100/50 row mini-run produced expected counts.
- Re-running at the same target is a no-op (idempotent).
- ``--dry-run`` plans without writing.
Junction tables not yet covered (``sqlatable_user``, ``rls_filter_*``,
``report_schedule_user``) are typically small in production and
require additional parent seeding (RLS filters, report schedules)
that wasn't worth the scope here. Adding them is straightforward by
extending ``JUNCTIONS`` and writing the corresponding parent seeder.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mike Bridge committed
58a1a1a8d136a09360e26880f9be4d812b2eeee3
Parent: fef0a64