docs: add CLAUDE.md, architecture docs, and /sesh-mode skill (#6247)
* feat: replace fixed MetricDataPoint fields with dynamic tag HashMap
* feat: replace ParquetField enum with constants and dynamic validation
* feat: derive sort order and bloom filters from batch schema
* feat: union schema accumulation and schema-agnostic ingest validation
* feat: dynamic column lookup in split writer
* feat: remove ParquetSchema dependency from indexing actors
* refactor: deduplicate test batch helpers
* style: fix lint warnings
* feat(31): sort schema foundation — proto, parser, display, validation, window, TableConfig
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: rustdoc link errors — use backticks for private items
* feat(31): compaction metadata types — extend split metadata, postgres model, field lookup
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(31): wire TableConfig into sort path, add compaction KV metadata
Wire TableConfig-driven sort order into ParquetWriter and add
self-describing Parquet file metadata for compaction:
- ParquetWriter::new() takes &TableConfig, resolves sort fields at
construction via parse_sort_fields() + ParquetField::from_name()
- sort_batch() uses resolved fields with per-column direction (ASC/DESC)
- SS-1 debug_assert verification: re-sort and check identity permutation
- build_compaction_key_value_metadata(): embeds sort_fields, window_start,
window_duration, num_merge_ops, row_keys (base64) in Parquet kv_metadata
- SS-5 verify_ss5_kv_consistency(): kv_metadata matches source struct
- write_to_file_with_metadata() replaces write_to_file()
- prepare_write() shared method for bytes and file paths
- ParquetWriterConfig gains to_writer_properties_with_metadata()
- ParquetSplitWriter passes TableConfig through
- All callers in quickwit-indexing updated with TableConfig::default()
- 23 storage tests pass including META-07 self-describing roundtrip
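The self-describing key-value metadata above can be sketched as a plain string map; the helper name and key names below mirror this commit message, but the signature and the exact serialization are assumptions, not the real `build_compaction_key_value_metadata()` API.

```rust
use std::collections::HashMap;

// Illustrative sketch of the compaction metadata embedded in each
// Parquet file's key-value store. Keys follow the commit message
// (sort_fields, window_start, window_duration, num_merge_ops);
// row_keys (base64) is omitted to keep the sketch std-only.
fn build_compaction_kv(
    sort_fields: &str,
    window_start: i64,
    window_duration_secs: i64,
    num_merge_ops: u32,
) -> HashMap<String, String> {
    HashMap::from([
        ("sort_fields".to_string(), sort_fields.to_string()),
        ("window_start".to_string(), window_start.to_string()),
        ("window_duration_secs".to_string(), window_duration_secs.to_string()),
        ("num_merge_ops".to_string(), num_merge_ops.to_string()),
    ])
}

fn main() {
    let kv = build_compaction_kv("tenant_id,timestamp:desc", 1_700_000_000, 3_600, 0);
    // The SS-5 consistency check compares this map back against the
    // source struct after a read; here we just verify the roundtrip.
    assert_eq!(kv["window_start"].parse::<i64>().unwrap(), 1_700_000_000);
    assert_eq!(kv["sort_fields"], "tenant_id,timestamp:desc");
    println!("kv roundtrip ok");
}
```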
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(31): PostgreSQL migration 27 + compaction columns in stage/list/publish
Add compaction metadata to the PostgreSQL metastore:
Migration 27:
- 6 new columns: window_start, window_duration_secs, sort_fields,
num_merge_ops, row_keys, zonemap_regexes
- Partial index idx_metrics_splits_compaction_scope on
(index_uid, sort_fields, window_start) WHERE split_state = 'Published'
stage_metrics_splits:
- INSERT extended from 15 to 21 bind parameters for compaction columns
- ON CONFLICT SET updates all compaction columns
list_metrics_splits:
- PgMetricsSplit construction includes compaction fields (defaults from JSON)
Also fixes pre-existing compilation errors on upstream-10b-parquet-actors:
- Missing StageMetricsSplitsRequestExt import
- index_id vs index_uid type mismatches in publish/mark/delete
- IndexUid binding (to_string() for sqlx)
- ListMetricsSplitsResponseExt trait disambiguation
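The shape of migration 27 can be sketched from the column and index names above. The table name, column types, and DDL below are assumptions assembled from this commit message, not the actual migration file.

```rust
// Hypothetical sketch of migration 27; real types/DDL live in the
// migration file. Table name `metrics_splits` is inferred from the
// index name idx_metrics_splits_compaction_scope.
const MIGRATION_27_SKETCH: &str = r#"
ALTER TABLE metrics_splits
    ADD COLUMN window_start BIGINT,
    ADD COLUMN window_duration_secs INTEGER,
    ADD COLUMN sort_fields TEXT,
    ADD COLUMN num_merge_ops INTEGER,
    ADD COLUMN row_keys TEXT,
    ADD COLUMN zonemap_regexes JSONB;

CREATE INDEX idx_metrics_splits_compaction_scope
    ON metrics_splits (index_uid, sort_fields, window_start)
    WHERE split_state = 'Published';
"#;

fn main() {
    // Sanity-check that the sketch names all 6 compaction columns.
    for col in ["window_start", "window_duration_secs", "sort_fields",
                "num_merge_ops", "row_keys", "zonemap_regexes"] {
        assert!(MIGRATION_27_SKETCH.contains(col));
    }
    println!("all compaction columns present");
}
```

The partial index (`WHERE split_state = 'Published'`) keeps the compaction-scope index small: only published splits are candidates for compaction queries.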
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(31): close port gaps — split_writer metadata, compaction scope, publish validation
Close critical gaps identified during port review:
split_writer.rs:
- Store table_config on ParquetSplitWriter (not just pass-through)
- Compute window_start from batch time range using table_config.window_duration_secs
- Populate sort_fields, window_duration_secs, parquet_files on metadata before write
- Call write_to_file_with_metadata(Some(&metadata)) to embed KV metadata in Parquet
- Update size_bytes after write completes
metastore/mod.rs:
- Add window_start and sort_fields fields to ListMetricsSplitsQuery
- Add with_compaction_scope() builder method
metastore/postgres/metastore.rs:
- Add compaction scope filters (AND window_start = $N, AND sort_fields = $N) to list query
- Add replaced_split_ids count verification in publish_metrics_splits
- Bind compaction scope query parameters
ingest/config.rs:
- Add table_config: TableConfig field to ParquetIngestConfig
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(31): final gap fixes — file-backed scope filter, META-07 test, dead code removal
- file_backed_index/mod.rs: Add window_start and sort_fields filtering
to metrics_split_matches_query() for compaction scope queries
- writer.rs: Add test_meta07_self_describing_parquet_roundtrip test
(writes compaction metadata to Parquet, reads back from cold file,
verifies all fields roundtrip correctly)
- fields.rs: Remove dead sort_order() method (replaced by TableConfig)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(31): correct postgres types for window_duration_secs and zonemap_regexes
Gap 1: Change window_duration_secs from i32 to Option<i32> in both
PgMetricsSplit and InsertableMetricsSplit. Pre-Phase-31 splits now
correctly map 0 → NULL in PostgreSQL, enabling Phase 32 compaction
queries to use `WHERE window_duration_secs IS NOT NULL` instead of
the fragile `WHERE window_duration_secs > 0`.
Gap 2: Change zonemap_regexes from String to serde_json::Value in
both structs. This maps directly to JSONB in sqlx, avoiding ambiguity
when PostgreSQL JSONB operators are used in Phase 34/35 zonemap pruning.
Gap 3: Add two missing tests:
- test_insertable_from_metadata_with_compaction_fields: verifies all 6
compaction fields round-trip through InsertableMetricsSplit
- test_insertable_from_metadata_pre_phase31_defaults: verifies pre-Phase-31
metadata produces window_duration_secs: None, zonemap_regexes: json!({})
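The Gap-1 mapping can be sketched as a small conversion; the function name below is illustrative, not the real API, but the 0 → NULL behavior is the one described above.

```rust
// Sketch of the Gap-1 mapping: pre-Phase-31 metadata stores
// window_duration_secs = 0, which should surface as NULL (None) in
// PostgreSQL so Phase 32 can filter with IS NOT NULL.
fn window_duration_for_db(window_duration_secs: i32) -> Option<i32> {
    if window_duration_secs > 0 {
        Some(window_duration_secs)
    } else {
        None // pre-Phase-31 split: no window assigned → NULL
    }
}

fn main() {
    assert_eq!(window_duration_for_db(0), None);          // legacy split → NULL
    assert_eq!(window_duration_for_db(3_600), Some(3_600)); // Phase-31 split
    println!("0 maps to NULL");
}
```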
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* style: rustfmt
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test(31): add metrics split test suite to shared metastore_test_suite! macro
11 tests covering the full metrics split lifecycle:
- stage (happy path + non-existent index error)
- stage upsert (ON CONFLICT update)
- list by state, time range, metric name, compaction scope
- publish (happy path + non-existent split error)
- mark for deletion
- delete (happy path + idempotent non-existent)
Tests are generic and run against both file-backed and PostgreSQL backends.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(31): read compaction columns in list_metrics_splits, fix cleanup_index FK
* fix(31): correct error types for non-existent metrics splits
- publish_metrics_splits: return NotFound (not FailedPrecondition) when
staged splits don't exist
- delete_metrics_splits: succeed silently (idempotent) for non-existent
splits instead of returning FailedPrecondition
- Tests now assert the correct error types on both backends
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* style: rustfmt metastore tests and postgres
* fix(31): address PR review — align metrics_splits with splits table
- Migration 27: add maturity_timestamp, delete_opstamp, node_id columns
and publish_timestamp trigger to match the splits table (Paul's review)
- ListMetricsSplitsQuery: adopt FilterRange<i64> for time_range (matching
log-side pattern), single time_range field for both read and compaction
paths, add node_id/delete_opstamp/update_timestamp/create_timestamp/
mature filters to close gaps with ListSplitsQuery
- Use SplitState enum instead of stringly-typed Vec<String> for split_states
- StoredMetricsSplit: add create_timestamp, node_id, delete_opstamp,
maturity_timestamp so file-backed metastore can filter on them locally
- File-backed filter: use FilterRange::overlaps_with() for time range and
window intersection, apply all new filters matching log-side predicate
- Postgres: intersection semantics for window queries, FilterRange-based
SQL generation for all range filters
- Fix InsertableMetricsSplit.window_duration_secs from Option<i32> to i32
- Rename two-letter variables (ws, sf, dt) throughout
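The `FilterRange` overlap semantics adopted above can be sketched with a minimal stand-in. The real type lives in the metastore crate and its bound conventions (inclusive vs. exclusive) may differ; this sketch assumes half-open `[start, end)` intervals on both sides.

```rust
// Minimal stand-in for FilterRange<i64>::overlaps_with(), showing the
// interval-intersection semantics used for time-range and window filters.
#[derive(Clone, Copy)]
struct FilterRangeSketch {
    start: Option<i64>, // inclusive; None = unbounded
    end: Option<i64>,   // exclusive; None = unbounded
}

impl FilterRangeSketch {
    fn overlaps_with(&self, other_start: i64, other_end: i64) -> bool {
        // [a, b) and [c, d) overlap iff a < d && c < b.
        let starts_before_query_end = self.end.map_or(true, |end| other_start < end);
        let ends_after_query_start = self.start.map_or(true, |start| other_end > start);
        starts_before_query_end && ends_after_query_start
    }
}

fn main() {
    let query = FilterRangeSketch { start: Some(100), end: Some(200) };
    assert!(query.overlaps_with(150, 300));  // partial overlap
    assert!(!query.overlaps_with(200, 400)); // touches at exclusive end: no overlap
    let unbounded = FilterRangeSketch { start: None, end: None };
    assert!(unbounded.overlaps_with(0, 1));  // unbounded matches everything
    println!("overlap semantics ok");
}
```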
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* style: fix rustfmt nightly formatting
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(31): add shared invariants module to quickwit-dst
Extract duplicated invariant logic into a shared `invariants/` module
within `quickwit-dst`. This is the "single source of truth" layer in
the verification pyramid — used by stateright models, production
debug_assert checks, and (future) Datadog metrics emission.
Key changes:
- `invariants/registry.rs`: InvariantId enum (20 variants) with Display
- `invariants/window.rs`: shared window_start_secs(), is_valid_window_duration()
- `invariants/sort.rs`: generic compare_with_null_ordering() for SS-2
- `invariants/check.rs`: check_invariant! macro wrapping debug_assert
- stateright gated behind `model-checking` feature (optional dep)
- quickwit-parquet-engine uses shared functions and check_invariant!
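The shared window helpers named above can be sketched as follows; the signatures are assumptions based on the function names in this commit message, not the actual `invariants/window.rs` API.

```rust
// Sketch of window_start_secs() and is_valid_window_duration():
// align a timestamp down to its window boundary.
fn window_start_secs(timestamp_secs: i64, window_duration_secs: i64) -> i64 {
    // rem_euclid so negative timestamps still round toward -infinity.
    timestamp_secs - timestamp_secs.rem_euclid(window_duration_secs)
}

fn is_valid_window_duration(window_duration_secs: i64) -> bool {
    window_duration_secs > 0
}

fn main() {
    assert!(is_valid_window_duration(3_600));
    assert!(!is_valid_window_duration(0));
    assert_eq!(window_start_secs(3_725, 3_600), 3_600);
    assert_eq!(window_start_secs(-1, 3_600), -3_600);
    println!("window helpers ok");
}
```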
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(31): check invariants in release builds, add pluggable recorder
The check_invariant! macro now always evaluates the condition — not just
in debug builds. This implements Layer 4 (Production) of the verification
stack: invariant checks run in release, with results forwarded to a
pluggable InvariantRecorder for Datadog metrics emission.
- Debug builds: panic on violation (debug_assert, Layer 3)
- All builds: evaluate condition, call recorder (Layer 4)
- set_invariant_recorder() wires up statsd at process startup
- No recorder registered = no-op (single OnceLock load)
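The Layer-3/Layer-4 split above can be sketched with std's `OnceLock`. The trait and macro names mirror this commit message, but the shapes below are illustrative, not the real `quickwit-dst` API.

```rust
use std::sync::OnceLock;

// Sketch: always evaluate the condition (Layer 4, release builds),
// forward the result to a pluggable recorder, and panic only in
// debug builds (Layer 3).
trait InvariantRecorder: Send + Sync {
    fn record(&self, invariant: &str, violated: bool);
}

static RECORDER: OnceLock<Box<dyn InvariantRecorder>> = OnceLock::new();

fn set_invariant_recorder(recorder: Box<dyn InvariantRecorder>) {
    let _ = RECORDER.set(recorder); // first caller wins
}

macro_rules! check_invariant {
    ($id:expr, $cond:expr) => {{
        let ok = $cond; // evaluated in every build profile
        if let Some(recorder) = RECORDER.get() {
            recorder.record($id, !ok); // no recorder registered = no-op
        }
        debug_assert!(ok, "invariant violated: {}", $id);
        ok
    }};
}

struct PrintRecorder;
impl InvariantRecorder for PrintRecorder {
    fn record(&self, invariant: &str, violated: bool) {
        println!("invariant={invariant} violated={violated}");
    }
}

fn main() {
    set_invariant_recorder(Box::new(PrintRecorder));
    assert!(check_invariant!("SS-1", 1 + 1 == 2));
}
```

The single `OnceLock::get()` load on the hot path is what keeps the unregistered case essentially free.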
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(31): wire invariant recorder to DogStatsD metrics
Emit cloudprem.pomsky.invariant.checked and .violated counters with
invariant label via the metrics crate / DogStatsD exporter at process
startup, completing Layer 4 of the verification stack.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: license headers + cfg(not(test)) for quickwit-dst and quickwit-cli
* chore: regenerate third-party license file
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* style: fix rustfmt nightly formatting for quickwit-dst and quickwit-parquet-engine
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: add CLAUDE.md and docs/internals architecture documentation
Ports CLAUDE.md (development guide, coding standards, known pitfalls) and
the full docs/internals tree including ADRs, gap analyses, TLA+ specs,
verification guides, style references, and compaction architecture.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: split CLAUDE.md into repo context + opt-in /sesh-mode skill
- Move verification-first workflow (TLA+, DST, formal specs) to /sesh-mode skill
- Keep repo knowledge in CLAUDE.md (pitfalls, reliability rules, testing, docker, commands)
- Remove Crate Map (derivable from filesystem)
- Remove Coding Style bullet summary (CODE_STYLE.md is linked)
- Fix relative links in SKILL.md for .claude/skills/sesh-mode/ path
Co-Authored-By: Claude <noreply@anthropic.com>
* docs: add machete, cargo doc, and fmt details to CI checklist in CLAUDE.md
* review: make parquet_file singular, fix proto doc link and metastore accessor
* style: fix rustfmt nightly comment wrapping in split metadata
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: use plain code span for proto reference to avoid broken rustdoc link
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Update quickwit/quickwit-parquet-engine/src/table_config.rs
Co-authored-by: Matthew Kim <matthew.kim@datadoghq.com>
* Update quickwit/quickwit-parquet-engine/src/table_config.rs
Co-authored-by: Matthew Kim <matthew.kim@datadoghq.com>
* style: rustfmt long match arm in default_sort_fields
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: make parquet_file field backward-compatible in MetricsSplitMetadata
Pre-existing splits were serialized before the parquet_file field was
added, so their JSON doesn't contain it. Adding #[serde(default)]
makes deserialization fall back to empty string for old splits.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: handle empty-column batches in accumulator flush
When the commit timeout fires and the accumulator contains only
zero-column batches, union_fields is empty and concat_batches fails
with "must either specify a row count or at least one column".
Now flush_internal treats empty union_fields the same as empty
pending_batches — resets state and returns None.
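The guard described above can be sketched with simplified stand-ins for the Arrow schema/batch types; field and method names are illustrative, not the real accumulator API.

```rust
// Sketch of the flush guard: if the accumulated union schema has no
// fields, concat_batches would fail ("must either specify a row count
// or at least one column"), so flush resets state and returns None,
// exactly as it already does for an empty pending_batches.
#[derive(Default)]
struct AccumulatorSketch {
    union_fields: Vec<String>,
    pending_batches: Vec<Vec<String>>,
}

impl AccumulatorSketch {
    fn flush_internal(&mut self) -> Option<Vec<String>> {
        // Empty union schema behaves like no pending batches at all.
        if self.pending_batches.is_empty() || self.union_fields.is_empty() {
            self.pending_batches.clear();
            self.union_fields.clear();
            return None;
        }
        Some(self.pending_batches.drain(..).flatten().collect())
    }
}

fn main() {
    let mut acc = AccumulatorSketch {
        union_fields: vec![],              // only zero-column batches seen
        pending_batches: vec![vec![]],     // commit timeout fired anyway
    };
    assert_eq!(acc.flush_internal(), None);
    assert!(acc.pending_batches.is_empty()); // state was reset
    println!("empty-column flush handled");
}
```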
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* style: rustfmt check_invariant macro argument
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* revert: move code changes to separate PR
Reverts c8bf8d750, cafcac5e4, a088f5319 — these are code changes
(delete_metrics_splits error handling, doc comment tweaks) that
don't belong in a docs-only PR. They will land in a separate PR.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* remove ADR-004 from upstream PR — moving to DataDog/pomsky
This ADR contains company-specific information and should live
in the private fork, not in the upstream quickwit-oss repo.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: rebrand for upstream — remove Datadog/Pomsky/Quickhouse references
- Rewrite CLAUDE.md as generic Quickwit AI development guide
- Replace Quickhouse-Pomsky -> Quickwit branding across all docs
- Replace "Datadog" observability references with generic
"production observability" language
- Remove "Husky (Datadog)" qualifier from gap docs (keep Husky
citations — the blog post is public)
- Generalize internal knowledge (query rate numbers, product-specific
lateness guarantees)
- Remove PomChi reference, private Google Doc link
- Add docs/internals/UPSTREAM-CANDIDATES.md for pomsky tracking
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: remove ClickHouse references, track aspirational items
- Remove all ClickHouse/ClickStack references from gap docs and ADRs
(keep Prometheus, Mimir, InfluxDB, Husky as prior art)
- Restore gap-005 Option C (compaction-time dedup) without ClickHouse citation
- Mark /sesh-mode reference in CLAUDE.md as aspirational
- Add aspirational items section to UPSTREAM-CANDIDATES.md tracking
items described in docs but not yet implemented (TLA+ specs, DST,
Kani, Bloodhound, performance baselines, benchmark binaries)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: fix aspirational items — TLA+ specs and Stateright models exist
UPSTREAM-CANDIDATES.md incorrectly stated TLA+ specs and Stateright
models don't exist. They do (contributed in #6246): ParquetDataModel.tla,
SortSchema.tla, TimeWindowedCompaction.tla, plus quickwit-dst invariants
and Stateright model tests. Updated to accurately reflect that the
remaining aspirational piece is the simulation infrastructure (SimClock,
FaultInjector, etc.).
Also removed the /sesh-mode aspirational entry — it's actively being
used and the underlying specs/models are real.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: add .planning to .gitignore
Prevents GSD planning artifacts from being committed to the repository.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* revert: remove pomsky-specific Makefile changes
Reverts test env vars (CP_ENABLE_REVERSE_CONNECTION) and
load-cloudprem-ui target — these are pomsky-specific and
don't belong in upstream.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: add pitfall rule against silently swallowing unexpected state
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Matthew Kim <matthew.kim@datadoghq.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Verdonk Lucas <lucas.verdonk@datadoghq.com>
George Talbot committed a92e7447508663a58e80b3efac133bfbe9fe513d
Parent: dc72e2f
Committed by GitHub <noreply@github.com> on 4/13/2026, 11:16:11 PM