docs: add CLAUDE.md, architecture docs, and /sesh-mode skill (#6247)
* feat: replace fixed MetricDataPoint fields with dynamic tag HashMap
* feat: replace ParquetField enum with constants and dynamic validation
* feat: derive sort order and bloom filters from batch schema
* feat: union schema accumulation and schema-agnostic ingest validation
* feat: dynamic column lookup in split writer
* feat: remove ParquetSchema dependency from indexing actors
* refactor: deduplicate test batch helpers
* style: fix lint warnings
* feat(31): sort schema foundation — proto, parser, display, validation, window, TableConfig
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: rustdoc link errors — use backticks for private items
* feat(31): compaction metadata types — extend split metadata, postgres model, field lookup
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(31): wire TableConfig into sort path, add compaction KV metadata
Wire TableConfig-driven sort order into ParquetWriter and add
self-describing Parquet file metadata for compaction:
- ParquetWriter::new() takes &TableConfig, resolves sort fields at
construction via parse_sort_fields() + ParquetField::from_name()
- sort_batch() uses resolved fields with per-column direction (ASC/DESC)
- SS-1 debug_assert verification: re-sort and check identity permutation
- build_compaction_key_value_metadata(): embeds sort_fields, window_start,
window_duration, num_merge_ops, row_keys (base64) in Parquet kv_metadata
- SS-5 verify_ss5_kv_consistency(): kv_metadata matches source struct
- write_to_file_with_metadata() replaces write_to_file()
- prepare_write() shared method for bytes and file paths
- ParquetWriterConfig gains to_writer_properties_with_metadata()
- ParquetSplitWriter passes TableConfig through
- All callers in quickwit-indexing updated with TableConfig::default()
- 23 storage tests pass including META-07 self-describing roundtrip
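The self-describing key-value metadata above can be sketched as a plain string map; the helper name and key names below mirror this commit message, but the signature and the exact serialization are assumptions, not the real `build_compaction_key_value_metadata()` API.

```rust
use std::collections::HashMap;

// Illustrative sketch of the compaction metadata embedded in each
// Parquet file's key-value store. Keys follow the commit message
// (sort_fields, window_start, window_duration, num_merge_ops);
// row_keys (base64) is omitted to keep the sketch std-only.
fn build_compaction_kv(
    sort_fields: &str,
    window_start: i64,
    window_duration_secs: i64,
    num_merge_ops: u32,
) -> HashMap<String, String> {
    HashMap::from([
        ("sort_fields".to_string(), sort_fields.to_string()),
        ("window_start".to_string(), window_start.to_string()),
        ("window_duration_secs".to_string(), window_duration_secs.to_string()),
        ("num_merge_ops".to_string(), num_merge_ops.to_string()),
    ])
}

fn main() {
    let kv = build_compaction_kv("tenant_id,timestamp:desc", 1_700_000_000, 3_600, 0);
    // The SS-5 consistency check compares this map back against the
    // source struct after a read; here we just verify the roundtrip.
    assert_eq!(kv["window_start"].parse::<i64>().unwrap(), 1_700_000_000);
    assert_eq!(kv["sort_fields"], "tenant_id,timestamp:desc");
    println!("kv roundtrip ok");
}
```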
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(31): PostgreSQL migration 27 + compaction columns in stage/list/publish
Add compaction metadata to the PostgreSQL metastore:
Migration 27:
- 6 new columns: window_start, window_duration_secs, sort_fields,
num_merge_ops, row_keys, zonemap_regexes
- Partial index idx_metrics_splits_compaction_scope on
(index_uid, sort_fields, window_start) WHERE split_state = 'Published'
stage_metrics_splits:
- INSERT extended from 15 to 21 bind parameters for compaction columns
- ON CONFLICT SET updates all compaction columns
list_metrics_splits:
- PgMetricsSplit construction includes compaction fields (defaults from JSON)
Also fixes pre-existing compilation errors on upstream-10b-parquet-actors:
- Missing StageMetricsSplitsRequestExt import
- index_id vs index_uid type mismatches in publish/mark/delete
- IndexUid binding (to_string() for sqlx)
- ListMetricsSplitsResponseExt trait disambiguation
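The shape of migration 27 can be sketched from the column and index names above. The table name, column types, and DDL below are assumptions assembled from this commit message, not the actual migration file.

```rust
// Hypothetical sketch of migration 27; real types/DDL live in the
// migration file. Table name `metrics_splits` is inferred from the
// index name idx_metrics_splits_compaction_scope.
const MIGRATION_27_SKETCH: &str = r#"
ALTER TABLE metrics_splits
    ADD COLUMN window_start BIGINT,
    ADD COLUMN window_duration_secs INTEGER,
    ADD COLUMN sort_fields TEXT,
    ADD COLUMN num_merge_ops INTEGER,
    ADD COLUMN row_keys TEXT,
    ADD COLUMN zonemap_regexes JSONB;

CREATE INDEX idx_metrics_splits_compaction_scope
    ON metrics_splits (index_uid, sort_fields, window_start)
    WHERE split_state = 'Published';
"#;

fn main() {
    // Sanity-check that the sketch names all 6 compaction columns.
    for col in ["window_start", "window_duration_secs", "sort_fields",
                "num_merge_ops", "row_keys", "zonemap_regexes"] {
        assert!(MIGRATION_27_SKETCH.contains(col));
    }
    println!("all compaction columns present");
}
```

The partial index (`WHERE split_state = 'Published'`) keeps the compaction-scope index small: only published splits are candidates for compaction queries.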
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(31): close port gaps — split_writer metadata, compaction scope, publish validation
Close critical gaps identified during port review:
split_writer.rs:
- Store table_config on ParquetSplitWriter (not just pass-through)
- Compute window_start from batch time range using table_config.window_duration_secs
- Populate sort_fields, window_duration_secs, parquet_files on metadata before write
- Call write_to_file_with_metadata(Some(&metadata)) to embed KV metadata in Parquet
- Update size_bytes after write completes
metastore/mod.rs:
- Add window_start and sort_fields fields to ListMetricsSplitsQuery
- Add with_compaction_scope() builder method
metastore/postgres/metastore.rs:
- Add compaction scope filters (AND window_start = $N, AND sort_fields = $N) to list query
- Add replaced_split_ids count verification in publish_metrics_splits
- Bind compaction scope query parameters
ingest/config.rs:
- Add table_config: TableConfig field to ParquetIngestConfig
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(31): final gap fixes — file-backed scope filter, META-07 test, dead code removal
- file_backed_index/mod.rs: Add window_start and sort_fields filtering
to metrics_split_matches_query() for compaction scope queries
- writer.rs: Add test_meta07_self_describing_parquet_roundtrip test
(writes compaction metadata to Parquet, reads back from cold file,
verifies all fields roundtrip correctly)
- fields.rs: Remove dead sort_order() method (replaced by TableConfig)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(31): correct postgres types for window_duration_secs and zonemap_regexes
Gap 1: Change window_duration_secs from i32 to Option<i32> in both
PgMetricsSplit and InsertableMetricsSplit. Pre-Phase-31 splits now
correctly map 0 → NULL in PostgreSQL, enabling Phase 32 compaction
queries to use `WHERE window_duration_secs IS NOT NULL` instead of
the fragile `WHERE window_duration_secs > 0`.
Gap 2: Change zonemap_regexes from String to serde_json::Value in
both structs. This maps directly to JSONB in sqlx, avoiding ambiguity
when PostgreSQL JSONB operators are used in Phase 34/35 zonemap pruning.
Gap 3: Add two missing tests:
- test_insertable_from_metadata_with_compaction_fields: verifies all 6
compaction fields round-trip through InsertableMetricsSplit
- test_insertable_from_metadata_pre_phase31_defaults: verifies pre-Phase-31
metadata produces window_duration_secs: None, zonemap_regexes: json!({})
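The Gap-1 mapping can be sketched as a small conversion; the function name below is illustrative, not the real API, but the 0 → NULL behavior is the one described above.

```rust
// Sketch of the Gap-1 mapping: pre-Phase-31 metadata stores
// window_duration_secs = 0, which should surface as NULL (None) in
// PostgreSQL so Phase 32 can filter with IS NOT NULL.
fn window_duration_for_db(window_duration_secs: i32) -> Option<i32> {
    if window_duration_secs > 0 {
        Some(window_duration_secs)
    } else {
        None // pre-Phase-31 split: no window assigned → NULL
    }
}

fn main() {
    assert_eq!(window_duration_for_db(0), None);          // legacy split → NULL
    assert_eq!(window_duration_for_db(3_600), Some(3_600)); // Phase-31 split
    println!("0 maps to NULL");
}
```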
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* style: rustfmt
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test(31): add metrics split test suite to shared metastore_test_suite! macro
11 tests covering the full metrics split lifecycle:
- stage (happy path + non-existent index error)
- stage upsert (ON CONFLICT update)
- list by state, time range, metric name, compaction scope
- publish (happy path + non-existent split error)
- mark for deletion
- delete (happy path + idempotent non-existent)
Tests are generic and run against both file-backed and PostgreSQL backends.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(31): read compaction columns in list_metrics_splits, fix cleanup_index FK
* fix(31): correct error types for non-existent metrics splits
- publish_metrics_splits: return NotFound (not FailedPrecondition) when
staged splits don't exist
- delete_metrics_splits: succeed silently (idempotent) for non-existent
splits instead of returning FailedPrecondition
- Tests now assert the correct error types on both backends
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* style: rustfmt metastore tests and postgres
* fix(31): address PR review — align metrics_splits with splits table
- Migration 27: add maturity_timestamp, delete_opstamp, node_id columns
and publish_timestamp trigger to match the splits table (Paul's review)
- ListMetricsSplitsQuery: adopt FilterRange<i64> for time_range (matching
log-side pattern), single time_range field for both read and compaction
paths, add node_id/delete_opstamp/update_timestamp/create_timestamp/
mature filters to close gaps with ListSplitsQuery
- Use SplitState enum instead of stringly-typed Vec<String> for split_states
- StoredMetricsSplit: add create_timestamp, node_id, delete_opstamp,
maturity_timestamp so file-backed metastore can filter on them locally
- File-backed filter: use FilterRange::overlaps_with() for time range and
window intersection, apply all new filters matching log-side predicate
- Postgres: intersection semantics for window queries, FilterRange-based
SQL generation for all range filters
- Fix InsertableMetricsSplit.window_duration_secs from Option<i32> to i32
- Rename two-letter variables (ws, sf, dt) throughout
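The `FilterRange` overlap semantics adopted above can be sketched with a minimal stand-in. The real type lives in the metastore crate and its bound conventions (inclusive vs. exclusive) may differ; this sketch assumes half-open `[start, end)` intervals on both sides.

```rust
// Minimal stand-in for FilterRange<i64>::overlaps_with(), showing the
// interval-intersection semantics used for time-range and window filters.
#[derive(Clone, Copy)]
struct FilterRangeSketch {
    start: Option<i64>, // inclusive; None = unbounded
    end: Option<i64>,   // exclusive; None = unbounded
}

impl FilterRangeSketch {
    fn overlaps_with(&self, other_start: i64, other_end: i64) -> bool {
        // [a, b) and [c, d) overlap iff a < d && c < b.
        let starts_before_query_end = self.end.map_or(true, |end| other_start < end);
        let ends_after_query_start = self.start.map_or(true, |start| other_end > start);
        starts_before_query_end && ends_after_query_start
    }
}

fn main() {
    let query = FilterRangeSketch { start: Some(100), end: Some(200) };
    assert!(query.overlaps_with(150, 300));  // partial overlap
    assert!(!query.overlaps_with(200, 400)); // touches at exclusive end: no overlap
    let unbounded = FilterRangeSketch { start: None, end: None };
    assert!(unbounded.overlaps_with(0, 1));  // unbounded matches everything
    println!("overlap semantics ok");
}
```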
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* style: fix rustfmt nightly formatting
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(31): add shared invariants module to quickwit-dst
Extract duplicated invariant logic into a shared `invariants/` module
within `quickwit-dst`. This is the "single source of truth" layer in
the verification pyramid — used by stateright models, production
debug_assert checks, and (future) Datadog metrics emission.
Key changes:
- `invariants/registry.rs`: InvariantId enum (20 variants) with Display
- `invariants/window.rs`: shared window_start_secs(), is_valid_window_duration()
- `invariants/sort.rs`: generic compare_with_null_ordering() for SS-2
- `invariants/check.rs`: check_invariant! macro wrapping debug_assert
- stateright gated behind `model-checking` feature (optional dep)
- quickwit-parquet-engine uses shared functions and check_invariant!
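The shared window helpers named above can be sketched as follows; the signatures are assumptions based on the function names in this commit message, not the actual `invariants/window.rs` API.

```rust
// Sketch of window_start_secs() and is_valid_window_duration():
// align a timestamp down to its window boundary.
fn window_start_secs(timestamp_secs: i64, window_duration_secs: i64) -> i64 {
    // rem_euclid so negative timestamps still round toward -infinity.
    timestamp_secs - timestamp_secs.rem_euclid(window_duration_secs)
}

fn is_valid_window_duration(window_duration_secs: i64) -> bool {
    window_duration_secs > 0
}

fn main() {
    assert!(is_valid_window_duration(3_600));
    assert!(!is_valid_window_duration(0));
    assert_eq!(window_start_secs(3_725, 3_600), 3_600);
    assert_eq!(window_start_secs(-1, 3_600), -3_600);
    println!("window helpers ok");
}
```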
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(31): check invariants in release builds, add pluggable recorder
The check_invariant! macro now always evaluates the condition — not just
in debug builds. This implements Layer 4 (Production) of the verification
stack: invariant checks run in release, with results forwarded to a
pluggable InvariantRecorder for Datadog metrics emission.
- Debug builds: panic on violation (debug_assert, Layer 3)
- All builds: evaluate condition, call recorder (Layer 4)
- set_invariant_recorder() wires up statsd at process startup
- No recorder registered = no-op (single OnceLock load)
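The Layer-3/Layer-4 split above can be sketched with std's `OnceLock`. The trait and macro names mirror this commit message, but the shapes below are illustrative, not the real `quickwit-dst` API.

```rust
use std::sync::OnceLock;

// Sketch: always evaluate the condition (Layer 4, release builds),
// forward the result to a pluggable recorder, and panic only in
// debug builds (Layer 3).
trait InvariantRecorder: Send + Sync {
    fn record(&self, invariant: &str, violated: bool);
}

static RECORDER: OnceLock<Box<dyn InvariantRecorder>> = OnceLock::new();

fn set_invariant_recorder(recorder: Box<dyn InvariantRecorder>) {
    let _ = RECORDER.set(recorder); // first caller wins
}

macro_rules! check_invariant {
    ($id:expr, $cond:expr) => {{
        let ok = $cond; // evaluated in every build profile
        if let Some(recorder) = RECORDER.get() {
            recorder.record($id, !ok); // no recorder registered = no-op
        }
        debug_assert!(ok, "invariant violated: {}", $id);
        ok
    }};
}

struct PrintRecorder;
impl InvariantRecorder for PrintRecorder {
    fn record(&self, invariant: &str, violated: bool) {
        println!("invariant={invariant} violated={violated}");
    }
}

fn main() {
    set_invariant_recorder(Box::new(PrintRecorder));
    assert!(check_invariant!("SS-1", 1 + 1 == 2));
}
```

The single `OnceLock::get()` load on the hot path is what keeps the unregistered case essentially free.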
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(31): wire invariant recorder to DogStatsD metrics
Emit cloudprem.pomsky.invariant.checked and .violated counters with
invariant label via the metrics crate / DogStatsD exporter at process
startup, completing Layer 4 of the verification stack.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: license headers + cfg(not(test)) for quickwit-dst and quickwit-cli
* chore: regenerate third-party license file
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* style: fix rustfmt nightly formatting for quickwit-dst and quickwit-parquet-engine
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: add CLAUDE.md and docs/internals architecture documentation
Ports CLAUDE.md (development guide, coding standards, known pitfalls) and
the full docs/internals tree including ADRs, gap analyses, TLA+ specs,
verification guides, style references, and compaction architecture.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: split CLAUDE.md into repo context + opt-in /sesh-mode skill
- Move verification-first workflow (TLA+, DST, formal specs) to /sesh-mode skill
- Keep repo knowledge in CLAUDE.md (pitfalls, reliability rules, testing, docker, commands)
- Remove Crate Map (derivable from filesystem)
- Remove Coding Style bullet summary (CODE_STYLE.md is linked)
- Fix relative links in SKILL.md for .claude/skills/sesh-mode/ path
Co-Authored-By: Claude <noreply@anthropic.com>
* docs: add machete, cargo doc, and fmt details to CI checklist in CLAUDE.md
* review: make parquet_file singular, fix proto doc link and metastore accessor
* style: fix rustfmt nightly comment wrapping in split metadata
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: use plain code span for proto reference to avoid broken rustdoc link
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Update quickwit/quickwit-parquet-engine/src/table_config.rs
Co-authored-by: Matthew Kim <matthew.kim@datadoghq.com>
* Update quickwit/quickwit-parquet-engine/src/table_config.rs
Co-authored-by: Matthew Kim <matthew.kim@datadoghq.com>
* style: rustfmt long match arm in default_sort_fields
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: make parquet_file field backward-compatible in MetricsSplitMetadata
Pre-existing splits were serialized before the parquet_file field was
added, so their JSON doesn't contain it. Adding #[serde(default)]
makes deserialization fall back to empty string for old splits.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: handle empty-column batches in accumulator flush
When the commit timeout fires and the accumulator contains only
zero-column batches, union_fields is empty and concat_batches fails
with "must either specify a row count or at least one column".
Now flush_internal treats empty union_fields the same as empty
pending_batches — resets state and returns None.
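The guard described above can be sketched with simplified stand-ins for the Arrow schema/batch types; field and method names are illustrative, not the real accumulator API.

```rust
// Sketch of the flush guard: if the accumulated union schema has no
// fields, concat_batches would fail ("must either specify a row count
// or at least one column"), so flush resets state and returns None,
// exactly as it already does for an empty pending_batches.
#[derive(Default)]
struct AccumulatorSketch {
    union_fields: Vec<String>,
    pending_batches: Vec<Vec<String>>,
}

impl AccumulatorSketch {
    fn flush_internal(&mut self) -> Option<Vec<String>> {
        // Empty union schema behaves like no pending batches at all.
        if self.pending_batches.is_empty() || self.union_fields.is_empty() {
            self.pending_batches.clear();
            self.union_fields.clear();
            return None;
        }
        Some(self.pending_batches.drain(..).flatten().collect())
    }
}

fn main() {
    let mut acc = AccumulatorSketch {
        union_fields: vec![],              // only zero-column batches seen
        pending_batches: vec![vec![]],     // commit timeout fired anyway
    };
    assert_eq!(acc.flush_internal(), None);
    assert!(acc.pending_batches.is_empty()); // state was reset
    println!("empty-column flush handled");
}
```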
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* style: rustfmt check_invariant macro argument
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* revert: move code changes to separate PR
Reverts c8bf8d750, cafcac5e4, a088f5319 — these are code changes
(delete_metrics_splits error handling, doc comment tweaks) that
don't belong in a docs-only PR. They will land in a separate PR.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* remove ADR-004 from upstream PR — moving to DataDog/pomsky
This ADR contains company-specific information and should live
in the private fork, not in the upstream quickwit-oss repo.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: rebrand for upstream — remove Datadog/Pomsky/Quickhouse references
- Rewrite CLAUDE.md as generic Quickwit AI development guide
- Replace Quickhouse-Pomsky -> Quickwit branding across all docs
- Replace "Datadog" observability references with generic
"production observability" language
- Remove "Husky (Datadog)" qualifier from gap docs (keep Husky
citations — the blog post is public)
- Generalize internal knowledge (query rate numbers, product-specific
lateness guarantees)
- Remove PomChi reference, private Google Doc link
- Add docs/internals/UPSTREAM-CANDIDATES.md for pomsky tracking
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: remove ClickHouse references, track aspirational items
- Remove all ClickHouse/ClickStack references from gap docs and ADRs
(keep Prometheus, Mimir, InfluxDB, Husky as prior art)
- Restore gap-005 Option C (compaction-time dedup) without ClickHouse citation
- Mark /sesh-mode reference in CLAUDE.md as aspirational
- Add aspirational items section to UPSTREAM-CANDIDATES.md tracking
items described in docs but not yet implemented (TLA+ specs, DST,
Kani, Bloodhound, performance baselines, benchmark binaries)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: fix aspirational items — TLA+ specs and Stateright models exist
UPSTREAM-CANDIDATES.md incorrectly stated TLA+ specs and Stateright
models don't exist. They do (contributed in #6246): ParquetDataModel.tla,
SortSchema.tla, TimeWindowedCompaction.tla, plus quickwit-dst invariants
and Stateright model tests. Updated to accurately reflect that the
remaining aspirational piece is the simulation infrastructure (SimClock,
FaultInjector, etc.).
Also removed the /sesh-mode aspirational entry — it's actively being
used and the underlying specs/models are real.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: add .planning to .gitignore
Prevents GSD planning artifacts from being committed to the repository.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* revert: remove pomsky-specific Makefile changes
Reverts test env vars (CP_ENABLE_REVERSE_CONNECTION) and
load-cloudprem-ui target — these are pomsky-specific and
don't belong in upstream.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: add pitfall rule against silently swallowing unexpected state
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Matthew Kim <matthew.kim@datadoghq.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Verdonk Lucas <lucas.verdonk@datadoghq.com>
George Talbot committed a92e7447508663a58e80b3efac133bfbe9fe513d
Parent: dc72e2f
Committed by GitHub <noreply@github.com> on 4/13/2026, 11:16:11 PM