# Storage Layer Architecture This document defines the storage layer for sudocode, inspired by the beads project's dual-storage pattern: human-editable files (Markdown) + machine-optimized cache (SQLite) with JSONL as the git-committed source of truth. ## Architecture Overview ``` ┌─────────────────────────────────────────────────────────────┐ │ User Interaction │ │ (CLI commands, text editor, git operations) │ └────────────┬─────────────────────────────┬──────────────────┘ │ │ ▼ ▼ ┌──────────────────┐ ┌──────────────────┐ │ Markdown Files │ │ JSONL Files │ │ .sudocode/specs/│ │.sudocode/specs/ │ │ .sudocode/issues│ │.sudocode/issues/ │ │ │ │ │ │ ✓ Git-tracked │ │ ✓ Git-tracked │ │ ✓ Human-editable│ │ ✓ Source of │ │ ✓ Text diffs │ │ truth │ └─────────┬────────┘ └────────┬─────────┘ │ │ │ ┌──────────────────────┘ │ │ ▼ ▼ ┌──────────────────┐ │ SQLite Cache │ │ .sudocode/ │ │ cache.db │ │ │ │ ✗ Gitignored │ │ ✓ Fast queries │ │ ✓ Relationships │ └──────────────────┘ ``` ## Design Principles 1. **JSONL = Source of Truth**: Git-tracked, append-friendly, readable snapshots 2. **Markdown = Human Interface**: Primary editing surface, synced to JSONL 3. **SQLite = Query Cache**: Fast local operations, rebuilt from JSONL after git pull 4. **Auto-Sync**: Changes flow bidirectionally with debouncing 5. **Offline-First**: All operations work without network 6. **Git-Friendly**: Avoid binary files in commits, use text-based formats --- ## Storage Hierarchy ``` .sudocode/ ├── specs/ │ ├── auth-system.md # Markdown files (git-tracked) │ ├── database-design.md │ └── specs.jsonl # JSONL snapshot (git-tracked) ├── issues/ │ ├── issue-001.md # Markdown files (git-tracked) │ ├── issue-002.md │ └── issues.jsonl # JSONL snapshot (git-tracked) ├── meta.json # ID counters, config (git-tracked) ├── cache.db # SQLite cache (gitignored) └── .gitignore # Ignore *.db ``` --- ## SQLite Schema ### Core Tables #### 1. Specs Table ```sql CREATE TABLE specs ( id TEXT PRIMARY KEY, title TEXT NOT NULL CHECK(length(title) <= 500), file_path TEXT NOT NULL, -- Relative path to markdown file content TEXT NOT NULL DEFAULT '', -- Markdown content (no frontmatter) type TEXT NOT NULL DEFAULT 'feature', -- architecture|api|database|feature|research status TEXT NOT NULL DEFAULT 'draft', -- draft|review|approved|deprecated priority INTEGER NOT NULL DEFAULT 2 CHECK(priority >= 0 AND priority <= 4), created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP, updated_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP, created_by TEXT NOT NULL, updated_by TEXT NOT NULL, parent_id TEXT, -- Optional parent spec FOREIGN KEY (parent_id) REFERENCES specs(id) ON DELETE SET NULL ); CREATE INDEX idx_specs_status ON specs(status); CREATE INDEX idx_specs_type ON specs(type); CREATE INDEX idx_specs_priority ON specs(priority); CREATE INDEX idx_specs_parent ON specs(parent_id); CREATE INDEX idx_specs_created_at ON specs(created_at); CREATE INDEX idx_specs_updated_at ON specs(updated_at); ``` #### 2. Issues Table ```sql CREATE TABLE issues ( id TEXT PRIMARY KEY, title TEXT NOT NULL CHECK(length(title) <= 500), description TEXT NOT NULL DEFAULT '', content TEXT NOT NULL DEFAULT '', -- Full markdown content (no frontmatter) status TEXT NOT NULL DEFAULT 'open', -- open|in_progress|blocked|needs_review|closed priority INTEGER NOT NULL DEFAULT 2 CHECK(priority >= 0 AND priority <= 4), issue_type TEXT NOT NULL DEFAULT 'task', -- bug|feature|task|epic|chore assignee TEXT, estimated_minutes INTEGER, created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP, updated_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP, closed_at DATETIME, created_by TEXT NOT NULL, parent_id TEXT, -- Optional parent issue FOREIGN KEY (parent_id) REFERENCES issues(id) ON DELETE SET NULL ); CREATE INDEX idx_issues_status ON issues(status); CREATE INDEX idx_issues_priority ON issues(priority); CREATE INDEX idx_issues_type ON issues(issue_type); CREATE INDEX idx_issues_assignee ON issues(assignee); CREATE INDEX idx_issues_parent ON issues(parent_id); CREATE INDEX idx_issues_created_at ON issues(created_at); CREATE INDEX idx_issues_updated_at ON issues(updated_at); CREATE INDEX idx_issues_closed_at ON issues(closed_at); ``` #### 3. Relationships Table (Polymorphic) Captures edges between any entities (spec→spec, issue→issue, spec→issue). ```sql CREATE TABLE relationships ( from_id TEXT NOT NULL, from_type TEXT NOT NULL, -- spec|issue to_id TEXT NOT NULL, to_type TEXT NOT NULL, -- spec|issue relationship_type TEXT NOT NULL, -- blocks|related|parent-child|discovered-from|implements created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP, created_by TEXT NOT NULL, metadata TEXT, -- Optional JSON metadata PRIMARY KEY (from_id, from_type, to_id, to_type, relationship_type) ); CREATE INDEX idx_rel_from ON relationships(from_id, from_type); CREATE INDEX idx_rel_to ON relationships(to_id, to_type); CREATE INDEX idx_rel_type ON relationships(relationship_type); CREATE INDEX idx_rel_created_at ON relationships(created_at); ``` **Note**: We use a polymorphic relationship table instead of separate tables for spec_deps and issue_deps. This allows: - Flexible linking (e.g., issue implements spec) - Unified graph queries - Easier extension to future entity types #### 4. Tags Table (Shared) ```sql CREATE TABLE tags ( entity_id TEXT NOT NULL, entity_type TEXT NOT NULL, -- spec|issue tag TEXT NOT NULL, PRIMARY KEY (entity_id, entity_type, tag) ); CREATE INDEX idx_tags_entity ON tags(entity_id, entity_type); CREATE INDEX idx_tags_tag ON tags(tag); ``` #### 5. Events Table (Audit Trail) ```sql CREATE TABLE events ( id INTEGER PRIMARY KEY AUTOINCREMENT, entity_id TEXT NOT NULL, entity_type TEXT NOT NULL, -- spec|issue event_type TEXT NOT NULL, -- created|updated|status_changed|relationship_added|etc actor TEXT NOT NULL, old_value TEXT, -- JSON snapshot before change new_value TEXT, -- JSON snapshot after change comment TEXT, created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP, git_commit_sha TEXT -- Optional: git commit that caused this change ); CREATE INDEX idx_events_entity ON events(entity_id, entity_type); CREATE INDEX idx_events_type ON events(event_type); CREATE INDEX idx_events_actor ON events(actor); CREATE INDEX idx_events_created_at ON events(created_at); CREATE INDEX idx_events_git_commit ON events(git_commit_sha); ``` **Event Population Strategy:** The events table is populated through three mechanisms: 1. **Real-time (Immediate)**: When CLI commands execute, events are inserted in the same transaction ```typescript async function updateIssue(id: string, updates: Record): Promise { const tx = await db.beginTransaction(); try { const oldState = await db.getIssue(tx, id); await db.updateIssueInTx(tx, id, updates); await db.insertEventInTx(tx, { entityId: id, entityType: 'issue', eventType: 'updated', actor: getCurrentUser(), oldValue: JSON.stringify(oldState), newValue: JSON.stringify(updates), createdAt: new Date(), }); await tx.commit(); } catch (error) { await tx.rollback(); throw error; } } ``` 2. **Short History (Default)**: On import from JSONL, reconstruct recent events by diffing - Compare incoming JSONL state vs existing SQLite state - Generate synthetic events for detected changes - Limited to changes since last sync (typically recent work) ```typescript function reconstructShortHistory(newEntries: Entity[], oldEntries: Entity[]): Event[] { const events: Event[] = []; for (const newEntity of newEntries) { const old = findInOld(oldEntries, newEntity.id); if (!old) { events.push({ type: 'created', .../* entity fields */ }); } else if (hasChanges(old, newEntity)) { events.push({ type: 'updated', oldValue: JSON.stringify(old), newValue: JSON.stringify(newEntity), actor: newEntity.updatedBy, createdAt: newEntity.updatedAt, }); } } return events; } ``` 3. **Full History from Git (TODO - Future Enhancement)**: Parse git commit history of JSONL files - **Use case**: Multi-environment collaborative workflows need complete audit trail - **Approach**: Walk git log, diff JSONL files across commits, reconstruct full timeline - **Implementation considerations**: ```typescript // Future: sudocode history rebuild --from-git async function rebuildHistoryFromGit(): Promise { // 1. Walk git log for .jsonl file changes const commits = await git.log(['--follow', '--', '.sudocode/specs/specs.jsonl']); // 2. For each commit, diff JSONL lines for (const commit of commits) { const oldLines = await git.show([`${commit.parent}:specs.jsonl`]); const newLines = await git.show([`${commit.sha}:specs.jsonl`]); const changes = diffJSONL(oldLines, newLines); // 3. Generate events from diff for (const change of changes) { await db.insertEvent({ entityId: change.id, entityType: change.entityType, eventType: change.type, actor: commit.author, oldValue: change.before, newValue: change.after, createdAt: commit.timestamp, gitCommitSha: commit.sha, source: 'git-reconstructed', }); } } } ``` - **Challenges to solve later**: - Performance: Large repos with thousands of commits - Attribution: Git author ≠ actual change actor (multiple devs, agents) - Merge commits: How to attribute changes in merge conflicts? - Squashed/rebased history: May lose granular timeline - Initial import: One-time operation vs incremental updates **Trade-offs:** - **Real-time events**: Precise, but only for local changes - **Short history**: Fast, good enough for "what changed recently?" - **Full git history**: Complete audit trail, but complex and potentially slow **Recommended workflow:** 1. Start with real-time + short history reconstruction (phases 1-4) 2. Add git history reconstruction later when collaboration patterns emerge 3. Make it opt-in: `sudocode history rebuild --from-git --since 2025-01-01` ### Views for Common Queries #### Ready Work Views **Ready Specs** - Specs with no blocking dependencies: ```sql CREATE VIEW ready_specs AS SELECT s.* FROM specs s WHERE s.status IN ('draft', 'review') AND NOT EXISTS ( SELECT 1 FROM relationships r JOIN specs blocker ON r.to_id = blocker.id AND r.to_type = 'spec' WHERE r.from_id = s.id AND r.from_type = 'spec' AND r.relationship_type = 'blocks' AND blocker.status IN ('draft', 'review') ); ``` **Ready Issues** - Issues with no blocking dependencies: ```sql CREATE VIEW ready_issues AS SELECT i.* FROM issues i WHERE i.status = 'open' AND NOT EXISTS ( SELECT 1 FROM relationships r JOIN issues blocker ON r.to_id = blocker.id AND r.to_type = 'issue' WHERE r.from_id = i.id AND r.from_type = 'issue' AND r.relationship_type = 'blocks' AND blocker.status IN ('open', 'in_progress', 'blocked') ); ``` #### Blocked Issues View ```sql CREATE VIEW blocked_issues AS SELECT i.*, COUNT(r.to_id) as blocked_by_count, GROUP_CONCAT(r.to_id) as blocked_by_ids FROM issues i JOIN relationships r ON i.id = r.from_id AND r.from_type = 'issue' JOIN issues blocker ON r.to_id = blocker.id AND r.to_type = 'issue' WHERE i.status IN ('open', 'in_progress', 'blocked') AND r.relationship_type = 'blocks' AND blocker.status IN ('open', 'in_progress', 'blocked') GROUP BY i.id; ``` ### Database Configuration ```sql -- Enable WAL mode for better concurrency PRAGMA journal_mode=WAL; -- Enforce foreign keys PRAGMA foreign_keys=ON; -- Optimize for performance PRAGMA synchronous=NORMAL; PRAGMA temp_store=MEMORY; PRAGMA mmap_size=30000000000; PRAGMA page_size=4096; PRAGMA cache_size=10000; ``` --- ## JSONL Format ### Specs JSONL (`.sudocode/specs/specs.jsonl`) One spec per line, complete snapshot: ```json {"id":"spec-001","title":"Auth System","file_path":".sudocode/specs/auth-system.md","content":"# Auth System\n\n...","type":"architecture","status":"draft","priority":1,"created_at":"2025-10-16T10:00:00Z","updated_at":"2025-10-16T15:00:00Z","created_by":"alice","updated_by":"alice","parent_id":"spec-000","relationships":[{"from":"spec-001","to":"spec-002","type":"blocks"}],"tags":["auth","security"]} ``` **Fields**: - Core fields from `specs` table - `relationships`: Array of outgoing relationships - `tags`: Array of tag strings - `parent_id`: Optional parent spec ID ### Issues JSONL (`.sudocode/issues/issues.jsonl`) One issue per line: ```json {"id":"issue-001","title":"Implement OAuth endpoint","description":"Create REST endpoint...","content":"# Details\n\n...","status":"open","priority":1,"issue_type":"task","assignee":"agent-backend","estimated_minutes":120,"created_at":"2025-10-16T10:00:00Z","updated_at":"2025-10-16T15:00:00Z","closed_at":null,"created_by":"agent-planner","parent_id":null,"relationships":[{"from":"issue-001","to":"spec-001","type":"implements"},{"from":"issue-001","to":"issue-002","type":"blocks"}],"tags":["auth","backend"]} ``` **Fields**: - Core fields from `issues` table - `relationships`: Array of outgoing relationships - `tags`: Array of tag strings - `spec_refs`: Extracted from content (computed, not stored) --- ## Sync Mechanisms ### Three-Way Sync Flow ``` Markdown File <──┐ │ ├──> JSONL File <──> SQLite DB │ User Edit ───────┘ ``` ### Sync Operations #### 1. Markdown → JSONL → SQLite (User edits markdown) **Trigger**: File watcher detects `.md` file change **Process**: 1. Parse markdown file (frontmatter + content) 2. Extract ID from frontmatter 3. Load existing JSONL entry (if exists) 4. Merge changes (preserve relationships if not in frontmatter) 5. Update JSONL line (replace existing or append) 6. Update SQLite row 7. Insert event record **Implementation**: ```typescript async function syncMarkdownToJSONL(mdPath: string): Promise { // Parse markdown const spec = parseMarkdownFile(mdPath); // Load existing JSONL entry const existing = await loadFromJSONL(spec.id); // Merge relationships (preserve if not explicit in frontmatter) spec.relationships = mergeRelationships(existing, spec); // Update JSONL await updateJSONLLine(spec); // Update SQLite await db.updateSpec(spec); await db.insertEvent({ entityType: 'spec', entityId: spec.id, eventType: 'updated', actor: getCurrentUser(), }); } ``` #### 2. SQLite → JSONL → Markdown (CLI updates) **Trigger**: `sudocode` CLI command modifies database **Process**: 1. Update SQLite row 2. Insert event record 3. Queue export operation (debounced, 5 seconds) 4. On export: Read from SQLite, write to JSONL 5. Optionally update markdown frontmatter (user config) **Implementation**: ```typescript const exportQueue = newDebouncer(5000); // 5 seconds async function updateIssue(id: string, updates: Record): Promise { // Update SQLite await db.updateIssue(id, updates); await db.insertEvent({ entityType: 'issue', entityId: id, eventType: 'updated', actor: getCurrentUser(), }); // Queue export (debounced) exportQueue.trigger(async () => { await exportToJSONL(); if (config.syncMarkdownFrontmatter) { await syncFrontmatterFromDB(id); } }); } ``` #### 3. JSONL → SQLite (After git pull) **Trigger**: User runs `sudocode sync` or automatic after git operations **Process**: 1. Detect JSONL file changes (compare mtime or hash) 2. Load all JSONL entries 3. Compare with SQLite (detect additions, updates, deletions) 4. Apply changes with collision resolution 5. Optionally update markdown files **Implementation**: ```typescript async function importFromJSONL(path: string): Promise { // Read JSONL const entries = await readJSONL(path); // Load existing SQLite state const existing = await db.loadAll(); // Diff and resolve collisions let changes = detectChanges(entries, existing); const collisions = detectCollisions(changes); if (collisions.length > 0) { const resolved = resolveCollisions(collisions); changes = applyResolutions(changes, resolved); } // Apply changes const tx = await db.beginTransaction(); try { for (const change of changes) { switch (change.type) { case 'add': await db.insert(tx, change.entity); break; case 'update': await db.update(tx, change.entity); break; case 'delete': await db.delete(tx, change.entity); break; } } await tx.commit(); } catch (error) { await tx.rollback(); throw error; } } ``` ### Collision Resolution Strategy When importing JSONL after git merge, collisions may occur: 1. **ID Collision**: Same ID, different content - Score by reference count (count mentions in other entities) - Renumber entity with fewer references - Update all text references using regex `\b{old-id}\b` - Update relationship records - Log mapping: `{old-id} → {new-id}` 2. **Concurrent Updates**: Both sides updated same entity - Timestamp-based: newest wins - Or: Mark as conflict, require manual resolution 3. **Reference Breaks**: Referenced ID doesn't exist - Create placeholder entity (status: `missing`) - Or: Remove dangling reference --- ## File Watching ### Auto-Sync on File Changes **Options**: 1. **chokidar library** (Node.js): Watch `.sudocode/` directory for changes 2. **Polling**: Check file mtimes every 5 seconds 3. **Git hooks**: Sync on pre-commit and post-merge **Recommended**: chokidar + debouncing ```typescript import chokidar from 'chokidar'; const watcher = chokidar.watch(['.sudocode/specs', '.sudocode/issues'], { persistent: true, ignoreInitial: true, }); const debouncer = newDebouncer(2000); // 2 seconds watcher.on('change', (path) => { debouncer.trigger(async () => { await syncFile(path); }); }); watcher.on('add', (path) => { debouncer.trigger(async () => { await syncFile(path); }); }); ``` --- ## Export/Import CLI Commands ### sudocode export ```bash # Export SQLite → JSONL sudocode export # Export specs + issues to respective JSONL files sudocode export --specs # Export only specs sudocode export --issues # Export only issues sudocode export --output backup/ # Export to custom directory ``` ### sudocode import ```bash # Import JSONL → SQLite sudocode import # Import from .sudocode/specs.jsonl & issues.jsonl sudocode import --input backup/ # Import from custom directory sudocode import --resolve-collisions # Auto-resolve ID collisions sudocode import --dry-run # Show changes without applying ``` ### sudocode sync ```bash # Bi-directional sync sudocode sync # Detect changes, sync all directions sudocode sync --watch # Start file watcher, continuous sync sudocode sync --from-git # Import after git pull ``` --- ## Metadata File (`.sudocode/meta.json`) Stores ID counters and configuration: ```json { "version": "1.0", "next_spec_id": 43, "next_issue_id": 157, "id_prefix": { "spec": "spec-", "issue": "issue-" }, "last_sync": "2025-10-16T15:30:00Z", "collision_log": [ {"old_id": "issue-042", "new_id": "issue-158", "reason": "merge_conflict", "timestamp": "2025-10-16T15:00:00Z"} ] } ``` --- ## Performance Considerations ### Database Size **Expected Scale**: - 1000 specs → ~1MB JSONL, 5MB SQLite - 10000 issues → ~10MB JSONL, 50MB SQLite - 50000 relationships → ~5MB JSONL, 25MB SQLite **Query Performance**: - Ready work query: <10ms (indexed) - Dependency tree (depth 5): <50ms - Full-text search: <100ms (with FTS5) ### Sync Performance **Markdown → JSONL**: <10ms per file (parse + write) **JSONL → SQLite**: <1s for 10k entries (bulk insert) **Collision Detection**: <100ms for 10k entries ### Debouncing - File changes: 2-second debounce - CLI exports: 5-second debounce - Batch operations: No debounce (immediate export) --- ## Implementation Checklist ### Phase 1: Core Storage - [ ] Define SQLite schema (specs, issues, relationships, tags, events) - [ ] Create migration system (embed schema in binary) - [ ] Implement CRUD operations for specs and issues - [ ] Add indexes and views ### Phase 2: JSONL Sync - [ ] JSONL reader/writer - [ ] Export: SQLite → JSONL - [ ] Import: JSONL → SQLite - [ ] Collision detection and resolution ### Phase 3: Markdown Sync - [ ] Markdown parser (frontmatter + content) - [ ] Markdown writer (update frontmatter) - [ ] Markdown → JSONL sync - [ ] JSONL → Markdown sync ### Phase 4: Auto-Sync - [ ] File watcher (fsnotify) - [ ] Debouncing mechanism - [ ] Git hooks (pre-commit, post-merge) - [ ] `sudocode sync` command with watch mode ### Phase 5: Optimization - [ ] Batch operations - [ ] Transaction management - [ ] Query optimization - [ ] Full-text search (FTS5) ### Phase 6: Event History Reconstruction (Future) - [ ] Implement short history reconstruction (JSONL diff on import) - [ ] Design git history parser (walk commits, diff JSONL files) - [ ] Handle attribution challenges (git author vs actual actor) - [ ] Optimize for large repos (incremental parsing, caching) - [ ] Add `sudocode history rebuild --from-git` command - [ ] Support filtering by date range and entity type - [ ] Document performance characteristics and limitations --- ## Next Steps 1. Validate this storage design with the team 2. Create proof-of-concept: SQLite schema + basic CRUD 3. Implement JSONL export/import 4. Add markdown sync 5. Build auto-sync with file watching 6. Add event reconstruction from JSONL diffs (short history) 7. Defer git history reconstruction until multi-environment collaboration patterns emerge