Add evaluate-prompt-portability template, protocol, and format (#239)

* Add evaluate-prompt-portability template, protocol, and format

Adds three new PromptKit components for cross-LLM prompt portability evaluation (addresses #127 Phase 1):

- Protocol: prompt-portability-evaluation, a 7-phase claim-level consensus analysis methodology (output collection, claim extraction, semantic matching, consensus classification, divergence analysis, scoring, and hardening recommendations)
- Format: portability-report, a 9-section structured report covering evaluation context, per-model summaries, consensus core, majority claims, divergent claims (singular + contradictory), scorecard, hardening recommendations, and model notes
- Template: evaluate-prompt-portability, an interactive template that orchestrates fan-out execution across multiple LLM models, collects outputs, decomposes them into atomic semantic claims, performs cross-model consensus analysis, and produces a portability report

Key design: comparison is semantic (claim-level), not textual. Two models producing the same assertions in different words score as Consensus. Contradictory claims (mutually exclusive assertions) are the highest-priority signal, traced to specific ambiguous prompt language with concrete rewrite recommendations.

Complements the existing lint-prompt template: lint statically first, then evaluate empirically.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Address PR review feedback: scoring, thresholds, error handling

- Fix Majority threshold from >=50% to >50% to avoid ties with even model counts
- Normalize portability score from [-1,1] to [0,1] via (raw_weighted_mean + 1.0) / 2.0 so contradictory claims cannot produce negative scores
- Define explicit Manual Review bucket for uncertain claim matches: excluded from scoring, reported in new Uncertain / Needs Review section in the portability-report format
- Add fail-stop behavior when <2 models succeed: produce abbreviated report documenting failures instead of misleading partial analysis
- Add UR- claim ID prefix to formatting rules

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add reference model sufficiency analysis for model selection

Extends the portability evaluation with a new mode: when a reference model is designated, its claims become the ground-truth baseline and each cheaper model is scored on how well it reproduces that baseline.
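
As a rough illustration of this mode, the sketch below computes a per-model sufficiency rate (reproduced / total baseline claims) and picks the cheapest model that meets the threshold with zero critical misses. It is not the protocol's actual implementation: the `ModelResult` type, the `cost` field, and the function names are hypothetical, and treating "meets threshold" as `>=` is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ModelResult:
    """Hypothetical per-model summary; field names are illustrative only."""
    name: str
    cost: float            # assumed relative cost, lower means cheaper
    reproduced: int        # baseline claims this model reproduced
    critical_misses: int   # missed baseline claims judged critical


def sufficiency_rate(reproduced: int, total_baseline: int) -> float:
    """Sufficiency rate = reproduced / total baseline claims."""
    if total_baseline <= 0:
        raise ValueError("the baseline must contain at least one claim")
    return reproduced / total_baseline


def minimum_sufficient_model(
    results: Iterable[ModelResult],
    total_baseline: int,
    threshold: float = 0.90,  # mirrors the 90% default of sufficiency_threshold
) -> Optional[ModelResult]:
    """Return the cheapest model meeting the threshold with zero critical misses."""
    eligible = [
        r for r in results
        if r.critical_misses == 0
        and sufficiency_rate(r.reproduced, total_baseline) >= threshold  # assumed >=
    ]
    # None signals that no candidate qualifies.
    return min(eligible, key=lambda r: r.cost, default=None)
```

The three-tier status (sufficient, conditionally sufficient, insufficient) is left out on purpose, since its exact rules are defined in the protocol rather than in this message.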

Protocol (Phase 8 - Model Sufficiency Analysis):

- Per-model sufficiency rate = reproduced / total baseline claims
- Missing claims classified as critical miss vs minor miss
- Extra claims classified as valid addition / hallucination / noise
- Three-tier sufficiency status: sufficient, conditionally sufficient, insufficient (based on threshold, critical misses, contradictions)
- Identifies the minimum sufficient model (cheapest that meets threshold with zero critical misses)

Format (Section 10 - Model Sufficiency Matrix):

- Reference model + threshold display
- Per-model sufficiency table with tier, rates, and status
- Missing and extra claim detail tables
- Cost-efficiency recommendation

Template:

- New params: reference_model (optional), sufficiency_threshold (default 90%)
- Input validation ensures reference model is in the model list
- Step 7 conditionally applies Phase 8 when reference model is set

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix template Steps 4-5 for Manual Review bucket consistency

Steps 4 and 5 were inconsistent with the protocol's Manual Review bucket rules for uncertain matches. Step 4 only flagged uncertain matches instead of placing them in a Manual Review bucket. Step 5 classified all clusters without excluding Manual Review clusters from scoring.

Fixed:

- Step 4: uncertain matches now placed in Manual Review bucket, excluded from scored classification, reported under Uncertain / Needs Review section
- Step 5: classifies only non-Manual-Review clusters, includes Manual Review count in the summary presented to user

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix heading nesting, section omission rule, and cluster type ambiguity

- Fix 6c heading from ## to ### to match 6a/6b nesting under section 6 (Divergent Claims)
- Section 10 (Model Sufficiency Matrix) now always included per the format's 'do not omit any section' rule, with a placeholder when no reference model is designated
- Add canonical cluster type rule for scoring: when models assign different types to semantically matched claims, use the highest-weight type, break ties by majority, then arbiter decides. Original per-model types preserved for transparency.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Alan Jowett <alan.jowett@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
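
For reference, the two scoring fixes from the review-feedback commit in this message (the strict >50% Majority threshold and the [-1,1] to [0,1] normalization) can be sketched as below. The function names are hypothetical and the range check is an added assumption. Only the formula `(raw_weighted_mean + 1.0) / 2.0` and the >50% rule come from the commit message.

```python
def is_majority(agreeing_models: int, total_models: int) -> bool:
    """Strict majority: more than 50% of models must agree.

    Integer arithmetic avoids float rounding, and a 2-of-4 split
    is not a majority under the strict rule.
    """
    if total_models <= 0:
        raise ValueError("total_models must be positive")
    return agreeing_models * 2 > total_models


def normalize_portability_score(raw_weighted_mean: float) -> float:
    """Map a raw weighted mean in [-1, 1] onto [0, 1]."""
    if not -1.0 <= raw_weighted_mean <= 1.0:
        # Assumed guard: the commit message only states the input range.
        raise ValueError("raw_weighted_mean must lie in [-1, 1]")
    return (raw_weighted_mean + 1.0) / 2.0
```

With four models, `is_majority(2, 4)` is false while `is_majority(3, 4)` is true, which is the tie the threshold change removes. A raw mean of -1.0 maps to 0.0, so contradictory claims can no longer produce a negative score.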
Alan Jowett committed
bb7babe3a029a539d3fe38efa9ebfa1f41a74744
Parent: 8363a14
Committed by GitHub <noreply@github.com>
on 4/9/2026, 6:04:57 PM