Add evaluate-prompt-portability template, protocol, and format (#239)

* Add evaluate-prompt-portability template, protocol, and format

Adds three new PromptKit components for cross-LLM prompt portability
evaluation (addresses #127 Phase 1):

- Protocol: prompt-portability-evaluation — 7-phase claim-level
  consensus analysis methodology (output collection, claim extraction,
  semantic matching, consensus classification, divergence analysis,
  scoring, and hardening recommendations)

- Format: portability-report — 9-section structured report covering
  evaluation context, per-model summaries, consensus core, majority
  claims, divergent claims (singular + contradictory), scorecard,
  hardening recommendations, and model notes

- Template: evaluate-prompt-portability — interactive template that
  orchestrates fan-out execution across multiple LLM models, collects
  outputs, decomposes them into atomic semantic claims, performs
  cross-model consensus analysis, and produces a portability report

Key design: comparison is semantic (claim-level), not textual. Two
models producing the same assertions in different words score as
Consensus. Contradictory claims (mutually exclusive assertions) are
the highest-priority signal, traced to specific ambiguous prompt
language with concrete rewrite recommendations.

Complements the existing lint-prompt template — lint statically first,
then evaluate empirically.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Address PR review feedback: scoring, thresholds, error handling

- Fix Majority threshold from >=50% to >50% to avoid ties with even
  model counts
- Normalize portability score from [-1,1] to [0,1] via
  (raw_weighted_mean + 1.0) / 2.0 so contradictory claims cannot
  produce negative scores
- Define explicit Manual Review bucket for uncertain claim matches:
  excluded from scoring, reported in new Uncertain / Needs Review
  section in the portability-report format
- Add fail-stop behavior when fewer than two models succeed: produce an
  abbreviated report documenting the failures instead of a misleading
  partial analysis
- Add UR- claim ID prefix to formatting rules
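
The threshold and normalization changes above can be sketched in Python
as follows. This is a minimal illustration, not the protocol text: the
function names, the assumption that Consensus means every model agrees,
and the use of "Divergent" as the fallback label are hypothetical. Only
the strict >50% Majority rule, the two-model minimum, and the
(raw_weighted_mean + 1.0) / 2.0 mapping come from this commit.

```python
def classify_support(supporting: int, total: int) -> str:
    """Classify a claim cluster by how many models assert it.

    Assumption: "Consensus" means all models agree; the protocol may
    define it differently.
    """
    if total < 2:
        # Fail-stop: cross-model analysis needs at least two outputs.
        raise ValueError("need at least two successful models")
    if supporting == total:
        return "Consensus"
    # Strictly more than half, so 2 of 4 models is not a Majority.
    if supporting * 2 > total:
        return "Majority"
    return "Divergent"


def normalize_score(raw_weighted_mean: float) -> float:
    """Map a raw score in [-1, 1] onto [0, 1]."""
    if not -1.0 <= raw_weighted_mean <= 1.0:
        raise ValueError("raw score must lie in [-1, 1]")
    return (raw_weighted_mean + 1.0) / 2.0
```

With an even model count such as four, a 2-2 split falls through to
"Divergent" instead of counting as Majority for both sides, which is
the tie the stricter threshold is meant to avoid.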

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add reference model sufficiency analysis for model selection

Extends the portability evaluation with a new mode: when a reference
model is designated, its claims become the ground-truth baseline and
each cheaper model is scored on how well it reproduces that baseline.

Protocol (Phase 8 - Model Sufficiency Analysis):
- Per-model sufficiency rate = reproduced / total baseline claims
- Missing claims classified as critical miss vs minor miss
- Extra claims classified as valid addition / hallucination / noise
- Three-tier sufficiency status: sufficient, conditionally sufficient,
  insufficient (based on threshold, critical misses, contradictions)
- Identifies the minimum sufficient model (cheapest that meets
  threshold with zero critical misses)
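
A rough Python sketch of the Phase 8 calculations listed above. The
ModelResult fields, the cost field, and the exact mapping from
threshold, critical misses, and contradictions to the three tiers are
assumptions made for illustration; the commit states only the rate
formula, the tier names, and the rule for the minimum sufficient model.
The 0.90 default mirrors the template's sufficiency_threshold default.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ModelResult:
    name: str
    cost: float           # hypothetical relative cost; lower is cheaper
    reproduced: int       # baseline claims this model reproduced
    critical_misses: int  # missing baseline claims judged critical
    contradictions: int   # claims that contradict the baseline


def sufficiency_rate(result: ModelResult, total_baseline: int) -> float:
    """Sufficiency rate = reproduced / total baseline claims."""
    if total_baseline <= 0:
        raise ValueError("baseline must contain at least one claim")
    return result.reproduced / total_baseline


def sufficiency_status(result: ModelResult, total_baseline: int,
                       threshold: float = 0.90) -> str:
    """Assumed tier mapping; the protocol may combine the inputs differently."""
    rate = sufficiency_rate(result, total_baseline)
    if rate < threshold or result.critical_misses > 0:
        return "insufficient"
    if result.contradictions > 0:
        return "conditionally sufficient"
    return "sufficient"


def minimum_sufficient_model(results: Sequence[ModelResult],
                             total_baseline: int,
                             threshold: float = 0.90) -> Optional[ModelResult]:
    """Cheapest model that meets the threshold with zero critical misses."""
    eligible = [
        r for r in results
        if sufficiency_rate(r, total_baseline) >= threshold
        and r.critical_misses == 0
    ]
    return min(eligible, key=lambda r: r.cost, default=None)
```

Returning None when no model qualifies keeps the "no cheaper model is
sufficient" case explicit rather than silently falling back to the
reference model.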

Format (Section 10 - Model Sufficiency Matrix):
- Reference model + threshold display
- Per-model sufficiency table with tier, rates, and status
- Missing and extra claim detail tables
- Cost-efficiency recommendation

Template:
- New params: reference_model (optional), sufficiency_threshold
  (default 90%)
- Input validation ensures reference model is in the model list
- Step 7 conditionally applies Phase 8 when reference model is set

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix template Steps 4-5 for Manual Review bucket consistency

Steps 4 and 5 were inconsistent with the protocol's Manual Review
bucket rules for uncertain matches. Step 4 only flagged uncertain
matches instead of placing them in a Manual Review bucket. Step 5
classified all clusters without excluding Manual Review clusters
from scoring.

Fixed:
- Step 4: uncertain matches now placed in Manual Review bucket,
  excluded from scored classification, reported under Uncertain /
  Needs Review section
- Step 5: classifies only non-Manual-Review clusters, includes
  Manual Review count in the summary presented to user
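
The corrected Step 4 and Step 5 behavior amounts to partitioning
clusters before classification. A minimal Python sketch, in which the
Cluster type, its uncertain_match flag, and the summary keys are
hypothetical names chosen for illustration:

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class Cluster:
    cluster_id: str
    uncertain_match: bool  # True when semantic matching was not confident


def split_for_scoring(
    clusters: Sequence[Cluster],
) -> Tuple[List[Cluster], List[Cluster]]:
    """Separate scored clusters from the Manual Review bucket."""
    scored = [c for c in clusters if not c.uncertain_match]
    manual_review = [c for c in clusters if c.uncertain_match]
    return scored, manual_review


def summary_counts(clusters: Sequence[Cluster]) -> Dict[str, int]:
    """Counts for the user-facing summary, including Manual Review."""
    scored, manual_review = split_for_scoring(clusters)
    return {"scored": len(scored), "manual_review": len(manual_review)}
```

Only the scored list would be passed on to classification and scoring;
the manual_review list would feed the Uncertain / Needs Review section.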

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix heading nesting, section omission rule, and cluster type ambiguity

- Fix 6c heading from ## to ### to match 6a/6b nesting under
  section 6 (Divergent Claims)
- Section 10 (Model Sufficiency Matrix) now always included per
  the format's 'do not omit any section' rule, with a placeholder
  when no reference model is designated
- Add canonical cluster type rule for scoring: when models assign
  different types to semantically matched claims, use the
  highest-weight type; break ties by majority, and if still tied, the
  arbiter decides. Original per-model types are preserved for
  transparency.
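
The canonical cluster type rule can be sketched in Python as below. The
type names and weights in the usage example are invented for
illustration, and the arbiter is modeled as a caller-supplied callback;
the commit specifies only the order of the rule (highest weight,
then majority, then arbiter).

```python
from collections import Counter
from typing import Callable, Mapping, Sequence


def canonical_type(
    per_model_types: Sequence[str],
    weights: Mapping[str, float],
    arbiter: Callable[[Sequence[str]], str],
) -> str:
    """Pick one type for a cluster without modifying per_model_types."""
    if not per_model_types:
        raise ValueError("cluster has no per-model types")
    # 1. Keep only the types that share the highest weight.
    top_weight = max(weights[t] for t in per_model_types)
    candidates = sorted({t for t in per_model_types if weights[t] == top_weight})
    if len(candidates) == 1:
        return candidates[0]
    # 2. Break ties by how many models chose each remaining type.
    counts = Counter(t for t in per_model_types if t in candidates)
    best = max(counts.values())
    leaders = sorted(t for t in candidates if counts[t] == best)
    if len(leaders) == 1:
        return leaders[0]
    # 3. Still tied: defer to the arbiter.
    return arbiter(leaders)


# Hypothetical weights, for illustration only.
example_weights = {"finding": 1.0, "risk": 1.0, "note": 0.5}
chosen = canonical_type(["finding", "risk", "risk"], example_weights,
                        arbiter=lambda tied: tied[0])
```

The input sequence is left untouched, so the original per-model types
remain available for the report.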

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Alan Jowett <alan.jowett@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Alan Jowett committed
bb7babe3a029a539d3fe38efa9ebfa1f41a74744
Parent: 8363a14
Committed by GitHub <noreply@github.com> on 4/9/2026, 6:04:57 PM