
AI Feature Test Plan (2026)

Eval golden set · hallucination budget · red-team · prompt-injection · monitoring · rollout — for shipping LLM features safely.

Why AI features need their own test plan

Traditional software is deterministic: same input, same output. You write a test, it passes or fails, and the result is reproducible. LLM-backed features are probabilistic: the same input can produce different outputs, quality drifts as the underlying model is updated, and the failure modes (hallucinations, prompt injection, cost runaways) don't map cleanly onto classic pass/fail testing. Forcing AI features through a traditional test plan misses what actually breaks in production: hallucinations land in user-facing copy, injected prompts leak confidential context, latency spikes during peak hours, and costs balloon when a prompt change increases output tokens by 30%.

The golden evaluation set — your regression baseline

A golden set is a curated dataset of input/expected-output pairs that represents the real distribution of your feature's traffic. For a focused feature (e.g., classifying support tickets), 100 examples is a reasonable starting baseline; for a general-purpose assistant, start with 500 or more. Update the set monthly with examples from real failures, so that every shipped bug adds an entry. Run the set nightly in CI and alert when the score regresses beyond a threshold (typically 5 percentage points).
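A nightly regression check of this kind can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed harness: the names `GoldenExample`, `golden_set_accuracy`, and `is_regression` are hypothetical, `predict` stands in for whatever calls your model, and exact-match scoring is an assumption that only suits classification-style features.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class GoldenExample:
    """One curated input/expected-output pair (hypothetical shape)."""
    input_text: str
    expected: str


def golden_set_accuracy(
    examples: Sequence[GoldenExample],
    predict: Callable[[str], str],
) -> float:
    """Fraction of examples whose prediction exactly matches the expected output."""
    if not examples:
        raise ValueError("golden set is empty")
    hits = sum(1 for ex in examples if predict(ex.input_text).strip() == ex.expected)
    return hits / len(examples)


def is_regression(current: float, baseline: float, threshold_pp: float = 5.0) -> bool:
    """True when accuracy fell by more than threshold_pp percentage points."""
    # Round to suppress floating-point noise so a drop of exactly 5pp does not alert.
    drop_pp = round((baseline - current) * 100, 6)
    return drop_pp > threshold_pp
```

In CI, the job would compute `golden_set_accuracy` against the stored baseline and fail the build when `is_regression` returns `True`. Free-text features would need a softer scorer than exact match.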

Eval methods, in order of cost:

Hallucination budget — the explicit tolerance

What percentage of outputs are allowed to be factually wrong on the golden set? "Below 2%" is a reasonable starting budget for factual products; "below 5%" is reasonable for general assistants that carry clear "verify with documentation" disclaimers. The number you pick will be debated, and that is the point: without an explicit budget, hallucination tolerance becomes whatever ships.
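To make the budget concrete, the check below treats it as a strict upper bound on the share of golden-set outputs judged factually wrong. It is a sketch under assumptions: the function names are hypothetical, and the per-output verdicts (`True` meaning "factually wrong") are assumed to come from a human reviewer or an automated judge that is not shown here.

```python
from typing import Sequence


def hallucination_rate(wrong_flags: Sequence[bool]) -> float:
    """Share of outputs flagged as factually wrong (True = wrong)."""
    if not wrong_flags:
        raise ValueError("no judged outputs")
    return sum(wrong_flags) / len(wrong_flags)


def within_budget(wrong_flags: Sequence[bool], budget: float = 0.02) -> bool:
    """'Below 2%' is read as a strict inequality: rate < budget."""
    return hallucination_rate(wrong_flags) < budget
```

Read this way, a "below 2%" budget on a 100-example golden set allows at most one wrong output, and on a 500-example set at most nine.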

Red-team scenarios — adversarial testing

Standard red-team scenarios to cover for any LLM feature:

Each scenario has explicit pass criteria — typically "refuses and explains" or "returns to safe default behaviour." New successful red-team finds become regression tests.

Prompt-injection regression suite

Maintain a regression suite of prompt-injection attempts: hidden instructions in pasted content, Markdown/HTML confusion, fake conversation turns, role-confusion. Run nightly. Detect failures via pattern matching plus a classifier. Treat any new successful injection as P1 — the same vector usually transfers to other prompts in your product.
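The pattern-matching half of such a suite might look like the sketch below. Everything named here is an assumption for illustration: the canary string is a hypothetical marker you would plant in the system prompt, `call_model` stands in for your real client, and the classifier half mentioned above is left out.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical secret planted in the system prompt; seeing it in output means a leak.
CANARY = "CANARY-7f3a"

LEAK_PATTERNS = [
    re.compile(re.escape(CANARY)),
    re.compile(r"\bsystem prompt\s*:", re.IGNORECASE),
]


@dataclass(frozen=True)
class InjectionCase:
    name: str
    payload: str  # user-supplied content carrying the injection attempt


def injection_succeeded(output: str) -> bool:
    """True when the model output matches any known leak pattern."""
    return any(p.search(output) for p in LEAK_PATTERNS)


def run_suite(
    cases: Sequence[InjectionCase],
    call_model: Callable[[str], str],
) -> List[str]:
    """Return the names of cases where the injection got through."""
    return [c.name for c in cases if injection_succeeded(call_model(c.payload))]
```

A nightly job would fail whenever `run_suite` returns a non-empty list. Pattern matching only catches leaks you anticipated, which is why the text pairs it with a classifier.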

Monitoring stack

Three layers of monitoring:

Each layer feeds explicit alert thresholds: p95 latency above 3s for 5 minutes pages on-call; daily spend above 110% of budget notifies; golden-set regression above 5 percentage points blocks deploys.
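The three example thresholds can be written down as one small evaluation function. This is a sketch with assumed inputs: one p95 latency sample per minute (so "for 5 minutes" becomes "the last five samples"), plus the daily spend, the daily budget, and the golden-set drop in percentage points. The function and action names are hypothetical.

```python
from typing import List, Sequence


def evaluate_alerts(
    p95_seconds_by_minute: Sequence[float],
    daily_spend: float,
    daily_budget: float,
    golden_drop_pp: float,
) -> List[str]:
    """Map the current readings to the actions named in the thresholds above."""
    actions: List[str] = []

    # p95 latency above 3s for 5 consecutive one-minute samples pages on-call.
    last_five = list(p95_seconds_by_minute[-5:])
    if len(last_five) == 5 and all(sample > 3.0 for sample in last_five):
        actions.append("page_on_call")

    # Daily spend above 110% of budget sends a notification.
    if daily_spend > 1.10 * daily_budget:
        actions.append("notify_spend")

    # Golden-set regression above 5 percentage points blocks deploys.
    if golden_drop_pp > 5.0:
        actions.append("block_deploy")

    return actions
```

Keeping the thresholds in one place like this makes them reviewable alongside the test plan, rather than scattered across dashboard settings.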

Fallback behaviour — the most-forgotten section

When the LLM errors, times out, or returns malformed output, what does the user see? Document and test the path: retry once, then degrade to a deterministic rule-based response and show a "temporarily unavailable" banner. Verify the fallback works in staging by forcing failures. The first time you see a real failure in production should not be the first time you've tested the fallback.
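The retry-then-degrade path is small enough to sketch and test directly. The version below assumes the model is expected to return a JSON object with an `answer` key; that schema, the function name, and the `degraded` flag (which the UI could use to show the "temporarily unavailable" banner) are illustrative assumptions, and `call_llm` and `rule_based` stand in for your own client and fallback rule.

```python
import json
from typing import Callable, Dict, Union


def answer_with_fallback(
    call_llm: Callable[[str], str],
    rule_based: Callable[[str], str],
    prompt: str,
) -> Dict[str, Union[str, bool]]:
    """Try the LLM, retry once, then degrade to the deterministic response."""
    for _attempt in range(2):  # first try plus one retry
        try:
            raw = call_llm(prompt)
            parsed = json.loads(raw)  # malformed output raises ValueError
            if isinstance(parsed, dict) and "answer" in parsed:
                return {"answer": parsed["answer"], "degraded": False}
            # Valid JSON but wrong shape: fall through and retry.
        except (TimeoutError, ConnectionError, ValueError):
            continue
    return {"answer": rule_based(prompt), "degraded": True}
```

Forcing failures in staging then amounts to passing a `call_llm` that always raises or always returns garbage, and asserting that the result comes back with `degraded` set to `True`.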

Rollout and rollback signals

Use a phased rollout for any non-trivial AI feature: internal dogfood (1 week) → 5% canary → 25% → 100%. Gate each phase behind a feature flag, and test the kill switch weekly.
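One common way to implement the percentage stages is deterministic hash bucketing, sketched below. This is an assumed technique, not something the plan prescribes: the function names and the `ai_feature` key are hypothetical, and the internal dogfood stage (usually an allowlist) is not covered.

```python
import hashlib

ROLLOUT_STAGES = [5, 25, 100]  # canary percentages from the phased plan


def bucket(user_id: str, feature: str = "ai_feature") -> int:
    """Map a user to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def in_rollout(user_id: str, percent: int, kill_switch: bool = False) -> bool:
    """True when the user falls inside the current rollout percentage."""
    if kill_switch:
        return False  # the kill switch overrides every stage
    return bucket(user_id) < percent
```

Because each user's bucket is stable, anyone included at 5% stays included at 25% and 100%, so users do not flip in and out of the feature as the rollout widens.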

Go-live signals: golden-set score ≥ baseline minus tolerance, red-team scenarios all mitigated, latency p95 within budget, no Critical safety issues, PII redaction verified. Rollback signals: golden-set score drops more than 5pp, error rate above 5%, spend above 150% of baseline, any Critical safety incident. The decision becomes mechanical instead of political.
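"Mechanical" can be taken literally: the rollback signals fit in a single function that returns the reasons to roll back, if any. The sketch below assumes the four metrics are already being collected; the `RolloutMetrics` shape and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RolloutMetrics:
    golden_drop_pp: float    # drop versus baseline, in percentage points
    error_rate: float        # fraction of requests that errored (0.05 = 5%)
    spend_ratio: float       # current spend divided by baseline spend
    critical_incidents: int  # count of Critical safety incidents


def rollback_reasons(m: RolloutMetrics) -> List[str]:
    """Return every rollback signal that fired; an empty list means stay live."""
    reasons: List[str] = []
    if m.golden_drop_pp > 5.0:
        reasons.append("golden set dropped more than 5pp")
    if m.error_rate > 0.05:
        reasons.append("error rate above 5%")
    if m.spend_ratio > 1.5:
        reasons.append("spend above 150% of baseline")
    if m.critical_incidents > 0:
        reasons.append("Critical safety incident")
    return reasons
```

Returning the list of reasons, rather than a bare boolean, gives the incident record a ready-made explanation of why the rollback fired.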

How this plan pairs with the LLM Prompt Test Case template

The AI Feature Test Plan is feature-level; the LLM Prompt Test Case is per-prompt. One feature usually contains 1–5 prompts; each gets its own test case with version, input/expected pairs, forbidden patterns, baseline score, regression threshold. Use both: feature plan for the broader contract; prompt test case for nightly regression.
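As a rough picture of what a per-prompt record holds, the fields listed above could be modelled as follows. The class and field names are hypothetical and are not the actual schema of the LLM Prompt Test Case template; treating forbidden patterns as case-insensitive regular expressions is also an assumption.

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PromptTestCase:
    """One prompt's regression contract (illustrative field names)."""
    prompt_version: str
    pairs: List[Tuple[str, str]]            # (input, expected output)
    forbidden_patterns: List[str] = field(default_factory=list)
    baseline_score: float = 0.0             # e.g. 0.92 for 92% on the pairs
    regression_threshold_pp: float = 5.0    # allowed drop in percentage points

    def violates_forbidden(self, output: str) -> bool:
        """True when the output matches any forbidden pattern."""
        return any(
            re.search(pattern, output, re.IGNORECASE)
            for pattern in self.forbidden_patterns
        )
```

A feature with three prompts would carry three such records, each versioned alongside the prompt text it tests.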