Question 1

What does an AI feature test plan include?

Accepted Answer

A 2026 AI feature test plan includes: a golden evaluation set (typically 100+ examples), success metric (LLM-as-judge, exact match, or hybrid), latency/cost/hallucination budgets, red-team scenarios (jailbreak, prompt extraction, indirect injection), prompt-injection tests, privacy and data handling rules, fallback behaviour for model errors, monitoring stack with alert thresholds, retraining/maintenance cadence, and rollout/rollback signals.

Question 2

What is a hallucination budget?

Accepted Answer

A hallucination budget is the maximum acceptable percentage of factually incorrect outputs on your golden evaluation set. Typical budgets are under 2% for factual products, under 5% for general assistants. If a deployment violates the budget, the feature should not ship or should be rolled back. Without an explicit budget, 'hallucination tolerance' becomes whatever ships.

Question 3

What red-team scenarios should I test?

Accepted Answer

At minimum: jailbreak attempts (DAN-style 'ignore previous instructions'), prompt extraction (revealing the system prompt), indirect injection via user-provided documents, system prompt override, function-call misuse, PII extraction, misinformation generation. Each scenario should have explicit pass criteria — typically 'refuses and explains' or 'returns to safe default behaviour'.

Question 4

How do I test prompt injection?

Accepted Answer

Build a regression suite of injection attempts: 'ignore previous instructions,' hidden instructions in pasted user content, Markdown/HTML tag confusion, role-confusion via fake conversation turns. Run them nightly in CI. Detect failures via pattern matching plus a classifier. Treat any new successful injection as a P1 — they tend to be transferable to other prompts in your product.

Question 5

What is a golden evaluation set?

Accepted Answer

A curated dataset of input/expected-output pairs that represents the real distribution of your feature's traffic. Typically 100+ examples for a focused feature, 500+ for general capabilities. Update it monthly with examples from real failures. Run the golden set nightly in CI; block deploys that regress more than N percentage points.

Question 6

Should I document fallback behaviour?

Accepted Answer

Yes. When the LLM errors, times out, or returns malformed output, the user should see graceful degradation — not an unhandled exception or a broken UI. Document: retry policy (typically once), then degrade to a deterministic rule-based response or a 'this feature is temporarily unavailable' banner. Test the fallback path in staging.

AI Feature Test Plan (2026)

Eval Setup

Quality Budgets

Why AI features need their own test plan

The golden evaluation set — your regression baseline

Hallucination budget — the explicit tolerance

Red-team scenarios — adversarial testing

Prompt-injection regression suite

Monitoring stack

Fallback behaviour — the most-forgotten section

Rollout and rollback signals

How this plan pairs with the LLM Prompt Test Case template

LLM Prompt Test Case

Test Strategy

Quality Goals & SLOs

QA RFC / ADR