Estimate Testing Effort from User Stories with AI

Updated 2026-06-08 · intermediate · Test Strategy

Reads a backlog of user stories and returns testing effort estimates broken down by activity (test design, execution, automation, reporting), each with a confidence level (high/medium/low), plus a sanity-check flag when total testing effort exceeds 50% of the dev estimate.

When to use it

  • You're planning a sprint or quarter and need defensible QA effort numbers.
  • You're estimating capacity to negotiate scope with product.
  • You're calibrating against historical data to see which stories you underestimated.
  • You're a fractional QA lead estimating across multiple teams.

The prompt

XML-tagged — best for Claude 4.x

<role>
You are a QA planning lead. Your estimates are explicit about confidence, broken down by activity, and called out when they exceed reasonable proportions of development effort.
</role>

<context>
Estimating "QA effort" as a single number is useless. Break it into four buckets: test design, test execution (manual), automation authoring, reporting/communication. Each story gets a confidence level. Sanity check: if total QA effort exceeds 50% of the dev estimate, flag the story for re-discussion, because it is likely too big or too risky.
</context>

<task>
For each user story:
1. Estimate effort (in hours OR story points — match the user's unit) per activity bucket.
2. Sum the activities into a total QA estimate.
3. Assign a confidence level (High / Medium / Low) based on similarity to historical work and clarity of acceptance criteria.
4. Flag stories where QA estimate > 50% of dev estimate (only when dev estimate is provided).
5. Identify the largest source of uncertainty per story.
</task>

<input>
Story list (with brief description, dev estimate if available): {stories}
Unit (hours or story points): {unit}
Team context (automation maturity, historical velocity): {context}
</input>

<constraints>
- Four activity buckets, all four populated per story even if some are 0.
- Confidence: High (similar work shipped recently, criteria clear), Medium (some uncertainty), Low (novel work or vague criteria).
- Flag marker when QA estimate > 50% of dev: `(!)` or the text `HIGH RATIO`.
- Identify ONE source of uncertainty per story (don't list five for everything).
- Total at the bottom: sum across all stories.
</constraints>

<output_format>
Two sections:
1. **Estimation table** — Story | Design | Execution | Automation | Reporting | Total | Confidence | Uncertainty | Flag.
2. **Notes** — 2-3 sentences on assumptions and the largest aggregate uncertainty.
</output_format>

Before writing, identify which stories are most ambiguous so confidence levels are honest.

Common pitfalls

  • Model rounds to whole days when asked for hours, losing precision.
  • Confidence defaults to 'Medium' across the board — push back if it isn't varied per story.
  • HIGH RATIO flag gets omitted unless dev estimates are present AND the constraint is in the prompt.
  • Uncertainty defaults to 'requirements unclear' for everything — generic, useless; require specificity.
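
Most of these pitfalls can be caught mechanically before the table reaches a planning meeting. The sketch below is one possible post-check, assuming you have already parsed the model's table into records. The `StoryEstimate` class, its field names, and the `review` function are hypothetical and not part of the prompt. It recomputes each total from the four buckets, re-applies the 50% rule wherever a dev estimate exists, and reports generic uncertainty text and uniform confidence levels.

```python
from dataclasses import dataclass
from typing import List, Optional

# The four activity buckets named in the prompt.
BUCKETS = ("design", "execution", "automation", "reporting")


@dataclass
class StoryEstimate:
    """One parsed row of the estimation table (hypothetical structure)."""
    story: str
    design: float
    execution: float
    automation: float
    reporting: float
    total: float
    confidence: str
    uncertainty: str
    flagged: bool
    dev_estimate: Optional[float] = None  # same unit as the QA numbers


def review(rows: List[StoryEstimate]) -> List[str]:
    """Return human-readable issues found in the model's table."""
    issues = []
    for r in rows:
        bucket_sum = sum(getattr(r, b) for b in BUCKETS)
        # The reported total should equal the sum of the four buckets.
        if abs(bucket_sum - r.total) > 1e-9:
            issues.append(f"{r.story}: total {r.total} != bucket sum {bucket_sum}")
        # Re-apply the 50% rule ourselves, only when a dev estimate exists.
        if r.dev_estimate is not None:
            if bucket_sum > 0.5 * r.dev_estimate and not r.flagged:
                issues.append(f"{r.story}: HIGH RATIO flag missing")
        # Catch the generic fallback uncertainty.
        if r.uncertainty.strip().lower() == "requirements unclear":
            issues.append(f"{r.story}: generic uncertainty")
    # Identical confidence on every story suggests it was not assessed per story.
    if len(rows) > 1 and len({r.confidence for r in rows}) == 1:
        issues.append("all stories share one confidence level")
    return issues
```

Any issue it reports is a cue to push back on the model, for example by asking it to redo only the affected rows.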

Tips

  • Provide dev estimates whenever possible — the HIGH RATIO flag is the most valuable output, and it requires them.
  • Re-run quarterly with actual hours-spent data from last quarter so the estimates are calibrated against your own history.
  • Use this to negotiate scope — if total exceeds team capacity, the table makes it visible which stories to cut or descope.
  • Pair with `risk-based-testing-prioritization` — high-risk stories may deserve more QA hours than estimation alone suggests.
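
For the quarterly re-run, last quarter's actuals need to be condensed into something short enough to paste into `{context}`. A minimal sketch, assuming you can export estimated and actual effort per bucket for each completed story; `calibration_factors` and `as_context_line` are hypothetical helper names, and the ratio-of-sums approach is just one simple way to summarize the history.

```python
# The four activity buckets named in the prompt.
BUCKETS = ("design", "execution", "automation", "reporting")


def calibration_factors(history):
    """history: list of (estimated, actual) pairs, each a dict keyed by bucket.

    Returns actual/estimated per bucket, or None when nothing was estimated.
    """
    factors = {}
    for bucket in BUCKETS:
        estimated = sum(est[bucket] for est, _ in history)
        actual = sum(act[bucket] for _, act in history)
        factors[bucket] = round(actual / estimated, 2) if estimated else None
    return factors


def as_context_line(factors):
    """Format the factors as one line of team context for the prompt."""
    parts = [f"{b} x{f}" for b, f in factors.items() if f is not None]
    return "Historical actual/estimate ratio per bucket: " + ", ".join(parts)
```

A factor above 1 means that bucket was underestimated last quarter. Including the line in the team context gives the model a concrete basis for adjusting its estimates.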

FAQ

Should I estimate in hours or story points?

Whichever your team uses for dev estimates. Mixing units defeats the comparison. Hours are more direct for QA (sprint commitments, capacity planning). Points are more natural for teams with a stable point velocity.
