How do I find flaky tests at scale?

Many CI systems and test-analytics tools track flake rate (CircleCI Test Insights, for example, flags flaky tests). Sort by failure rate over the last 30 days; anything above 2% is flaky enough to investigate. Don't wait for someone to complain.
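
If your CI only exports raw results, the same ranking can be computed by hand. The sketch below is one possible way to do it, assuming a hypothetical `TestRun` row shape (`testName`, `passed`, `finishedAt`); real CI exports will use different field names.

```typescript
// Hypothetical shape of one exported CI result row; adapt to your CI's export.
interface TestRun {
  testName: string;
  passed: boolean;
  finishedAt: Date;
}

interface FlakeStat {
  testName: string;
  runs: number;
  failures: number;
  failureRate: number; // failures / runs, between 0 and 1
}

// Returns tests whose failure rate inside the window exceeds the threshold,
// sorted from highest to lowest failure rate.
function findFlakyTests(
  history: TestRun[],
  now: Date,
  windowDays: number = 30,
  threshold: number = 0.02,
): FlakeStat[] {
  const cutoff = now.getTime() - windowDays * 24 * 60 * 60 * 1000;
  const counts = new Map<string, { runs: number; failures: number }>();

  for (const run of history) {
    if (run.finishedAt.getTime() < cutoff) continue; // older than the window
    const entry = counts.get(run.testName) ?? { runs: 0, failures: 0 };
    entry.runs += 1;
    if (!run.passed) entry.failures += 1;
    counts.set(run.testName, entry);
  }

  return Array.from(counts.entries())
    .map(([testName, c]) => ({
      testName,
      runs: c.runs,
      failures: c.failures,
      failureRate: c.failures / c.runs,
    }))
    .filter((stat) => stat.failureRate > threshold)
    .sort((a, b) => b.failureRate - a.failureRate);
}
```

Failure rate alone cannot tell a flaky test from a consistently broken one, so a test near 100% probably belongs in a different triage queue.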

Can the same prompt fix multiple tests at once?

No — feed one test at a time. Batch fixing usually misclassifies causes because tests share files and the model averages signals across them.

What about flaky-by-design tests (deliberately probabilistic)?

Make them deterministic for CI (seed the random source) and run a separate non-deterministic suite outside the merge gate. Don't let probabilistic tests block PRs.
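
A minimal sketch of seeding the random source, assuming the code under test accepts its random function as a parameter. `mulberry32` is a small, widely used public-domain PRNG; `CI_SEED` and `pickVariant` are hypothetical names used only for illustration.

```typescript
// mulberry32: a small seedable PRNG that returns floats in [0, 1).
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical code under test: the random source is injected rather than
// read from Math.random, so the merge-gate suite can pin it.
function pickVariant(rng: () => number, variants: string[]): string {
  return variants[Math.floor(rng() * variants.length)];
}

// Hypothetical fixed seed for the merge-gate suite.
const CI_SEED = 12345;
const seededRng = mulberry32(CI_SEED);
const chosen = pickVariant(seededRng, ["A", "B", "C"]);
```

The separate non-deterministic suite can pass `Math.random` (or a fresh seed per run, logged on failure) to the same function.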

Refactor a Flaky Test to Stable

Updated 2026-06-08·advanced·Test Automation

Takes a flaky test and its failure history, identifies which of the canonical root causes (race, hard sleep, shared state, network dependency, ordering, animation) is responsible, and produces a rewritten test that fixes the specific cause — no blanket retries.

When to use it

A test fails ~5–30% of the time in CI and you don't know why.
You want to root-cause flakiness before deciding to retry or quarantine.
You're rewriting a legacy suite with `waitForTimeout` everywhere and want consistent patterns.
You want to teach junior engineers how to think about flake causes, not just retry counts.

The prompt

XML-tagged — best for Claude 4.x

<role>
You are a test reliability engineer. You believe retries hide bugs, not fix them. Before rewriting a flaky test, you identify its root cause by category — race, hard sleep, shared state, network dependency, ordering, or animation — and you address THAT cause, not a generic retry.
</role>

<context>
Canonical flake categories:
1. **Race** — Test asserts state before the event that produces it completes.
2. **Hard sleep** — `waitForTimeout` or `Thread.sleep` guesses at duration; passes on fast machines, fails on slow CI.
3. **Shared state** — Test depends on data/setup left by another test; order-dependent.
4. **Network dependency** — Test hits a real external service; intermittent failures.
5. **Ordering** — Test runs first or last and behaves differently.
6. **Animation / debounce** — Element exists in DOM but isn't actionable yet (mid-animation, debounced handler not flushed).
</context>

<task>
For the test code and failure history I provide:
1. Identify ONE canonical root cause from the categories above. Cite the specific signal in the code that maps to it.
2. Rewrite the test addressing that cause:
   - Race → replace with `expect(locator).toBeVisible()` or similar event-driven wait.
   - Hard sleep → replace with auto-waiting expect.
   - Shared state → add `test.beforeEach` setup and ensure teardown.
   - Network → stub via `page.route()` or `browserContext.route()`, or replay recorded responses with `page.routeFromHAR()`.
   - Ordering → make test fully self-contained (own data, own auth).
   - Animation → wait for animation end or use CSS to disable in test mode.
3. Explicitly REJECT adding retries (`retries` in the Playwright or Cypress config, `test.describe.configure({ retries })`) unless the cause is truly environmental and not in the test.
</task>

<input>
Test code: {test_code}
Failure pattern / history: {failures}
Framework version (Playwright X.Y, Cypress X.Y): {framework}
</input>

<constraints>
- Identify exactly ONE root cause; if you see multiple, pick the dominant and note the others.
- DIAGNOSIS BEFORE REWRITE — never produce a rewrite without naming the cause.
- Do not add retries at the test, suite, or config level (`retries`, `test.describe.configure({ retries })`).
- Do not introduce `waitForTimeout` in the fix.
- Keep the test's intent identical; reviewers should not see scope changes.
</constraints>

<output_format>
Three sections:
1. **Root cause** — Category name + 1-2 sentences citing the line in the original code.
2. **Rewritten test** — Code block.
3. **Why this fix works** — 2-3 sentences explaining what the change does and why retries weren't needed.
</output_format>

Read the test code carefully before declaring a cause.

Example
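
As a hypothetical illustration of the hard-sleep category, the sketch below shows a before and after for an invented "saves profile" test. The route, labels, and status text are assumptions, and `page.goto('/profile')` assumes a `baseURL` is set in the Playwright config. The root cause is the `waitForTimeout(2000)` call, which guesses how long the save takes.

```typescript
import { test, expect } from '@playwright/test';

// BEFORE (flaky): the fixed sleep guesses at the save duration, and the
// one-shot textContent() read does not retry if the status is still empty.
//
// test('saves profile', async ({ page }) => {
//   await page.goto('/profile');
//   await page.getByLabel('Display name').fill('Ada');
//   await page.getByRole('button', { name: 'Save' }).click();
//   await page.waitForTimeout(2000);
//   expect(await page.getByRole('status').textContent()).toBe('Saved');
// });

// AFTER: the web-first assertion polls until the status reads "Saved"
// or the expect timeout elapses. Same assertion, different wait strategy.
test('saves profile', async ({ page }) => {
  await page.goto('/profile');
  await page.getByLabel('Display name').fill('Ada');
  await page.getByRole('button', { name: 'Save' }).click();
  await expect(page.getByRole('status')).toHaveText('Saved');
});
```

No retry is needed here because the assertion now waits on the state the test cares about instead of on elapsed time.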

Common pitfalls

Model identifies multiple causes and rewrites ambiguously — force ONE dominant cause.
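Rewrite swaps `waitForTimeout` for `.waitFor()` instead of an `expect(...)` auto-wait. `.waitFor()` does wait on element state, but it adds a separate wait step that asserts nothing; ask for a web-first `expect(...)` that waits and asserts in one call.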
Network-stub fix introduces new bugs if the stub schema drifts from real responses — mention in 'Why this works' that schema parity must be maintained.
Model adds extra assertions while rewriting (scope creep) — keep the same assertions, only change the wait strategy.
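
One way to guard against the stub-drift pitfall above is to type the stub against the same interface the application uses and to compare it with a recorded real response. This is a sketch under assumptions: `UserResponse`, `stubUser`, and `missingKeys` are hypothetical names, and the check only compares top-level keys.

```typescript
// Hypothetical response type; ideally imported from the module the app uses.
interface UserResponse {
  id: number;
  name: string;
  email: string;
}

// Typing the stub against the shared interface turns a renamed or removed
// field into a compile error instead of a silently stale fixture.
const stubUser: UserResponse = {
  id: 1,
  name: 'Test User',
  email: 'test@example.com',
};

// Returns the top-level keys present in a recorded real response
// but absent from the stub.
function missingKeys(stub: object, realSample: object): string[] {
  const stubKeys = new Set(Object.keys(stub));
  return Object.keys(realSample).filter((key) => !stubKeys.has(key));
}
```

Running `missingKeys` against a freshly recorded response in a scheduled job, outside the merge gate, surfaces drift without making the stubbed test depend on the network again.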

Tips

Paste the actual failure log if available — it often contains the smoking gun (e.g., timeout duration, last successful state).
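Run the rewritten test at least 20 times locally before claiming it's stable (with Playwright, `npx playwright test --repeat-each=20`); flake is statistical, so a low failure rate needs more runs to surface.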
Pair with `review-test-code-anti-patterns` to find the same class of flake across the suite.
If the suite has many hard sleeps, batch-fix them via `sync-waits-to-auto-waiting` instead of one by one.
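
To judge how many repeat runs are enough, the arithmetic can be sketched as follows. It assumes runs are independent and that the per-run failure probability is constant, which real flakiness only approximates; the function names are illustrative.

```typescript
// Probability that a test with per-run failure probability p
// passes n independent runs in a row.
function passAllProbability(p: number, n: number): number {
  return Math.pow(1 - p, n);
}

// Smallest number of runs n such that at least one failure shows up
// with the requested confidence: (1 - p)^n <= 1 - confidence.
function runsNeeded(p: number, confidence: number): number {
  if (p <= 0 || p >= 1 || confidence <= 0 || confidence >= 1) {
    throw new RangeError('p and confidence must be strictly between 0 and 1');
  }
  return Math.ceil(Math.log(1 - confidence) / Math.log(1 - p));
}

// A test that still fails 5% of the time passes 20 runs in a row
// with probability 0.95^20, roughly 0.36.
const stillGreenAfter20 = passAllProbability(0.05, 20);

// Runs needed to see that 5% flake at least once with 95% confidence.
const runsFor5Percent = runsNeeded(0.05, 0.95); // 59
```

Under these assumptions, 20 green runs are good evidence against a 30% flake but weak evidence against a 5% one.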

FAQ

Retries are acceptable when the root cause is genuinely outside your test, for example a third-party CDN with intermittent timeouts or shared test infrastructure that occasionally hiccups. They are wrong when the cause is in YOUR code or YOUR test, which is the case the categorization above addresses.
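
If retries are justified for an environmental cause, one option is to scope them to the affected tests rather than the whole suite. The fragment below is a sketch using Playwright's `test.describe.configure({ retries })`; the group name, route, and image name are hypothetical, and `page.goto('/')` assumes a `baseURL` in the config.

```typescript
import { test, expect } from '@playwright/test';

test.describe('third-party CDN smoke checks', () => {
  // Retries apply only to tests in this group; the rest of the suite
  // keeps whatever the project config specifies.
  test.describe.configure({ retries: 2 });

  test('hero image loads from the CDN', async ({ page }) => {
    await page.goto('/');
    await expect(page.getByRole('img', { name: 'Hero' })).toBeVisible();
  });
});
```

Keeping the retry next to the tests it covers makes the exception visible in review, which a global `retries` setting does not.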

Related prompts

Test Automation·intermediate

Generate Playwright Page Object Model

Give the model a page description plus a list of UI elements and it returns a complete Page Object Model in TypeScript using Playwright's auto-waiting locators (getByRole / getByTestId), typed action and assertion methods, and a page-level fixture.

Test Code Review·intermediate

Review Test Code for Anti-Patterns

Reads a test file and returns a categorized list of anti-patterns — hard sleeps, shared mutable state, weak assertions (`toBeTruthy` instead of `toEqual`), missing teardown, mixed setup/assertion concerns — each with line numbers, severity, and a suggested fix.

Test Code Review·intermediate

Convert Synchronous Waits to Auto-Waiting

Reads a test using hard waits and returns a rewritten version using Playwright auto-waiting (`expect(locator).toBeVisible()`, `toHaveText()`, `toHaveCount()`) — justifies each replacement by what state the original was waiting for, preserves the test's intent.

Test Automation·advanced

Convert Manual Test Cases to Playwright

Reads manual test steps (Action / Expected Result) and produces a Playwright spec with locator suggestions, action method calls (assuming a POM exists), assertions matching expected results, and explicit `// MANUAL:` comments where automation can't replicate human judgment.
