Analyze Performance Bottlenecks with AI

Updated 2026-06-08 · advanced · Performance Testing

Reads a load test result summary (latency percentiles, throughput, error rate, system metrics) and returns a ranked list of suspected bottleneck layers — network, application, database, dependent service, or infrastructure — each with evidence cited from the metrics and a recommended next investigation step.

When to use it

  • A load test failed thresholds and you need direction before deep profiling.
  • Production p95 climbed and you need a hypothesis before opening every dashboard.
  • You are comparing two load test runs and need to explain a regression.
  • You are teaching engineers to read performance data; the prompt models the reasoning.

The prompt

XML-tagged — best for Claude 4.x

<role>
You are a performance engineer who has shipped capacity planning for high-scale systems. You generate hypotheses, not conclusions, because performance data is rarely unambiguous. Every claim cites the metric that supports it.
</role>

<context>
Bottleneck layers (typical order of investigation):
- **Network** — latency variance, packet loss, regional routing
- **Application** — CPU saturation, GC pauses, lock contention
- **Database** — connection pool exhaustion, slow queries, lock waits
- **Dependent service** — third-party latency, retries, rate limits
- **Infrastructure** — instance class limits, disk I/O, memory pressure

Each layer has signature metrics. Don't conclude from one signal; corroborate from multiple.
</context>

<task>
For the test results below:
1. Identify 2-4 candidate bottleneck layers, with one hypothesis per layer.
2. Rank by likelihood, citing the SPECIFIC metric(s) from the results that support each rank.
3. Recommend ONE next investigation step per hypothesis (what to measure, where to look).
4. Note ALL conflicting signals (metrics that contradict the top hypothesis).
5. Flag if data is insufficient to rank with confidence; name what additional data would help.
</task>

<input>
Test result summary (metrics): {results}
System architecture (services, dependencies): {architecture}
What you tested (load profile): {profile}
Recent changes (deploys, config): {changes}
</input>

<constraints>
- Never claim "the bottleneck is X" without 2+ supporting metrics.
- Rank by likelihood, not order of discovery.
- Conflicting signals are MANDATORY to surface — if everything aligns, the data is probably oversimplified.
- Next-step recommendations must be ONE step ("check connection pool utilization"), not a list ("debug everything").
- Avoid generic recommendations ("profile the application").
</constraints>

<output_format>
Four sections:
1. **Ranked hypotheses** — table: Rank | Layer | Evidence | Confidence (High / Medium / Low)
2. **Next steps** — table: Hypothesis | Next investigation step
3. **Conflicting signals** — bullet list (or "None observed" — but interrogate before saying that)
4. **Data gaps** — what's missing that would sharpen the analysis
</output_format>

Before writing, identify the FIRST point in the load profile where things degraded — the timing is the most informative signal.

Example
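A minimal sketch of filling the prompt's four placeholders before sending it, assuming the template is kept as a Python string. Only the `<input>` block is reproduced here; `build_input` and every sample value are hypothetical and stand in for your own test data.

```python
import json

# Only the <input> block of the prompt; the other sections are unchanged.
INPUT_BLOCK = """<input>
Test result summary (metrics): {results}
System architecture (services, dependencies): {architecture}
What you tested (load profile): {profile}
Recent changes (deploys, config): {changes}
</input>"""


def build_input(results: dict, architecture: str, profile: str, changes: str) -> str:
    """Fill the four placeholders (hypothetical helper, not part of the prompt).

    Metrics are serialized as JSON so the model can cite exact values.
    """
    return INPUT_BLOCK.format(
        results=json.dumps(results, sort_keys=True),
        architecture=architecture,
        profile=profile,
        # Say so explicitly when there were no changes instead of leaving a blank.
        changes=changes or "none reported",
    )


# Illustrative values only; replace with your real load test summary.
filled = build_input(
    results={"p95_ms": 840, "p99_ms": 2300, "error_rate": 0.021, "rps": 410},
    architecture="api-gateway -> orders-service -> postgres",
    profile="ramp 0 to 500 virtual users over 10 minutes, hold 20 minutes",
    changes="",
)
print(filled)
```

Passing the metrics as structured JSON rather than a prose summary makes it easier for the model to quote the specific metric behind each ranking, which the task section asks for.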

Common pitfalls

  • The model jumps to a single conclusion ('it's the database') without acknowledging other possible causes.
  • Confidence levels get omitted, so every hypothesis is treated as equally likely. Re-prompt for a ranking with confidence levels.
  • Conflicting signals are skipped because the model leans toward a tidy, helpful-sounding answer. Require the section explicitly.
  • Recommendations get generic ('profile the application'). Demand specific, actionable next steps.
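One way to catch these pitfalls is to check each response before accepting it. The sketch below assumes the model returns plain text that uses the four section titles from the prompt's output format; `check_response` and the list of generic phrases are illustrative assumptions, not part of any library.

```python
import re
from typing import List

# Section titles taken from the prompt's <output_format>.
REQUIRED_SECTIONS = ["Ranked hypotheses", "Next steps", "Conflicting signals", "Data gaps"]
CONFIDENCE = re.compile(r"\b(High|Medium|Low)\b")
# Phrases the prompt itself calls out as too generic.
GENERIC_PHRASES = ["profile the application", "debug everything"]


def check_response(text: str) -> List[str]:
    """Return a list of problems found in a model response (empty list = passed)."""
    problems = []
    lowered = text.lower()
    for section in REQUIRED_SECTIONS:
        if section.lower() not in lowered:
            problems.append(f"missing section: {section}")
    if not CONFIDENCE.search(text):
        problems.append("no confidence level (High / Medium / Low) found")
    for phrase in GENERIC_PHRASES:
        if phrase in lowered:
            problems.append(f"generic recommendation: {phrase!r}")
    return problems


# Hypothetical response that silently drops the conflicting-signals section.
sample = """1. Ranked hypotheses
| 1 | Database | pool wait 900 ms; active connections at max | High |
2. Next steps
| Database pool exhaustion | check connection pool utilization |
4. Data gaps
GC logs not captured
"""
print(check_response(sample))
```

An empty list means the response passed these surface checks; any reported problem can be pasted back as a re-prompt. The checks are textual only and do not judge whether the evidence actually supports the ranking.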

Tips

  • Feed the full timeline (when did each metric change?) — the order of degradation is the most informative signal.
  • Include 'recent changes' even if you don't think they're related; the model is good at spotting correlations a human reviewer glossed over.
  • Pair with `bug-repro-from-logs` when you have application/HAR logs alongside the metrics.
  • Re-run as data comes in from the next-step investigation — the analysis sharpens with each iteration.
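The first tip can be prepared ahead of time by computing when each metric first crossed a limit, so the order of degradation is explicit in what you feed the model. This is a sketch under assumptions: the metric names, thresholds, and sample values are hypothetical, and each series is a list of `(seconds_into_test, value)` pairs.

```python
from typing import Dict, List, Optional, Tuple

Series = List[Tuple[int, float]]  # (seconds into test, metric value)


def first_breach(samples: Series, threshold: float) -> Optional[int]:
    """Return the time of the first sample above the threshold, or None."""
    for t, value in samples:
        if value > threshold:
            return t
    return None


def degradation_order(series: Dict[str, Series],
                      thresholds: Dict[str, float]) -> List[Tuple[int, str]]:
    """List (time, metric) pairs for every metric that breached, earliest first."""
    breaches = []
    for name, samples in series.items():
        t = first_breach(samples, thresholds[name])
        if t is not None:
            breaches.append((t, name))
    return sorted(breaches)


# Illustrative data only: the DB pool saturates before latency degrades.
series = {
    "p95_latency_ms": [(0, 120), (60, 130), (120, 180), (180, 450), (240, 900)],
    "db_pool_in_use_pct": [(0, 40), (60, 70), (120, 96), (180, 100), (240, 100)],
    "app_cpu_pct": [(0, 35), (60, 45), (120, 50), (180, 55), (240, 60)],
}
thresholds = {"p95_latency_ms": 300, "db_pool_in_use_pct": 90, "app_cpu_pct": 85}

order = degradation_order(series, thresholds)
for t, name in order:
    print(f"t={t}s: {name} crossed {thresholds[name]}")
```

Metrics that never breach are left out of the ordering; it is worth listing them separately in the input, since a metric that stayed healthy is often the conflicting signal the prompt asks for.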

FAQ

GC pauses are a likely cause when app CPU spikes rather than staying steadily high, p99 is much worse than p95 (long-tail latency from stop-the-world pauses), and neither DB nor network metrics move with the latency. Check the GC pause logs to confirm.
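The gap between p99 and p95 can be quantified from raw latency samples before opening the GC logs. The sketch below uses a nearest-rank percentile; the cutoff `TAIL_RATIO_HINT` is an arbitrary assumption for illustration, not an established threshold, and the sample latencies are made up.

```python
from typing import List

# Assumed cutoff for "p99 much worse than p95"; tune it for your own system.
TAIL_RATIO_HINT = 3.0


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct/100 * n
    return ordered[max(rank, 1) - 1]


def tail_ratio(latencies_ms: List[float]) -> float:
    """Ratio of p99 to p95; a large value points at a long tail, not uniform slowness."""
    return percentile(latencies_ms, 99) / percentile(latencies_ms, 95)


# Hypothetical run: most requests take 100 ms, a few stall for a full second.
latencies = [100.0] * 96 + [1000.0] * 4
ratio = tail_ratio(latencies)
print(f"p99/p95 = {ratio:.1f}, long tail suspected: {ratio >= TAIL_RATIO_HINT}")
```

A high ratio only says the tail is long. It fits GC pauses, but also lock contention or retries to a dependent service, so it is one supporting metric to cite and not a conclusion on its own.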
