How do I interpret long-tail latency (high p99)?

Long tails come from queuing or rare slow paths. Queue depth, GC, lock contention, slow query outliers all cause them. Look at the gap between p95 and p99: if p99 is 4x p95, something is wrong in the tail.

When are the results inconclusive?

When 3+ layers all show stress signals at the same time. That's usually a sign of cascading failure (one bottleneck causes others). Re-test at a lower load to see what breaks first.

Can I skip metrics and just look at logs?

No. Logs tell you what happened to individual requests; metrics tell you the pattern across thousands. Bottleneck analysis is statistical — you need the metrics view.

Analyze Performance Bottlenecks with AI

Updated 2026-06-08·advanced·Performance Testing

Reads a load test result summary (latency percentiles, throughput, error rate, system metrics) and returns a ranked list of suspected bottleneck layers — network, application, database, dependent service, or infrastructure — each with evidence cited from the metrics and a recommended next investigation step.

When to use it

A load test failed thresholds and you need direction before deep profiling.
Production p95 climbed and you need a hypothesis before opening every dashboard.
Comparing two load test runs to explain the regression.
Teaching engineers to read performance data — the prompt teaches the reasoning.

The prompt

XML-tagged — best for Claude 4.x

<role>
You are a performance engineer who has shipped capacity planning for high-scale systems. You generate hypotheses, not conclusions, because performance data is rarely unambiguous. Every claim cites the metric that supports it.
</role>

<context>
Bottleneck layers (typical order of investigation):
- **Network** — latency variance, packet loss, regional routing
- **Application** — CPU saturation, GC pauses, lock contention
- **Database** — connection pool exhaustion, slow queries, lock waits
- **Dependent service** — third-party latency, retries, rate limits
- **Infrastructure** — instance class limits, disk I/O, memory pressure

Each layer has signature metrics. Don't conclude from one signal; corroborate from multiple.
</context>

<task>
For the test results below:
1. Identify 2-4 candidate bottleneck layers with hypothesis per layer.
2. Rank by likelihood, citing the SPECIFIC metric(s) from the results that support each rank.
3. Recommend ONE next investigation step per hypothesis (what to measure, where to look).
4. Note ALL conflicting signals (metrics that contradict the top hypothesis).
5. Flag if data is insufficient to rank with confidence; name what additional data would help.
</task>

<input>
Test result summary (metrics): {results}
System architecture (services, dependencies): {architecture}
What you tested (load profile): {profile}
Recent changes (deploys, config): {changes}
</input>

<constraints>
- Never claim "the bottleneck is X" without 2+ supporting metrics.
- Rank by likelihood, not order of discovery.
- Conflicting signals are MANDATORY to surface — if everything aligns, the data is probably oversimplified.
- Next-step recommendations must be ONE step ("check connection pool utilization"), not a list ("debug everything").
- Avoid generic recommendations ("profile the application").
</constraints>

<output_format>
Four sections:
1. **Ranked hypotheses** — table: Rank | Layer | Evidence | Confidence (High / Medium / Low)
2. **Next steps** — table: Hypothesis | Next investigation step
3. **Conflicting signals** — bullet list (or "None observed" — but interrogate before saying that)
4. **Data gaps** — what's missing that would sharpen the analysis
</output_format>

Before writing, identify the FIRST point in the load profile where things degraded — the timing is the most informative signal.

Example

Common pitfalls

Model jumps to a single conclusion ('it's the database') without acknowledging the multiple possible causes.
Confidence levels get omitted; every hypothesis treated as equally likely. Re-prompt for ranking with confidence.
Conflicting signals are skipped — the model wants to be helpful. Force the section.
Recommendations get generic ('profile the application'). Demand specific, actionable next steps.

Tips

Feed the full timeline (when did each metric change?) — the order of degradation is the most informative signal.
Include 'recent changes' even if you don't think they're related — the model is good at finding correlations the human glossed over.
Pair with `bug-repro-from-logs` when you have application/HAR logs alongside the metrics.
Re-run as data comes in from the next-step investigation — the analysis sharpens with each iteration.

FAQ

App CPU has spikes (not steady high), p99 is much worse than p95 (long-tail latency from stop-the-world pauses), and there's no DB or network correlation. Check GC pause logs to confirm.

Related prompts

Performance Testingintermediate

Design a Load Test Strategy

Returns a load test strategy covering 5 scenario types (baseline / load / stress / spike / soak) with thresholds for response time, throughput, and error rate, environment requirements, monitoring checkpoints, and pass/fail criteria — and explicit environment-parity statement.

Open →

Performance Testingintermediate

Generate k6 Test Script from Endpoint

Reads an endpoint description and returns a ready-to-run k6 script with `options.scenarios` (ramping-arrival-rate), thresholds for p95/p99/error rate, realistic think times, and a `handleSummary()` for exporting to Grafana / InfluxDB or k6 Cloud.

Open →

Bug Triage & Reportingadvanced

Reproduce a Bug from Logs

Reads application logs, HAR files, or browser console excerpts and reconstructs a step-by-step reproduction recipe with timestamps, the failing request, suspected preconditions, and a confidence flag per inferred step.

Open →

Bug Triage & Reportingintermediate

Root Cause Analysis (5 Whys + Fishbone)

Given a defect description, returns a literal 5 Whys chain, a fishbone diagram (text representation) categorizing contributing factors into People / Process / Technology / Environment, and a list of preventive measures with named owners — never generic recommendations.

Open →