CICDCost.com is an independent comparison resource. Not affiliated with GitHub, GitLab, CircleCI, Buildkite, or any CI/CD vendor.

Failed build rerun cost in 2026: what flaky tests cost in compute and time

Every retry on a failed build runs the pipeline again. The CI bill rises by at least the percentage of builds that retry, and by more when retries compound: a flake on retry triggers another retry. The compute cost is small. The hidden bill is the developer time spent investigating flakes, which is consistently larger and absent from most cost dashboards. This page quantifies the rerun line item using public data on flake rates and the same payroll anchors as the wait-time analysis.

Headline at a glance (2026)

A 10% flake rate at 25 developers pushing 5 times each per day adds about 11% to your monthly compute spend (typically $50-200) and roughly $3,000-5,000 per month in developer-time investigation cost. The deflake project that drops the rate to 2% pays back within a few months; the harder challenge is the engineering discipline to actually do it.

What "flake rate" actually measures

Flake rate is the percentage of CI runs that fail and then pass on retry without any code changes. The denominator is total runs, not just failed runs. A 10% flake rate means one in ten runs flakes; a 5% rate is one in twenty. Different teams measure it differently: some count test-case retries (each individual test that retries successfully), some count job retries (an entire job that re-runs successfully), some count workflow retries (a complete pipeline rerun). The job-level rate is the one most CI vendor dashboards expose by default.

Industry data on typical rates is consistent. Surveys of test flakiness show 5-15% flake rates as common in browser-test-heavy and integration-test-heavy codebases. Pure unit test suites typically run under 2% once they are mature. Newer codebases tend toward higher flake rates until their test infrastructure stabilises. A 10% rate is common for web teams; a 0.5-2% rate is the achievable target for teams that invest in flake reduction.

The compute cost of reruns

A 10% flake rate means roughly one extra retry for every ten runs. Because a retry can itself flake, expected runs per push form a geometric series, 1/(1 - 0.10) ≈ 1.11, or 11% more total runs. If your monthly compute bill is $1,000, the flake-driven inflation is $110/month. At a 5% rate it would be about $50. At a 1% rate, $10.

Worse cases happen when the retry itself flakes and triggers another retry. Retries are bounded by your CI workflow config (most teams cap at 2-3), but each one compounds the spend. A flake-prone test with a 30% per-run failure rate under a three-retry cap expects 1 + 0.3 + 0.3^2 + 0.3^3 ≈ 1.42 runs per push, and in the worst case (three straight flakes, a 0.3^3 ≈ 2.7% chance) costs three extra runs. A small percentage of branches consume a large percentage of the rerun budget; flake-prone tests are a tail-distribution problem.
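
To make the compounding concrete, here is a minimal sketch of the expected-runs arithmetic in Python; the retry cap and flake rates are illustrative assumptions, not vendor defaults.

    def expected_runs(flake_rate: float, max_retries: int = 3) -> float:
        """Expected CI runs per push when every failure is auto-retried.

        Each retry can itself flake, so the run count is a truncated
        geometric series: 1 + p + p^2 + ... up to the retry cap.
        """
        return sum(flake_rate ** k for k in range(max_retries + 1))

    # Uncapped, the series converges to 1 / (1 - p): ~1.11 runs/push at 10%.
    for p in (0.01, 0.05, 0.10, 0.30):
        print(f"flake rate {p:>4.0%}: {expected_runs(p):.2f} expected runs/push")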

The developer-time cost

Investigation time is the larger line item. Every flake requires at minimum: read the failure, decide whether it is a real bug or a flake, click retry. Three minutes if you are lucky, fifteen if the failure looks plausible enough to start debugging. Average per-flake investigation across realistic teams: 5-12 minutes.

Worked example: 25 developers, 5 pushes/dev/day, 20 working days = 2,500 pipeline runs/month. At 10% flake rate, 250 reruns. At 8 minutes average investigation per flake, 33 hours/month of investigation. At $100/hour fully loaded, $3,300/month in payroll value. Plus the 11% compute increase ($110/month on a $1,000 bill). Total flake-driven cost: about $3,400/month.

The same workload at 1% flake rate: 25 reruns/month, 3.3 hours of investigation, $330 in payroll plus $10 compute = $340 total. Dropping flake rate from 10% to 1% saves roughly $3,000/month for this team. The economic case for a deflake project is straightforward: if you can land it in two weeks of one engineer's time (roughly $8,000 at the same $100/hour fully loaded rate) and the saving is $3,000/month, it pays back in under three months.
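
The worked example reduces to a few lines. This sketch hard-codes the article's assumptions (25 developers, 5 pushes/day, 20 days, 8 minutes per investigation, $100/hour, $1,000/month compute) so you can substitute your own numbers:

    def monthly_flake_cost(devs=25, pushes_per_day=5, days=20,
                           flake_rate=0.10, invest_minutes=8,
                           hourly_rate=100, compute_bill=1_000):
        """Monthly flake cost: investigation payroll plus compute inflation."""
        runs = devs * pushes_per_day * days                      # 2,500 runs/month
        flakes = runs * flake_rate                               # 250 reruns at 10%
        payroll = flakes * invest_minutes / 60 * hourly_rate     # ~$3,333
        compute = compute_bill * flake_rate / (1 - flake_rate)   # ~$111
        return payroll + compute

    print(f"10% flake rate: ${monthly_flake_cost(flake_rate=0.10):,.0f}/month")
    print(f" 1% flake rate: ${monthly_flake_cost(flake_rate=0.01):,.0f}/month")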

Auto-retry: the tempting wrong answer

The most common "solution" to flake is auto-retry of failed jobs at the CI workflow level. GitHub Actions continue-on-error: true with a retry step. CircleCI retry: { max_retries: 3 }. GitLab's retry: 2. Auto-retry hides flakes from developers, lets them merge their PR, and keeps the team velocity up.

The reason this is a wrong answer in the long run: hidden flakes accumulate. Every test that flakes occasionally and gets auto-retried becomes a test that flakes more often six months later. The compute cost rises (each retry is real spend). The signal-to-noise ratio degrades because real failures are sometimes auto-retried into apparent passes. Eventually the test suite is no longer a reliable signal of code quality, which defeats the point of having one.

Better pattern: auto-retry only at the test-case level (not the whole job), cap retries at 2, and emit a per-test retry-rate metric. Use the metric to drive a quarantine queue: any test whose retry rate exceeds 5% over two weeks gets moved out of the blocking suite into a non-blocking nightly job, with a tracking ticket. Quarantined tests get fixed or deleted within a sprint. The metric must be visible (a dashboard, a Slack post on threshold breach) or the discipline does not stick.
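
Once the per-test retry metric exists, the quarantine decision itself is small. A sketch, assuming you already aggregate runs and successful retries per test over the two-week window; the data shape and thresholds are illustrative:

    from dataclasses import dataclass

    @dataclass
    class TestStats:
        name: str
        runs: int     # executions in the two-week window
        retries: int  # executions that failed, then passed on retry

    RETRY_RATE_CUTOFF = 0.05  # the 5% threshold described above
    MIN_RUNS = 20             # skip tests with too little data to judge

    def quarantine_candidates(stats: list[TestStats]) -> list[TestStats]:
        """Tests to move from the blocking suite to the nightly job."""
        return [t for t in stats
                if t.runs >= MIN_RUNS and t.retries / t.runs > RETRY_RATE_CUTOFF]

    window = [TestStats("test_checkout_flow", runs=180, retries=14),  # 7.8%
              TestStats("test_price_rounding", runs=180, retries=1)]  # 0.6%
    for t in quarantine_candidates(window):
        print(f"quarantine {t.name}: {t.retries / t.runs:.1%} retry rate")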

The deflake patterns that work

Three patterns consistently reduce flake rate. First, isolate test state. Tests that share global state (database rows, files on disk, environment variables) flake when they run in different orders. Per-test fresh state (per-test database transactions that roll back, ephemeral test directories, dependency injection of test doubles) eliminates the largest source of flake.
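
In pytest terms, per-test fresh state can be as small as one fixture. A sketch assuming SQLite; the table and test are illustrative:

    import sqlite3

    import pytest

    @pytest.fixture
    def db():
        """Each test gets its own in-memory database: no shared rows,
        no ordering dependence, nothing left behind to clean up."""
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
        yield conn
        conn.close()

    def test_insert_sees_only_its_own_state(db, tmp_path):
        db.execute("INSERT INTO orders (total) VALUES (9.99)")
        (tmp_path / "receipt.txt").write_text("ok")  # fresh per-test directory
        assert db.execute("SELECT COUNT(*) FROM orders").fetchone()[0] == 1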

Second, eliminate timing dependencies. Tests that rely on setTimeout, polling, or assumptions about which thread runs first are intrinsically flaky. Use deterministic time (clock injection in your test framework), explicit synchronisation primitives, and condition-based waits ("wait until visible", not "wait 500ms"). Browser test frameworks like Playwright and Cypress have this baked in; older Selenium-based suites are the canonical source of flaky tests.
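
Both ideas fit in a few lines of plain Python: an injected clock so tests never depend on real time, and a condition-based wait in place of a fixed sleep. The class and function names are illustrative:

    import time
    from typing import Callable

    class FakeClock:
        """Stands in for time.monotonic in tests: advances only on demand."""
        def __init__(self) -> None:
            self.now = 0.0
        def monotonic(self) -> float:
            return self.now
        def advance(self, seconds: float) -> None:
            self.now += seconds

    def wait_until(condition: Callable[[], bool],
                   timeout: float = 5.0, interval: float = 0.05) -> None:
        """'Wait until true', never 'wait 500ms and hope'."""
        deadline = time.monotonic() + timeout
        while not condition():
            if time.monotonic() > deadline:
                raise TimeoutError("condition not met within timeout")
            time.sleep(interval)

Production code takes the clock as a parameter; tests pass a FakeClock and call advance() instead of sleeping, which makes the timing deterministic.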

Third, mock external dependencies aggressively. Tests that hit a real network endpoint, even an internal one, will flake when the endpoint is slow or unavailable. Mock at the boundary, and contract-test that boundary in a separate suite that runs less often. Industry data consistently identifies network-touching tests as the largest single category of flake.
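
A sketch of the boundary mock using only the standard library; the endpoint URL and the fetch_exchange_rate helper are hypothetical:

    from unittest.mock import MagicMock, patch

    import requests

    def fetch_exchange_rate(currency: str) -> float:
        """The boundary: the only code in the suite that touches the network."""
        resp = requests.get(f"https://rates.internal.example/{currency}", timeout=5)
        resp.raise_for_status()
        return resp.json()["rate"]

    def test_rate_lookup_without_network():
        fake = MagicMock()
        fake.json.return_value = {"rate": 1.25}
        # The test never reaches a real endpoint, so it cannot flake
        # on a slow or unavailable service.
        with patch("requests.get", return_value=fake):
            assert fetch_exchange_rate("EUR") == 1.25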

Tooling: when to buy vs build

Several commercial tools detect flakes automatically and can quarantine them: Trunk Flaky Tests, BuildPulse, and Datadog CI Visibility. Pricing is typically $20-50/seat/month. For a 25-developer team that is $500-1,250/month: several times the compute cost of the flakes themselves, but well under the roughly $3,400/month all-in cost in the worked example above. The buy decision is whether your team has the bandwidth to maintain the discipline manually; if not, the tool pays for itself by enforcing the quarantine workflow.

Build it yourself: a simple cron job that queries your CI vendor's API for same-SHA reruns, aggregates flake rate per test case, and posts to a dashboard or Slack channel. A weekend project that gives 80% of the value for $0 ongoing. Most teams that try this once never look at the dashboard again, which is why the commercial tools exist.
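
A sketch of that cron job against the GitHub Actions REST API, which records a run_attempt counter on every workflow run; the repo name and token handling are placeholders, and other vendors expose equivalent data:

    import os
    from collections import Counter

    import requests

    API = "https://api.github.com/repos/{repo}/actions/runs"

    def flake_summary(repo: str, pages: int = 5) -> Counter:
        """Count runs that only succeeded on a re-run: a same-SHA pass
        after a failure (run_attempt > 1) is the flake signature."""
        headers = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}
        tally: Counter = Counter()
        for page in range(1, pages + 1):
            resp = requests.get(API.format(repo=repo), headers=headers,
                                params={"per_page": 100, "page": page}, timeout=30)
            resp.raise_for_status()
            for run in resp.json()["workflow_runs"]:
                tally["total"] += 1
                if run["run_attempt"] > 1 and run["conclusion"] == "success":
                    tally["flaky"] += 1
        return tally

    if __name__ == "__main__":
        t = flake_summary("your-org/your-repo")
        print(f"flake rate: {t['flaky'] / t['total']:.1%} of {t['total']} runs")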

Frequently Asked Questions

What is a flaky test?

A flaky test is one that produces different results on identical inputs without code changes. It passes sometimes and fails sometimes for reasons unrelated to the code under test: timing, ordering, network, shared state, randomness. Flake is distinct from a real bug because re-running on the same commit produces a different result. Industry research consistently puts the median flake rate for a mature codebase at 0.5-2% of test runs, with younger or browser-test-heavy codebases at 5-15%.

What does a CI rerun cost?

A rerun costs the same compute as the original run plus the developer-time cost of investigating the failure. For a 12-minute Linux build at GitHub Actions' $0.008/min rate, the compute cost of one rerun is about $0.10. The developer-time cost is much larger: even 5 minutes of investigation at $100/hour fully loaded is $8.33. So the all-in cost of a rerun is dominated by developer time, not compute. At a 10% flake rate, a team of 25 developers running 200 builds/day pays roughly $40 in compute and $3,000-5,000 in developer time per month on reruns alone.

How do I measure my flake rate?

Compare CI run outcomes for the same commit. If you re-run a failed build without changes and it passes, that is a flake. Tools like Datadog CI Visibility, Trunk Flaky Tests, and BuildPulse auto-detect flakes by tracking same-commit re-runs. Manual measurement: query your CI vendor's API for runs with status=failure followed by status=success on the same SHA within 24 hours. Most teams discover their flake rate is higher than they thought.

Should I auto-retry failed CI jobs?

Conditionally. Auto-retry of the entire build hides flakes and lets the underlying problem fester, while doubling the compute spend on flaky branches. Better: auto-retry only individual failed test cases (not the whole job), with a maximum retry count of 2-3, and emit a metric for retry rate per test case. The metric becomes input to a quarantine workflow: tests that retry above a threshold get quarantined out of the blocking suite into a non-blocking nightly job, where they get fixed or deleted.

What is the cost of a 10% flake rate?

At a 10% flake rate, roughly one in ten builds fails for non-code reasons, requiring a rerun. Compute cost rises about 11% (1.11x runs per push). Developer-time cost rises more steeply because each flake interrupts the developer's attention: investigate the failure, decide it is a flake, retry. Conservative estimate per developer: 15-30 minutes/week of flake-investigation overhead. For a 25-dev team at $100/hr that is roughly $2,700-5,400/month in payroll value, plus 11% on the compute bill.

Updated 2026-05-11