Spot instance CI cost in 2026: how to get 70% off self-hosted runners
Spot instances are cloud computing's overlooked cost lever. AWS, GCP, and Azure all sell unused capacity at 60-90% off on-demand prices, with the catch that they can reclaim the capacity on short notice: 2 minutes on AWS and as little as 30 seconds on GCP and Azure. CI workloads are well suited to spot because they are interrupt-tolerant by design: the orchestration layer (GitHub Actions, GitLab) detects a dropped runner, and the job can be retried on another one. This page covers per-cloud spot pricing, the integration patterns for CI runner pools, and the few job categories that should stay on on-demand.
Headline at a glance (2026)
AWS Spot for compute-optimised x86 families typically delivers a 65-75% discount off on-demand. GCP Spot delivers 60-91%. Azure Spot delivers 60-90%. For self-hosted CI runner pools above 200 hours/month of compute, switching to spot saves $50-500/month with minimal operational risk, provided the orchestration is set up to handle interruption.
AWS Spot pricing (2026 typical)
Spot prices are dynamic, regional, and AZ-specific. The numbers below are typical mid-2026 us-east-1 averages from the AWS Spot pricing page. Always check live pricing before committing to a particular instance family.
| Instance | vCPU / RAM (GiB) | On-demand $/hr | Typical spot $/hr | Discount | CI fit |
|---|---|---|---|---|---|
| t3.medium | 2 / 4 | $0.0416 | $0.012 | 71% | Lint, small unit suites |
| m5.large | 2 / 8 | $0.096 | $0.029 | 70% | General-purpose builds |
| m5.xlarge | 4 / 16 | $0.192 | $0.058 | 70% | Mid-tier integration |
| c5.2xlarge | 8 / 16 | $0.34 | $0.10 | 71% | CPU-bound parallel tests |
| c5.4xlarge | 16 / 32 | $0.68 | $0.17 | 75% | Heavy monorepo, ML |
| c6g.xlarge (Arm) | 4 / 8 | $0.136 | $0.054 | 60% | Arm-native cost-sensitive |
| c7g.2xlarge (Arm) | 8 / 16 | $0.29 | $0.087 | 70% | Newer Arm, lower interruption |
| m6i.xlarge | 4 / 16 | $0.192 | $0.077 | 60% | Newer Intel, balanced workloads |
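The Discount column is simply one minus the ratio of spot to on-demand price. A minimal Python sketch that recomputes it from the table's own figures follows; the `PRICES` list and `discount_pct` helper are illustrative names, and live prices will differ from these typical values.

```python
# (instance, on-demand $/hr, typical spot $/hr), copied from the table above.
PRICES = [
    ("t3.medium", 0.0416, 0.012),
    ("m5.large", 0.096, 0.029),
    ("m5.xlarge", 0.192, 0.058),
    ("c5.2xlarge", 0.34, 0.10),
    ("c5.4xlarge", 0.68, 0.17),
    ("c6g.xlarge", 0.136, 0.054),
    ("c7g.2xlarge", 0.29, 0.087),
    ("m6i.xlarge", 0.192, 0.077),
]


def discount_pct(on_demand: float, spot: float) -> int:
    """Spot discount as a whole-number percentage of the on-demand price."""
    return round((1 - spot / on_demand) * 100)


if __name__ == "__main__":
    for name, on_demand, spot in PRICES:
        print(f"{name:<12} {discount_pct(on_demand, spot)}% off")
```

Swapping in current prices from the AWS Spot pricing page gives a quick check on whether a family still clears the discount you are budgeting for.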
Interruption rates and instance selection
AWS publishes per-family interruption rates in the Spot Instance Advisor. Typical 2026 figures for us-east-1: c5/c6i 5-10% per month, m5/m6i 3-7%, c7g (Graviton) 2-5%; the t3 family tends to see fewer interruptions because demand for burstable compute is lower. Larger instances within a family typically interrupt more often than smaller ones, plausibly because reclaiming them frees larger blocks of capacity.
For CI runner pools, aim for families with an interruption rate under 10%. The cost of an interruption is the wasted partial compute plus the wait time of the retry. Treating the rate pessimistically as a per-build probability, a 5% interruption rate and a 30-minute average build give an expected waste of 5% x 15 minutes (assuming interruption midway) = 0.75 minutes per build, which is small. At a 25% rate the same arithmetic gives 3.75 minutes per build, or 12.5% extra compute; that still leaves most of the spot discount intact, but retry latency and developer wait time start to dominate, and on-demand can become the better choice. Rates this high are rare for CI-suited families.
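The expected-waste arithmetic can be sketched in a few lines of Python. This is a simplified model, not a measurement: it treats the interruption rate as a per-build probability and assumes an interrupted build loses a fixed fraction of its runtime (half, matching the "mid-way" assumption). The function names are illustrative.

```python
def wasted_minutes_per_build(interruption_rate: float,
                             build_minutes: float,
                             wasted_fraction: float = 0.5) -> float:
    """Expected minutes of compute lost per build to spot interruption.

    wasted_fraction is the share of a build's runtime that is thrown away
    when it is interrupted (0.5 = interrupted mid-way on average).
    """
    return interruption_rate * build_minutes * wasted_fraction


def effective_discount(spot_discount: float,
                       interruption_rate: float,
                       wasted_fraction: float = 0.5) -> float:
    """Spot discount that remains after paying for wasted partial runs."""
    overhead = interruption_rate * wasted_fraction
    return 1 - (1 - spot_discount) * (1 + overhead)


if __name__ == "__main__":
    print(wasted_minutes_per_build(0.05, 30))   # the 5% / 30-minute case
    print(effective_discount(0.70, 0.05))       # 70% list discount, 5% rate
```

The model covers compute cost only; the wait time of the retry is a separate cost that depends on how much a delayed build costs your team.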
The retry-on-interrupt pattern
Ephemeral runner architectures (which we covered on the ephemeral runner cost page) handle spot interruption naturally. When a spot instance is reclaimed, the runner pod is killed. GitHub Actions or GitLab CI detects the runner disconnect and marks the job as failed; with retry configured, the job is queued again. The retry lands on a new pod on a new node, which may also be spot.
The integration glue: configure retries to fire on runner disconnect specifically (not on test failure). GitHub Actions: there is no built-in job-level retry key, so the usual approach is automation outside the job, such as a follow-up workflow or a runner-controller hook that re-runs failed jobs (for example with gh run rerun --failed) only when the failure was a lost runner. GitLab CI: retry: { max: 2, when: runner_system_failure }. CircleCI: check the current docs for the equivalent rerun settings in the workflow definition.
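The retry decision itself is small enough to sketch. The Python below mirrors the GitLab setting quoted above (at most two retries, only for runner loss); the `should_retry` helper and the idea of passing in a failure-reason string are assumptions for illustration, not any vendor's API.

```python
# Failure reasons that indicate lost infrastructure rather than a failing test.
# "runner_system_failure" is borrowed from the GitLab setting above; how you
# obtain the reason for a given job is platform-specific and not shown here.
RETRYABLE_REASONS = {"runner_system_failure"}


def should_retry(failure_reason: str, retries_done: int, max_retries: int = 2) -> bool:
    """Retry only infrastructure losses, and only up to max_retries times."""
    return failure_reason in RETRYABLE_REASONS and retries_done < max_retries


if __name__ == "__main__":
    print(should_retry("runner_system_failure", retries_done=0))  # lost runner
    print(should_retry("script_failure", retries_done=0))         # real test failure
```

Keeping the retryable set explicit means a genuine test failure is never hidden behind an automatic re-run.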
Most teams set this up once on the runner controller and forget about it. The interruption rate is low enough that the retry cost is below 5% of total compute, well within the spot discount.
Karpenter spot-first node provisioning
The cleanest spot integration on Kubernetes is Karpenter with a node-pool configuration that prefers spot capacity and falls back to on-demand on spot exhaustion. Karpenter considers every instance type and AZ the NodePool allows and requests spot capacity using a price-capacity-optimized strategy, which favours cheap pools that are also less likely to be interrupted, then provisions the node. When the spot capacity is reclaimed, Karpenter provisions a replacement, possibly on a different family or AZ.
Sample Karpenter NodePool configuration concept: a capacity-type requirement with values [spot, on-demand] and instance families restricted to low-interruption-rate sets. When a NodePool allows both capacity types, Karpenter prioritises spot; weights matter only for ordering multiple NodePools. The Karpenter docs on node pools have the current syntax.
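As a rough illustration of that concept, the Python sketch below builds a NodePool manifest as a dictionary and prints it as JSON, which kubectl accepts alongside YAML. The field names follow the karpenter.sh/v1 NodePool schema as understood at the time of writing and should be checked against the Karpenter docs; the pool name, the EC2NodeClass name "default", and the chosen families are placeholders.

```python
import json


def spot_first_nodepool(name: str, families: list[str], node_class: str = "default") -> dict:
    """Build a NodePool manifest that allows spot and on-demand capacity.

    Assumption: schema per karpenter.sh/v1; verify field names against the
    current Karpenter documentation before applying.
    """
    return {
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    # Placeholder reference to an existing EC2NodeClass.
                    "nodeClassRef": {
                        "group": "karpenter.k8s.aws",
                        "kind": "EC2NodeClass",
                        "name": node_class,
                    },
                    "requirements": [
                        # Allow both capacity types; spot is tried first.
                        {
                            "key": "karpenter.sh/capacity-type",
                            "operator": "In",
                            "values": ["spot", "on-demand"],
                        },
                        # Restrict to families with low interruption rates.
                        {
                            "key": "karpenter.k8s.aws/instance-family",
                            "operator": "In",
                            "values": list(families),
                        },
                    ],
                }
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(spot_first_nodepool("ci-spot", ["m5", "m6i"]), indent=2))
```

Listing several families gives Karpenter more spot pools to choose from, which is what makes the fallback to on-demand rare in practice.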
Worked example: 50-dev team on spot
A 50-dev team running 50,000 build minutes/month (833 hours of compute) on m5.xlarge on-demand: 833 x $0.192 = $160/month. Same workload on spot at $0.058/hour: $48/month. Saving $112/month, or roughly $1,344/year on this single instance family.
Add a margin for retries: assume a 5% interruption rate and, pessimistically, that each interrupted job wastes 25% of its compute. Effective overhead: 5% x 25% = 1.25%. Total spot cost: $48 x 1.0125 = $48.60. The saving is still about $111/month.
Scale this across a typical self-hosted CI runner pool with multiple instance families and the saving compounds. A team paying $800/month in self-hosted compute can typically reduce that to $250-300/month on a well-configured spot-first Karpenter setup, saving $500-550/month or $6,000-6,600/year. The setup cost is a few days of platform-engineering time, so payback is measured in weeks.
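The worked example reduces to one formula: hours times hourly price times one plus the retry overhead. A minimal Python sketch, using the example's own inputs, is below; `monthly_cost` is an illustrative helper, and because it keeps fractional hours (833.33 rather than 833) its results differ from the rounded figures in the text by a few cents.

```python
def monthly_cost(build_minutes: float,
                 price_per_hour: float,
                 interruption_rate: float = 0.0,
                 wasted_fraction: float = 0.0) -> float:
    """Monthly compute cost in dollars, including wasted work from interruptions."""
    hours = build_minutes / 60
    overhead = interruption_rate * wasted_fraction  # e.g. 0.05 * 0.25 = 1.25%
    return hours * price_per_hour * (1 + overhead)


if __name__ == "__main__":
    on_demand = monthly_cost(50_000, 0.192)
    spot = monthly_cost(50_000, 0.058, interruption_rate=0.05, wasted_fraction=0.25)
    print(f"on-demand ${on_demand:.2f}, spot ${spot:.2f}, saving ${on_demand - spot:.2f}")
```

Re-running it with your own build minutes and the current spot price for your instance family gives a first-order estimate before committing to a migration.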
When to keep workloads on on-demand
Three categories deserve on-demand. Release builds: a build that must complete before a release ceremony cannot tolerate retry latency. Pin these to on-demand or use a dedicated reserved-instance pool. Long-running integration tests (over 30 minutes): the probability of interruption rises with duration. A 60-minute test that is interrupted at minute 50 wastes 50 minutes of compute and adds roughly as much wall-clock delay, which can outweigh the spot saving once developer wait time is counted. Jobs with non-idempotent side effects: deploy jobs that mutate production state, jobs that interact with paid third-party APIs, anything where re-running has a cost beyond the compute. Run these on stable on-demand and accept the higher per-hour cost as insurance.
The hybrid pattern: a small on-demand pool for the high-stakes jobs, a larger spot pool for the bulk of PR-check workload. Most teams find 80-90% of CI minutes can safely live on spot, with the remaining 10-20% needing on-demand for reliability or speed reasons. The aggregate saving is still substantial.
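The three on-demand categories and the hybrid split can be expressed as a simple routing rule. The Python sketch below is one way to encode it; the `Job` fields and the `pool_for` helper are hypothetical, and in a real setup the result would map to a runner label or node selector.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    expected_minutes: float
    is_release: bool = False        # must finish before a release deadline
    has_side_effects: bool = False  # deploys, paid third-party APIs, etc.


def pool_for(job: Job, long_job_minutes: float = 30) -> str:
    """Route high-stakes jobs to on-demand and everything else to spot."""
    if job.is_release or job.has_side_effects:
        return "on-demand"
    if job.expected_minutes > long_job_minutes:
        return "on-demand"
    return "spot"


if __name__ == "__main__":
    jobs = [
        Job("pr-unit-tests", 8),
        Job("nightly-integration", 60),
        Job("release-build", 20, is_release=True),
        Job("deploy-prod", 5, has_side_effects=True),
    ]
    for job in jobs:
        print(f"{job.name}: {pool_for(job)}")
```

The 30-minute threshold comes from the text above; teams with lower interruption rates may reasonably raise it.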
Frequently Asked Questions
How much do AWS Spot instances cost?
AWS Spot pricing is dynamic and varies by instance family, region, and availability zone. Typical discounts in 2026: c5.xlarge spot $0.05/hour vs $0.17 on-demand (70% off). m5.large spot $0.029/hour vs $0.096 on-demand (70% off). c6g.large (Arm) spot $0.039/hour vs $0.068 on-demand (43% off). Compute-optimised x86 families consistently hit the highest spot discounts. Check current spot prices via the AWS Spot Instance Pricing page.
What is spot interruption?
AWS reclaims spot capacity with 2 minutes' notice when it needs the capacity back, typically because on-demand demand rises, or when the spot price rises above your maximum price if you set one. The instance is terminated regardless of the running workload. For CI, an in-progress build is killed; the workflow vendor (GitHub, GitLab) detects the runner disconnect, and with retry configured the job is re-run on a different runner. Properly configured CI handles spot interruption with no developer-visible impact beyond a slightly longer wall-clock time for the affected job.
What is the typical spot interruption rate?
AWS publishes spot interruption rates per instance family per region in the Spot Instance Advisor. Typical figures for common families in 2026: c5/c6i x86 compute 5-10% per month, m5/m6i general-purpose 3-7%, c7g Graviton (Arm) 2-5%, with larger instances showing higher rates than smaller ones. Use the Advisor when choosing CI runner instance types: pick families with a rate under 10% and your CI experience is minimally affected.
What CI jobs should not run on spot?
Three categories. Release builds with deadlines: a job that must complete before a release ceremony cannot tolerate a 10-minute restart on spot interruption. Long-running integration tests (over 30 minutes): the probability of interruption rises with duration, and a 60-minute test that interrupts at minute 50 wastes a lot of compute. Jobs with non-idempotent side effects: anything that mutates external state and cannot safely re-run from scratch. Run these on on-demand or stable spot instance types only.
Do GCP and Azure offer spot equivalents?
Yes. GCP offers Preemptible VMs and Spot VMs (Spot is the newer, less-restrictive variant) at 60-91% discounts. GCP Preemptible has a 24-hour maximum runtime; GCP Spot has no time limit. Azure Spot VMs offer up to 90% discounts. Both clouds give shorter notice than AWS, around 30 seconds before preemption or eviction. The cost economics are similar to AWS Spot, and so are the integration patterns: configure your Kubernetes node pools to prefer spot and fall back to on-demand on capacity exhaustion.