reliability
Error budget
Error budget — the slice of allowed unavailability under an SLO (e.g. 99.9 % over 30 days = ~43 minutes of allowed downtime).
Definition
The error budget is (1 − SLO target) × time window. It is the absolute amount of unavailability (or latency violation, error-rate violation, …) allowed before the SLO is breached. Its complement is the uptime the SLO requires (SLO target × time window). Operationally, what matters is how fast the budget is being consumed, not only how much of it is left.
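As a rough illustration of the formula above, the following Python sketch computes the budget in minutes. The function name and the choice of minutes as the unit are assumptions made for this example; they are not part of any Maxoperf API.

```python
def error_budget_minutes(slo_target: float, window_days: float) -> float:
    """Allowed unavailability in minutes: (1 - SLO target) x time window.

    slo_target is a fraction (0.999 for 99.9 %), window_days the SLO window.
    Hypothetical helper for illustration only.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes
```

For example, error_budget_minutes(0.999, 30) gives about 43.2 minutes and error_budget_minutes(0.9995, 30) about 21.6 minutes, up to floating-point rounding.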
In Maxoperf
Use error budgets to decide whether a performance regression is acceptable. If a load test shows latency or error rates that would consume the budget too quickly, the release should pause even if average response time still looks healthy.
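One way to picture such a release gate is sketched below in Python. It assumes the load test yields per-request latency and success data, and that a request counts against the budget when it fails or exceeds a latency threshold. The Request type, the function names, and the max_burn_rate default are all hypothetical; this is not how Maxoperf itself implements the check.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    """One load-test sample (hypothetical shape, for illustration)."""
    latency_ms: float
    ok: bool


def bad_fraction(requests: Sequence[Request], latency_threshold_ms: float) -> float:
    """Fraction of requests that failed or were slower than the threshold."""
    if not requests:
        raise ValueError("requests must not be empty")
    bad = sum(1 for r in requests if not r.ok or r.latency_ms > latency_threshold_ms)
    return bad / len(requests)


def should_pause_release(
    requests: Sequence[Request],
    slo_target: float,
    latency_threshold_ms: float,
    max_burn_rate: float = 1.0,  # assumed policy: pause above this burn rate
) -> bool:
    """True if the observed bad fraction would burn budget faster than allowed."""
    allowed_bad_fraction = 1.0 - slo_target
    burn_rate = bad_fraction(requests, latency_threshold_ms) / allowed_bad_fraction
    return burn_rate > max_burn_rate
```

Consider 990 successful requests at 100 ms and 10 successful requests at 900 ms. The mean latency is 108 ms, which looks healthy, yet with a 500 ms threshold 1 % of requests are bad. Against a 99.9 % target that is roughly ten times the allowed rate, so should_pause_release returns True.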
Common pitfalls
- Picking a 99.99 % target without the budget to back it up: over 30 days that allows only ~4.3 minutes, so every minute of unavailability consumes roughly 23 % of the monthly budget.
- Treating the budget as something to preserve untouched rather than as an allowance to spend on releases, planned maintenance, and migrations.
FAQ
How is the error budget calculated?
(1 − SLO target) × window. 99.9 % over 30 days = ~43 min; 99.95 % over 30 days = ~22 min.
What is a burn-rate alert?
An alert that fires when the budget is being consumed faster than the window allows. For example, consuming an hour's worth of budget in 15 minutes (a burn rate of 4) is a critical pager event even if the dashboard still says "99.9 %".
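The burn rate behind such an alert can be sketched in Python as the budget consumed in an interval divided by the budget that interval is entitled to. The function names below are hypothetical, and the paging threshold is left out because it is a policy choice the text does not specify.

```python
def burn_rate(consumed_minutes: float, interval_minutes: float, slo_target: float) -> float:
    """Budget consumed in an interval, relative to that interval's fair share.

    A value of 1.0 means the budget lasts exactly the SLO window;
    values above 1.0 mean it runs out early. Hypothetical helper.
    """
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    allowed_minutes = (1.0 - slo_target) * interval_minutes
    return consumed_minutes / allowed_minutes


def days_to_exhaustion(rate: float, window_days: float) -> float:
    """Days until a full budget is spent if the given burn rate is sustained."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return window_days / rate
```

At 99.9 %, one hour's share of the budget is 0.06 minutes (3.6 seconds) of unavailability. Spending that within 15 minutes gives burn_rate(0.06, 15, 0.999) of about 4, and days_to_exhaustion(4.0, 30) shows a full 30-day budget would be gone in 7.5 days at that pace.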