
Designing Highly Available Systems: Best Practices & Real‑World Patterns
Learn how to design highly available systems with fault tolerance, redundancy, load balancing, and disaster recovery. Step‑by‑step guides, code samples, and architecture patterns for 99.99% uptime.
Introduction
In today's always‑on world, downtime is not just an inconvenience—it’s a revenue killer and a brand‑damaging event. Highly available (HA) systems are engineered to stay operational despite hardware failures, network glitches, or software bugs. Designing such systems is a blend of architecture principles, operational discipline, and pragmatic coding. This guide walks you through the core concepts, architectural patterns, and practical code snippets you need to build services that meet 99.9% (or higher) uptime SLAs.
What Is High Availability?
High availability means a system can continue to provide its core functionality when one or more of its components fail. It is measured by availability – the percentage of time a service is operational over a given period.
Availability = (Total Time – Downtime) / Total Time × 100%
| SLA Tier | Allowed Downtime per Year | Typical Use‑Case |
|---|---|---|
| 99% | 3.65 days | Internal tools |
| 99.9% | 8.76 hours | SaaS apps |
| 99.99% | 52.6 minutes | Financial services |
| 99.999% | 5.26 minutes | Critical infrastructure |
Achieving the higher tiers requires redundancy, fault isolation, rapid recovery, and continuous monitoring.
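The downtime figures in the table follow directly from the availability formula. The sketch below reproduces them, assuming a 365-day year; the function name allowedDowntimeMinutes is illustrative and not taken from any library.

```javascript
// Allowed downtime per year for a given availability target.
// Assumption: a 365-day year (525,600 minutes).
const MINUTES_PER_YEAR = 365 * 24 * 60;

function allowedDowntimeMinutes(availabilityPercent) {
  if (availabilityPercent < 0 || availabilityPercent > 100) {
    throw new RangeError('availability must be between 0 and 100');
  }
  // Downtime fraction is (1 - availability), scaled to minutes per year.
  return (1 - availabilityPercent / 100) * MINUTES_PER_YEAR;
}

for (const sla of [99, 99.9, 99.99, 99.999]) {
  console.log(`${sla}% -> ${allowedDowntimeMinutes(sla).toFixed(2)} minutes/year`);
}
```

Dividing the result by 60 or 1440 gives the hours and days shown in the table.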
Core Principles of HA Design
1. Redundancy
Duplicate critical components (servers, databases, network paths) so that a failure of one does not impact the whole system. Redundancy can be active‑active (both instances serve traffic) or active‑passive (standby takes over).
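To see why redundancy pays off, it can help to estimate combined availability. The sketch below assumes component failures are independent, which is optimistic in practice because shared power, networks, and deployments tend to correlate failures. The function names are illustrative.

```javascript
// Combined availability of n redundant instances where any one instance
// is enough to serve traffic. Assumption: failures are independent.
function parallelAvailability(single, n) {
  return 1 - Math.pow(1 - single, n);
}

// Availability of a chain in which every component must be up.
function serialAvailability(components) {
  return components.reduce((acc, a) => acc * a, 1);
}

// Two redundant instances at 99% each: about 0.9999.
console.log(parallelAvailability(0.99, 2));

// Three dependent tiers at 99.9% each: about 0.997.
console.log(serialAvailability([0.999, 0.999, 0.999]));
```

The second result shows the flip side: every non-redundant tier in the request path lowers the overall figure.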
2. Fault Isolation
Limit the blast radius of a failure. Use micro‑services, circuit breakers, and bounded contexts so that a bug in one service doesn’t cascade.
3. Statelessness
Stateless services can be scaled horizontally and replaced instantly. Persist state in external stores (databases, caches) that are themselves HA.
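A small sketch of what statelessness looks like in code. Here a Map stands in for an external, highly available store such as Redis; that substitution and all names are assumptions made to keep the example self-contained.

```javascript
// Assumption: in production this would be a network client for an HA store,
// not an in-process Map.
const sessionStore = new Map();

// The handler keeps no per-user state in process memory, so any instance
// can serve any request and instances can be replaced freely.
function makeInstance(name, store) {
  return function handle(sessionId) {
    const visits = (store.get(sessionId) || 0) + 1;
    store.set(sessionId, visits);
    return { servedBy: name, visits };
  };
}

const instanceA = makeInstance('a', sessionStore);
const instanceB = makeInstance('b', sessionStore);

instanceA('user-1');               // first request lands on instance a
console.log(instanceB('user-1'));  // instance b continues the same session
```

With a real shared store the read-then-write above would not be atomic; an atomic increment in the store itself would likely be the safer choice.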
4. Graceful Degradation
When capacity is low, degrade non‑essential features rather than failing completely. Example: serve cached pages while the database recovers.
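The cached-pages example could be sketched as a wrapper that falls back to the last known good value. The loader passed in is a hypothetical async function (for example, a database query), and the in-memory Map is an assumption standing in for a real cache.

```javascript
// Serve the last known good value when the primary source fails.
function withStaleFallback(loadFresh, cache = new Map()) {
  return async function get(key) {
    try {
      const value = await loadFresh(key);
      cache.set(key, value);                 // refresh the cache on success
      return { value, stale: false };
    } catch (err) {
      if (cache.has(key)) {
        // Degraded but usable: return the cached copy and flag it as stale.
        return { value: cache.get(key), stale: true };
      }
      throw err;                             // nothing cached: surface the failure
    }
  };
}
```

Returning a stale flag lets the caller decide whether to show a notice or hide features that depend on fresh data.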
5. Automated Recovery
Human‑in‑the‑loop recovery adds latency. Implement health checks, auto‑restart, and self‑healing orchestration (e.g., Kubernetes) to bring services back online within seconds.
Architectural Patterns for HA
2‑Tier vs. Multi‑Tier
A classic 2‑tier app (web + DB) can be HA‑enabled, but a multi‑tier architecture (load balancer → API layer → services → data layer) offers more granular redundancy and scaling.
Load Balancing
Distribute traffic across multiple instances. Choose a Layer‑4 (TCP) or Layer‑7 (HTTP) balancer based on protocol needs.
Example: NGINX Load Balancer
http {
    upstream app_servers {
        server app1.example.com max_fails=3 fail_timeout=30s;
        server app2.example.com max_fails=3 fail_timeout=30s;
        server app3.example.com max_fails=3 fail_timeout=30s;
    }
    server {
        listen 80;
        location / {
            proxy_pass http://app_servers;
            proxy_next_upstream error timeout invalid_header http_500 http_502 http_503 http_504;
        }
    }
}
The max_fails and fail_timeout directives implement passive health checks: after 3 failed attempts within 30 seconds, NGINX marks the server unavailable for the next 30 seconds and routes around it.
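To make the idea concrete, here is a simplified model of round-robin selection with passive failure tracking. It is loosely inspired by max_fails and fail_timeout but is not NGINX's actual algorithm; the class name, option names, and the injectable clock are all assumptions for illustration.

```javascript
class Upstream {
  constructor(servers, { maxFails = 3, failTimeoutMs = 30000, now = Date.now } = {}) {
    this.servers = servers.map((name) => ({ name, fails: 0, firstFailAt: 0, downUntil: 0 }));
    this.maxFails = maxFails;
    this.failTimeoutMs = failTimeoutMs;
    this.now = now;          // injectable clock, useful for tests
    this.cursor = 0;
  }

  // Return the next server that is not currently marked down, or null.
  pick() {
    for (let i = 0; i < this.servers.length; i++) {
      const server = this.servers[this.cursor];
      this.cursor = (this.cursor + 1) % this.servers.length;
      if (this.now() >= server.downUntil) return server.name;
    }
    return null;
  }

  // Record a failed attempt. After maxFails failures inside the
  // failTimeoutMs window, the server is skipped for failTimeoutMs.
  reportFailure(name) {
    const server = this.servers.find((s) => s.name === name);
    if (!server) return;
    const t = this.now();
    if (t - server.firstFailAt > this.failTimeoutMs) {
      server.fails = 0;      // previous failures are too old to count
      server.firstFailAt = t;
    }
    server.fails += 1;
    if (server.fails >= this.maxFails) {
      server.downUntil = t + this.failTimeoutMs;
      server.fails = 0;
    }
  }
}
```

A caller would invoke pick() per request and reportFailure() whenever a proxied attempt errors or times out.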
Active‑Active Replication
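Both replicas serve read/write traffic. This requires a way to keep concurrent writes consistent, such as conflict‑free replicated data types (CRDTs) or a distributed database that coordinates writes across nodes (e.g., CockroachDB). It eliminates a single point of failure but adds complexity.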
Active‑Passive (Hot Standby)
A primary handles traffic; a secondary stays in sync via streaming replication. Failover is triggered by health checks. Simpler to implement with traditional RDBMS (PostgreSQL streaming replication, MySQL Group Replication).
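The health-check-driven failover decision might look like the sketch below. It only models the decision logic; the class name, node names, and threshold are hypothetical, and actually promoting a database replica is out of scope here.

```javascript
// Promote the standby after several consecutive failed health checks of the
// active node. Requiring more than one failure avoids failing over on a
// single dropped probe.
class FailoverMonitor {
  constructor({ primary, standby, failureThreshold = 3 }) {
    this.active = primary;
    this.standby = standby;
    this.failureThreshold = failureThreshold;
    this.consecutiveFailures = 0;
  }

  // Feed in the result of each periodic health check; returns the node
  // that should currently receive traffic.
  recordCheck(healthy) {
    if (healthy) {
      this.consecutiveFailures = 0;
      return this.active;
    }
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.failureThreshold && this.standby) {
      this.active = this.standby;   // promote the standby
      this.standby = null;          // no spare until one is re-provisioned
      this.consecutiveFailures = 0;
    }
    return this.active;
  }
}
```

In a real deployment the old primary also has to be fenced off before promotion, otherwise both nodes may accept writes (split brain).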
Circuit Breaker
Protect downstream services from overload. When failures exceed a threshold, the breaker trips and returns a fallback response.
Example: JavaScript Circuit Breaker (using opossum)
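const CircuitBreaker = require('opossum');

// Requires Node 18+ for the global fetch API.
async function fetchUser(id) {
  const res = await fetch(`https://api.example.com/users/${id}`);
  // fetch only rejects on network errors, so treat HTTP errors as failures too.
  if (!res.ok) throw new Error(`Request failed with status ${res.status}`);
  return res.json();
}

const breaker = new CircuitBreaker(fetchUser, {
  timeout: 3000,                 // fail calls that take longer than 3 s
  errorThresholdPercentage: 50,  // open the circuit when 50% of calls fail
  resetTimeout: 10000,           // allow a trial call (half-open) after 10 s
});

breaker.fallback(() => ({ id: null, name: 'Anonymous' }));
breaker.fire(123).then(console.log).catch(console.error);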
When the API is slow or failing, the circuit breaker quickly returns the fallback instead of queuing up requests.
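For readers who want to see the state machine behind a breaker, here is a minimal synchronous sketch of the closed, open, and half-open states. It is a simplification and not how opossum is implemented: it trips on consecutive failures rather than an error percentage over a rolling window, and the class and option names are made up for this example.

```javascript
class SimpleBreaker {
  constructor(action, fallback, { failureThreshold = 3, resetTimeoutMs = 10000, now = Date.now } = {}) {
    this.action = action;
    this.fallback = fallback;
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;              // injectable clock, useful for tests
    this.state = 'closed';
    this.failures = 0;
    this.openedAt = 0;
  }

  fire(...args) {
    if (this.state === 'open') {
      if (this.now() - this.openedAt < this.resetTimeoutMs) {
        return this.fallback(...args);   // fail fast without calling the action
      }
      this.state = 'half-open';          // reset timeout elapsed: allow a trial call
    }
    try {
      const result = this.action(...args);
      this.state = 'closed';             // success closes the circuit
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures += 1;
      // A failed trial call, or too many consecutive failures, opens the circuit.
      if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
        this.state = 'open';
        this.openedAt = this.now();
      }
      return this.fallback(...args);
    }
  }
}
```

While the circuit is open the wrapped action is never invoked, which is what gives the struggling dependency room to recover.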
Data Layer HA Strategies
1. Database Replication
- Primary‑Replica: Write to the primary, read from replicas. Replicas scale read traffic and serve as failover targets.
- Multi‑Master: Write to any node; conflict resolution required.
PostgreSQL Streaming Replication Example
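# On primary (pg_hba.conf must allow replication connections for the "replica" role)
postgres -D /var/lib/postgresql/data &
psql -c "CREATE ROLE replica LOGIN REPLICATION PASSWORD 'replica_pass';"
# Create replication slot
psql -c "SELECT * FROM pg_create_physical_replication_slot('replica_slot');"
# On replica: -R writes standby.signal and primary_conninfo (PostgreSQL 12+),
# so the server starts as a standby instead of an independent primary
pg_basebackup -h primary.example.com -D /var/lib/postgresql/data -U replica -P -R --slot=replica_slot --wal-method=stream
postgres -D /var/lib/postgresql/data &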
2. Sharding
Distribute data across multiple nodes so that a node failure affects only a subset of the data rather than the whole dataset. Pair sharding with a replica set for each shard.
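A minimal sketch of routing a key to a shard, where each shard has its own primary and replica. The hostnames and function names are hypothetical, and simple modulo hashing is just one possible placement scheme.

```javascript
const crypto = require('crypto');

// Map a key to one of shardCount shards using a stable hash, so the same
// key always lands on the same shard.
function shardFor(key, shardCount) {
  const digest = crypto.createHash('sha256').update(String(key)).digest();
  return digest.readUInt32BE(0) % shardCount;
}

// Each shard is its own small replica set (hostnames are made up).
const shards = [
  { primary: 'pg-shard0-primary', replicas: ['pg-shard0-replica'] },
  { primary: 'pg-shard1-primary', replicas: ['pg-shard1-replica'] },
  { primary: 'pg-shard2-primary', replicas: ['pg-shard2-replica'] },
];

function primaryFor(key) {
  return shards[shardFor(key, shards.length)].primary;
}

console.log(primaryFor('customer:42'));
```

One trade-off to keep in mind: with modulo hashing, changing the number of shards remaps most keys, which is why consistent hashing is often preferred when shards are added or removed frequently.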
3. Distributed Caches
Use HA caches like Redis Sentinel or Cluster mode. Sentinel monitors master health and promotes a replica automatically.
Redis Sentinel Config
sentinel monitor mymaster 10.0.0.1 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000
sentinel parallel-syncs mymaster 1
Network & Infrastructure Redundancy
- Multiple Availability Zones (AZs) – Deploy instances across AZs within a cloud region. AZs are isolated power and network domains.
- Multi‑Region Deployment – For disaster recovery, replicate data to a secondary region and use DNS failover (e.g., Route 53 health checks).
- Anycast DNS – Serve DNS from many edge locations to reduce latency and improve resilience.
Monitoring, Alerting, and Chaos Engineering
Health Checks
- Liveness – Is the process running?
- Readiness – Can it serve traffic?
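- Startup – Has the application finished initializing? Liveness and readiness checks are held off until it passes, which protects slow‑starting services during warm‑up.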
Kubernetes example:
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
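On the application side, the two endpoints probed above could be served roughly as follows. This is a sketch using Node's built-in http module; the state object and its field names are assumptions standing in for real dependency checks.

```javascript
const http = require('http');

// Decide the HTTP status for a probe path.
function probeStatus(path, state) {
  // Liveness: the process is up and able to answer.
  if (path === '/healthz') return 200;
  // Readiness: only accept traffic when dependencies are reachable.
  if (path === '/ready') return state.dbConnected && state.cacheConnected ? 200 : 503;
  return 404;
}

function createProbeServer(state) {
  return http.createServer((req, res) => {
    res.statusCode = probeStatus(req.url, state);
    res.end();
  });
}

// To serve the probes on the port used in the manifest:
// createProbeServer({ dbConnected: true, cacheConnected: true }).listen(8080);
```

Keeping the liveness path free of dependency checks means a database outage takes pods out of rotation via readiness instead of triggering restarts.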
Metrics & Alerting
Collect latency percentiles, error rates, CPU/memory, and queue depth. Use Prometheus + Alertmanager or cloud‑native equivalents.
# Prometheus rule for 5xx error spike
- alert: HighHttp5xxRate
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "5xx error rate > 5%"
    description: "Investigate upstream service or database latency."
Chaos Engineering
Validate HA by intentionally injecting failures (e.g., Netflix Chaos Monkey, Gremlin). Test:
- Instance termination
- Network latency spikes
- AZ outage
- Database replication lag
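Dedicated tools inject faults at the infrastructure level, but the same idea can be tried in application code. The sketch below wraps a function so that a fraction of calls fail; the function name, option names, and the injectable random source are assumptions chosen to keep experiments reproducible.

```javascript
// Wrap a function so that roughly failureRate of calls throw an
// injected error before reaching the real implementation.
function withFaultInjection(fn, { failureRate = 0.1, random = Math.random } = {}) {
  return function chaotic(...args) {
    if (random() < failureRate) {
      throw new Error('injected failure');
    }
    return fn(...args);
  };
}

// Example: make a lookup fail about 20% of the time in a test environment.
const lookup = (id) => ({ id, name: 'example' });
const flakyLookup = withFaultInjection(lookup, { failureRate: 0.2 });
```

Running callers against flakyLookup is a cheap way to check that retries, fallbacks, and alerts behave as expected before trying larger experiments such as AZ outages.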
Step‑by‑Step Blueprint: Building an HA REST API
- Provision Infrastructure
  - 2 × VPC subnets in different AZs.
  - Auto‑Scaling Groups (ASG) with minimum 2 instances per AZ.
  - Application Load Balancer (ALB) spanning AZs.
- Deploy Stateless API Service
  - Containerize with Docker.
  - Use Kubernetes Deployment with replicas: 4 and podAntiAffinity to spread pods.
- Add Data Layer
  - Primary‑Replica PostgreSQL cluster (managed service) with automatic failover.
  - Redis Cluster (3 master + replicas) for session cache.
- Implement Circuit Breaker & Retries
  - Wrap DB calls with exponential back‑off.
- Configure Health Checks
  - /ready returns 200 only if DB & Redis connections succeed; keep /healthz (liveness) free of dependency checks so a database outage does not restart healthy pods.
- Set Up Monitoring
  - Export metrics via /metrics (Prometheus format).
  - Dashboard with Grafana showing request latency, DB replication lag, pod restarts.
- Enable Disaster Recovery
  - Daily snapshots of DB to a secondary region.
  - DNS failover record pointing to a standby ALB in the DR region.
- Run Chaos Tests
  - Terminate a pod, verify traffic reroutes.
  - Simulate AZ loss, ensure ALB redirects to remaining AZ.
  - Introduce 200 ms latency on DB, verify the circuit breaker trips (this requires a breaker timeout below the injected latency).
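The step about wrapping DB calls with exponential back-off can be sketched as below. The function names, default delays, and the use of full jitter are assumptions for illustration; the operation passed in is any async function, such as a database query.

```javascript
// Upper bound for the delay before retry number `attempt` (0-based):
// base * 2^attempt, capped at capMs.
function backoffDelayMs(attempt, baseMs = 100, capMs = 5000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry an async operation, waiting a random time up to the back-off bound
// between attempts (full jitter) so that clients do not retry in lockstep.
async function retryWithBackoff(operation, { retries = 4, baseMs = 100, capMs = 5000, wait = sleep } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      if (attempt >= retries) throw err;   // give up after the last retry
      const delay = Math.random() * backoffDelayMs(attempt, baseMs, capMs);
      await wait(delay);
    }
  }
}
```

Retries suit transient errors and idempotent operations; placing them inside a circuit breaker keeps them from piling load onto a dependency that is already down.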
Common Pitfalls & How to Avoid Them
| Pitfall | Impact | Remedy |
|---|---|---|
| Single‑point load balancer | Entire service down if LB fails | Use cloud‑managed LB with regional redundancy |
| Storing session state locally | Session loss on pod restart | External session store (Redis) or JWT tokens |
| Ignoring DNS TTL | Slow failover | Set low TTL (30‑60 s) for critical records |
| Over‑provisioning without testing | Cost waste, false confidence | Conduct regular chaos experiments |
| Synchronous cross‑region calls | Latency spikes, cascading failures | Asynchronous queues or read‑only replicas |
Conclusion
Designing highly available systems is not a one‑time checklist; it’s an ongoing discipline that blends redundant architecture, stateless design, automated recovery, and continuous validation. By applying the principles, patterns, and code examples in this guide—load balancing, active‑active replication, circuit breakers, robust monitoring, and chaos engineering—you can confidently aim for 99.99%+ uptime and protect your users from the costly effects of downtime.
Remember: Availability is a product of good design and diligent operations. Keep testing, keep measuring, and iterate your architecture as traffic grows and failure modes evolve.