High Availability
High Availability (HA) describes how much uptime a system is designed to deliver over a period, usually one year.
It is expressed in “nines” (99%, 99.9%, and so on): the more 9’s, the less downtime is allowed.
High Availability – Nines and Downtime
| Availability | Name | Allowed Downtime per Year | Allowed Downtime per Month | Use Case Example |
|---|---|---|---|---|
| 99% | “Two nines” | ~3.65 days | ~7.3 hours | Small apps, dev/test environments |
| 99.9% | “Three nines” | ~8.76 hours | ~43.8 mins | Basic web services |
| 99.99% | “Four nines” | ~52.6 minutes | ~4.38 mins | Payment systems, APIs |
| 99.999% | “Five nines” | ~5.26 minutes | ~26.3 seconds | Medical, Telco, IoT control loops |
| 99.9999% | “Six nines” | ~31.5 seconds | ~2.63 seconds | Mission-critical systems |
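
The figures in the table follow from simple arithmetic: the downtime budget is the length of the period multiplied by the unavailable fraction. The sketch below reproduces that calculation. It assumes a 365-day year and treats a month as one twelfth of a year; other conventions (such as a 30-day month) shift the monthly figures slightly. The function and constant names are illustrative only.

```python
# Sketch: turn an availability percentage into a downtime budget.
# Assumptions: 365-day year, month = year / 12.

SECONDS_PER_YEAR = 365 * 24 * 60 * 60
SECONDS_PER_MONTH = SECONDS_PER_YEAR / 12


def allowed_downtime_seconds(availability_pct: float,
                             period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Return the downtime budget, in seconds, for the given period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    # Unavailable fraction of the period, e.g. 0.1% of a year for 99.9%.
    return period_seconds * (100 - availability_pct) / 100


# Print the yearly and monthly budgets for each tier in the table.
for pct in (99, 99.9, 99.99, 99.999, 99.9999):
    per_year_min = allowed_downtime_seconds(pct) / 60
    per_month_min = allowed_downtime_seconds(pct, SECONDS_PER_MONTH) / 60
    print(f"{pct}% -> {per_year_min:.2f} min/year, {per_month_min:.2f} min/month")
```

Passing a different `period_seconds` (for example, the length of a billing cycle) gives the budget for that period.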
For IoT
- Smart Home Light Bulb → 99% is okay (a few hours of downtime per month is fine)
- Smart Grid Control System → 99.999% is essential (every second counts)
- Medical IoT (e.g., Heart Monitor) → Needs high availability, in the five-nines range listed for medical systems above
Beyond Just Nines
| Concept | Why It Matters in IoT + Cloud |
|---|---|
| Redundancy | Backup sensors, edge nodes, and cloud instances ensure the system keeps running if one fails |
| Failover Systems | Automatically switch to standby components during failure |
| Load Balancing | Spreads traffic across devices or cloud zones to prevent overload |
| Latency vs Availability | A service may be “up” but still slow — availability ≠ performance |
| Disaster Recovery (DR) | Ensures systems and data can recover from outages or disasters |
| Geographic Distribution | Spreading across regions/availability zones improves uptime and resilience |
| SLA (Service Level Agreement) | Understand what cloud vendors promise and what downtime you’re actually allowed |
| Edge Processing | Enables critical operations to continue even if the cloud is unreachable (e.g., AWS Greengrass) |
| Monitoring & Alerting | Detect and respond to failures fast using tools like CloudWatch, Datadog, Prometheus |
| Cost vs HA Tradeoff | Higher availability usually means higher costs — design smart based on use case |
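
To see why redundancy and the cost tradeoff matter, it helps to estimate the availability of a whole chain of components. The sketch below uses the standard approximations: components in series multiply their availabilities, while redundant components fail only when all of them fail. It assumes failures are independent and failover is instantaneous, which real systems only approximate. The function names and example numbers are illustrative.

```python
from math import prod


def serial(*availabilities: float) -> float:
    """All components must be up (e.g. device -> broker -> cloud)."""
    return prod(availabilities)


def redundant(*availabilities: float) -> float:
    """At least one replica must be up (assumes independent failures)."""
    return 1 - prod(1 - a for a in availabilities)


# Three components in a chain, each at three nines (hypothetical numbers).
chain = serial(0.999, 0.999, 0.999)

# Two redundant brokers at two nines each (hypothetical numbers).
broker_pair = redundant(0.99, 0.99)

print(f"chain of three 99.9% parts: {chain:.4%}")
print(f"two redundant 99% brokers:  {broker_pair:.4%}")
```

Under these assumptions, chaining components lowers availability below that of any single part, while duplicating a two-nines component brings the pair to roughly four nines. That gain is what the extra cost of redundancy buys.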
Fun Discussion Pointers
For each system below, decide whether it needs edge computing, fog computing, or the cloud, and if it uses the cloud, how many 9’s it needs.
- How many 9’s do we need for a smart light switch at home?
- How many 9’s do we need for a smart light switch at a bank ATM?
- A temperature sensor on a cold-storage truck is sending data to the cloud.
- You’re designing an IoT wearable for elderly patients that detects falls. What should the design be?
- What happens if the MQTT broker goes down? How would you make it fault-tolerant?
- A weather station publishes sensor data every 15 minutes. Does it need a highly available system?
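
As one possible starting point for the broker-outage and cold-storage-truck questions, the sketch below combines failover with local buffering: the device tries each broker in order and keeps readings locally when none is reachable. It is not tied to any MQTT library. The `Send` callables stand in for hypothetical transports that raise `ConnectionError` on failure, and `FailoverPublisher` is an illustrative name. A real deployment would more likely rely on a clustered broker plus the client library's own reconnect and QoS features.

```python
from collections import deque
from typing import Callable, Deque, List

# Hypothetical transport: sends one message, raises ConnectionError on failure.
Send = Callable[[str], None]


class FailoverPublisher:
    """Tries each broker in order; buffers readings if all are down."""

    def __init__(self, brokers: List[Send], max_buffer: int = 1000) -> None:
        self.brokers = brokers
        # Bounded backlog: when full, the oldest reading is dropped.
        self.buffer: Deque[str] = deque(maxlen=max_buffer)

    def _try_send(self, message: str) -> bool:
        for send in self.brokers:
            try:
                send(message)
                return True
            except ConnectionError:
                continue  # fail over to the next broker
        return False

    def publish(self, message: str) -> None:
        self.buffer.append(message)
        # Flush oldest-first so ordering survives an outage.
        while self.buffer:
            if not self._try_send(self.buffer[0]):
                break  # every broker is down; keep the backlog for later
            self.buffer.popleft()
```

The bounded buffer is a deliberate tradeoff: a truck with limited memory loses its oldest readings during a long outage instead of crashing. Whether that is acceptable depends on the use case being discussed.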