Section: Session 2: Outage Avoidance: The First 72 Hours Before Failure | Reliability First: Building Resilient, Secure, and Cost-Efficient Systems | ATRC

Section outline

- Anatomy of preventable outages: configuration drift, missed patches, capacity blind spots
- Proactive monitoring vs. reactive firefighting: defining early-warning thresholds
- Pre-mortems: simulating failures before they happen (Chaos Engineering Lite for SMBs)
- Communication protocols: who to alert—and when—before a system degrades
- Cost of downtime vs. cost of prevention: making the case for reliability investments
- Key roles during near-misses: sysadmin, DevOps, security, and business continuity leads