Course: Reliability First: Building Resilient, Secure, and Cost-Efficient Systems

Section outline

General
- Announcements
- Course Outline

Session 1: The System Admin Talent Gap & Operational Resilience
- The growing global shortage of skilled system administrators and its impact on uptime
- Why diverse skill sets (automation, networking, security, cloud) matter in modern sysadmin teams
- Retention challenges in high-pressure infrastructure roles, and how to mitigate burnout
- Leveraging open-source training and commercial resources to standardize skills
- Building internal talent pipelines through mentoring, documentation, and cross-training

Session 2: Outage Avoidance: The First 72 Hours Before Failure
- Anatomy of preventable outages: configuration drift, missed patches, capacity blind spots
- Proactive monitoring vs. reactive firefighting: defining early-warning thresholds
- Pre-mortems: simulating failures before they happen (Chaos Engineering Lite for SMBs)
- Communication protocols: who to alert, and when, before a system degrades
- Cost of downtime vs. cost of prevention: making the case for reliability investments
- Key roles during near-misses: sysadmin, DevOps, security, and business continuity leads

Session 3: The Compliance Mirage in Infrastructure Management
- Why “passing uptime audits” ≠ real resilience (e.g., ticking boxes on backup checks but never testing restores)
- Case studies: compliant systems that failed catastrophically due to overlooked dependencies
- The hidden risk of “it’s always worked this way” thinking in legacy environments
- Moving beyond ISO 27001/ITIL checklists: asking “What breaks if this server dies right now?”
- Cultivating a culture of operational humility: blameless post-mortems, shared runbooks, and continuous improvement

Session 4: The Hidden Costs of Technical & Reliability Debt
- What is reliability debt? (Unpatched OSes, manual deployments, undocumented systems, stale DNS records)
- How reliability debt silently inflates costs: emergency fixes, slower deployments, security gaps
- Calculating TCO of “quick fixes” vs. sustainable automation (Ansible, Terraform, monitoring-as-code)
- Prioritizing modernization: which legacy systems pose the highest risk per dollar spent
- Making the business case: ROI of proactive maintenance, automation, and secure-by-default configurations
