Humans of Reliability
Episodes
The Golden Hour: Why the First 15 Minutes of an Incident Decide Everything w/ Gandhi M. N. Kumar (Twilio)
28 Apr 2026
Contributed by Lukas
Most incident response advice focuses on tools, alerts, and post-mortems. Gandhi Mathi Nathan Kumar, Principal Incident Commander at Twilio, with 14 y...
From 600 to 6,000: Federating Incident Response w/ Cliff Snyder (ex-LinkedIn SRE)
22 Apr 2026
Contributed by Lukas
A centralized SRE team of 600 engineers as the first line of defense for every incident works - until the business asks you to spread that responsibil...
AI Didn't Change the Game, It Just Exposed Your Bottlenecks w/ Ganesh Datta (CTO, Cortex)
09 Apr 2026
Contributed by Lukas
Every engineering org says they want to improve reliability — but most can't even agree on what "good" looks like. Ganesh Datta, Co-F...
Fear, Identity & Flaky Tests: AI in Reliability w/ Dana Lawson (CTO, Netlify)
31 Mar 2026
Contributed by Lukas
The self-healing systems that SREs have dreamed about for a decade aren't a distant promise anymore — they're already being built, and the...
The Incident You Never Had: Deterministic Simulations w/ Will Wilson (Antithesis CEO)
17 Mar 2026
Contributed by Lukas
Most reliability engineering happens after something breaks. Will Wilson thinks that's the wrong place to be. As co-founder and CEO of Antithesis...
Burnout Doesn't Ask Permission: Recognizing, Recovering, and Rebuilding w/ Stephen Townsend
04 Mar 2026
Contributed by Lukas
Burnout doesn't announce itself. For Stephen Townsend, SRE team lead and host of the Slight Reliability podcast, it crept in over months of mount...
Code Is Cheap, Reliability Isn’t: Owning Production in the AI era w/ Swizec Teller
16 Feb 2026
Contributed by Lukas
Code has never been easier to write. With AI copilots and agentic coding tools, spinning up features feels almost effortless. But production systems d...
Democratizing Reliability: Empowering Non-Devs with Dileshni Jayasinghe (commonsku)
14 Jan 2026
Contributed by Lukas
Many companies don’t invest in incident management until something goes wrong. commonsku took a different path. In this episode of Humans of Reliabil...
99%+ Accuracy on a Moving Target: Model Deprecation and Reliability with Tomás Hernando Koffman (Not Diamond)
22 Dec 2025
Contributed by Lukas
Shipping systems powered by LLMs would be hard enough if the models stayed the same. But in reality, they don’t. Models get updated and deprecated a...
The Reality of GenAI in Production with Eduardo Ordax (AWS)
12 Dec 2025
Contributed by Lukas
GenAI demos are easy. Production is where everything breaks. In this episode, Eduardo Ordax, Principal GTM GenAI at AWS, breaks down what actually sto...
It’s Never Different This Time: LLM Reliability Without the Hype with Julien Simon
19 Nov 2025
Contributed by Lukas
In this episode, Julien Simon, longtime voice in the open-source ML world, reminds us that even in the era of GenAI, reliability fundamentals haven’...
You Can’t Fix What You Don’t Measure: Observability in the Age of AI with Conor Bronsdon
05 Nov 2025
Contributed by Lukas
Only 50% of companies monitor their ML systems. Building observability for AI is not simple: it goes beyond 200 OK pings. In this episode, Sylvain Kal...
The End of “Good Code”? AI, Throughput, and Reliability with CircleCI CTO Rob Zuber
10 Sep 2025
Contributed by Lukas
Is “good code” still the right measure of engineering success in an AI-driven world? In this episode of Humans of Reliability, Rob Zuber, CircleCI...
Frontline Reliability: Protecting User Journeys with SLOs with Shery Brauner (Razor, ex-Zalando)
20 Aug 2025
Contributed by Lukas
What does it really take to move from firefighting incidents to building reliability at scale? In this episode of Humans of Reliability, Shery Brauner...
Balancing Reliability at the Crypto-Finance Frontier with Brian Shaw (Uphold)
03 Jul 2025
Contributed by Lukas
Sylvain Kalache sits down with Brian Shaw, Senior Engineering Leader at Uphold, to explore the reliability challenges that arise when operating at the...
Command Under Pressure: David Owczarek on Incident Leadership and Human-Centered Reliability
17 Jun 2025
Contributed by Lukas
Incident response is as much about people as it is about systems. In this episode, David Owczarek, a veteran engineering leader and seasoned incident com...
AI at the Frontlines of Healthcare Reliability with Ryan Lockard (CVS Health)
30 May 2025
Contributed by Lukas
AI is transforming reliability work—from reactive firefighting to proactive engineering. In this episode, Ryan Lockard, VP of Platform Engineering a...
Trust Is the Product: Building Reliable Billing in the AI Era with Cosmo Wolfe (Metronome)
26 May 2025
Contributed by Lukas
In this episode, we sit down with Cosmo Wolfe, Head of Technology at Metronome, to unpack how reliability, trust, and architecture intersect in one of...
The Golden Path to Nowhere: When Platforms Undermine Reliability with Chase Roberts (Northflank)
14 May 2025
Contributed by Lukas
Internal platforms promise speed, consistency, and scale — but what happens when they become a distraction? In this episode, Chase Roberts, COO at N...
AI can boost developer productivity, if used right, with Justin Reock, Deputy CTO at DX
30 Apr 2025
Contributed by Lukas
In this episode of Humans of Reliability, we sit down with Justin Reock, Deputy CTO at DX, to unpack the real impact of generative AI on developer pro...
Why Reliability in the AI Era Starts with the Network with Marino Wijay
17 Apr 2025
Contributed by Lukas
In this episode, we explore how networking has shaped reliability as we know it. Marino Wijay, cloud networking expert and Staff Solutions Architect at...
Metrics That Matter: Measuring Developer Productivity in the AI Era
09 Apr 2025
Contributed by Lukas
In this episode of Humans of Reliability, Ryan McDonald is joined by Mark Quigley, Head of Platform Engineering at 90, for a conversation that cuts th...
Are AI and Platforms Making SRE Obsolete? With Kaspar von Grünberg, Humanitec’s CEO
24 Mar 2025
Contributed by Lukas
Last year, over 89% of companies claimed to have adopted platform engineering. And, in the past month, LLMs have been disrupting how we think about so...
Scientific Incident Management with Dan Slimmon
14 Mar 2025
Contributed by Lukas
Dan Slimmon is an incident management veteran who's worked at Etsy, HashiCorp, and now leads consulting and training on pragmatic, non-bureaucrat...
How AI broke serverless and what to do about it with Vercel’s Mariano Fernández Cocirio
06 Mar 2025
Contributed by Lukas
Mariano, Staff Product Manager at Vercel, explains why serverless architectures are hitting unexpected limits—they’re too fast. The industry has ...
I Want My Shoes Fast! Observability, SRE Burnout, and OTel with Dynatrace’s Adriana Villela
27 Feb 2025
Contributed by Lukas
In this episode, we sit down with Adriana Villela, Principal DevRel at Dynatrace and OpenTelemetry contributor, to break down how observability impacts...
AI in Production with GitHub’s Sean Goedecke
18 Feb 2025
Contributed by Lukas
In this episode, we sit down with Sean Goedecke, Staff Software Engineer at GitHub, to discuss where LLMs fit into real-world development. Sean share...
The Reliability Diagnosis: Google’s Steve McGhee on Debugging and Incident Response
10 Feb 2025
Contributed by Lukas
In this episode of Humans of Reliability, we sit down with Steve McGhee, Reliability Advocate at Google, to discuss his journey from early SRE work to...
No CS Degree, No Problem: Building a Career in Tech Leadership
05 Feb 2025
Contributed by Lukas
What does it take to lead service delivery at a company experiencing massive growth? Hannah Hammonds, Service Delivery Lead at Prolific, shares her j...
Beyond SLOs: How an ex-Google SRE scaled reliability at the largest e-commerce in the Nordics
03 Feb 2025
Contributed by Lukas
What happens when a Google-trained SRE joins a fast-moving e-commerce company? Gastón Rial Saibene, SRE Lead at Boozt.com, joins Humans of Reliabili...
The Domino Effect of Outages with Nuno Tomás, Founder of isDown.app
24 Jan 2025
Contributed by Lukas
🎙️ Humans of Reliability: Keeping systems up and the lights on isn’t just about technology—it’s about the people behind it. In this episode...