
Humans of Reliability

Technology


Episodes

The Golden Hour: Why the First 15 Minutes of an Incident Decide Everything w/ Gandhi M. N. Kumar (Twilio)

28 Apr 2026

Contributed by Lukas

Most incident response advice focuses on tools, alerts, and post-mortems. Gandhi Mathi Nathan Kumar, Principal Incident Commander at Twilio, with 14 y...

From 600 to 6,000: Federating Incident Response w/ Cliff Snyder (ex-LinkedIn SRE)

22 Apr 2026

Contributed by Lukas

A centralized SRE team of 600 engineers as the first line of defense for every incident works - until the business asks you to spread that responsibil...

AI Didn't Change the Game, It Just Exposed Your Bottlenecks w/ Ganesh Datta (CTO, Cortex)

09 Apr 2026

Contributed by Lukas

Every engineering org says they want to improve reliability — but most can't even agree on what "good" looks like. Ganesh Datta, Co-F...

Fear, Identity & Flaky Tests: AI in Reliability w/ Dana Lawson (CTO, Netlify)

31 Mar 2026

Contributed by Lukas

The self-healing systems that SREs have dreamed about for a decade aren't a distant promise anymore — they're already being built, and the...

The Incident You Never Had: Deterministic Simulations w/ Will Wilson (Antithesis CEO)

17 Mar 2026

Contributed by Lukas

Most reliability engineering happens after something breaks. Will Wilson thinks that's the wrong place to be. As co-founder and CEO of Antithesis...

Burnout Doesn't Ask Permission: Recognizing, Recovering, and Rebuilding w/ Stephen Townsend

04 Mar 2026

Contributed by Lukas

Burnout doesn't announce itself. For Stephen Townsend, SRE team lead and host of the Slight Reliability podcast, it crept in over months of mount...

Code Is Cheap, Reliability Isn’t: Owning Production in the AI era w/ Swizec Teller

16 Feb 2026

Contributed by Lukas

Code has never been easier to write. With AI copilots and agentic coding tools, spinning up features feels almost effortless. But production systems d...

Democratizing Reliability: Empowering Non-Devs with Dileshni Jayasinghe (commonsku)

14 Jan 2026

Contributed by Lukas

Many companies don’t invest in incident management until something goes wrong. commonsku took a different path. In this episode of Humans of Reliabil...

99%+ Accuracy on a Moving Target: Model Deprecation and Reliability with Tomás Hernando Koffman (Not Diamond)

22 Dec 2025

Contributed by Lukas

Shipping systems powered by LLMs would be hard enough if the models stayed the same. But in reality, they don’t. Models get updated and deprecated a...

The Reality of GenAI in Production with Eduardo Ordax (AWS)

12 Dec 2025

Contributed by Lukas

GenAI demos are easy. Production is where everything breaks. In this episode, Eduardo Ordax, Principal GTM GenAI at AWS, breaks down what actually sto...

It’s Never Different This Time: LLM Reliability Without the Hype with Julien Simon

19 Nov 2025

Contributed by Lukas

In this episode, Julien Simon, longtime voice in the open-source ML world, reminds us that even in the era of GenAI, reliability fundamentals haven’...

You Can’t Fix What You Don’t Measure: Observability in the Age of AI with Conor Bronsdon

05 Nov 2025

Contributed by Lukas

Only 50% of companies monitor their ML systems. Building observability for AI is not simple: it goes beyond 200 OK pings. In this episode, Sylvain Kal...

The End of “Good Code”? AI, Throughput, and Reliability with CircleCI CTO Rob Zuber

10 Sep 2025

Contributed by Lukas

Is “good code” still the right measure of engineering success in an AI-driven world? In this episode of Humans of Reliability, Rob Zuber, CircleCI...

Frontline Reliability: Protecting User Journeys with SLOs with Shery Brauner (Razor, ex-Zalando)

20 Aug 2025

Contributed by Lukas

What does it really take to move from firefighting incidents to building reliability at scale? In this episode of Humans of Reliability, Shery Brauner...

Balancing Reliability at the Crypto-Finance Frontier with Brian Shaw (Uphold)

03 Jul 2025

Contributed by Lukas

Sylvain Kalache sits down with Brian Shaw, Senior Engineering Leader at Uphold, to explore the reliability challenges that arise when operating at the...

Command Under Pressure: David Owczarek on Incident Leadership and Human-Centered Reliability

17 Jun 2025

Contributed by Lukas

Incident response is as much about people as it is about systems. In this episode, David Owczarek, a veteran engineering leader and seasoned incident com...

AI at the Frontlines of Healthcare Reliability with Ryan Lockard (CVS Health)

30 May 2025

Contributed by Lukas

AI is transforming reliability work—from reactive firefighting to proactive engineering. In this episode, Ryan Lockard, VP of Platform Engineering a...

Trust Is the Product: Building Reliable Billing in the AI Era with Cosmo Wolfe (Metronome)

26 May 2025

Contributed by Lukas

In this episode, we sit down with Cosmo Wolfe, Head of Technology at Metronome, to unpack how reliability, trust, and architecture intersect in one of...

The Golden Path to Nowhere: When Platforms Undermine Reliability with Chase Roberts (Northflank)

14 May 2025

Contributed by Lukas

Internal platforms promise speed, consistency, and scale — but what happens when they become a distraction? In this episode, Chase Roberts, COO at N...

AI can boost developer productivity, if used right, with Justin Reock, Deputy CTO at DX

30 Apr 2025

Contributed by Lukas

In this episode of Humans of Reliability, we sit down with Justin Reock, Deputy CTO at DX, to unpack the real impact of generative AI on developer pro...

Why Reliability in the AI Era Starts with the Network with Marino Wijay

17 Apr 2025

Contributed by Lukas

In this episode, we explore how networking has shaped reliability as we know it. Marino Wijay, cloud networking expert and Staff Solutions Architect at...

Metrics That Matter: Measuring Developer Productivity in the AI Era

09 Apr 2025

Contributed by Lukas

In this episode of Humans of Reliability, Ryan McDonald is joined by Mark Quigley, Head of Platform Engineering at 90, for a conversation that cuts th...

Are AI and Platforms Making SRE Obsolete? With Kaspar von Grünberg, Humanitec’s CEO

24 Mar 2025

Contributed by Lukas

Last year, over 89% of companies claimed to have adopted platform engineering. And, in the past month, LLMs have been disrupting how we think about so...

Scientific Incident Management with Dan Slimmon

14 Mar 2025

Contributed by Lukas

Dan Slimmon is an incident management veteran who's worked at Etsy, HashiCorp, and now leads consulting and training on pragmatic, non-bureaucrat...

How AI broke serverless and what to do about it with Vercel’s Mariano Fernández Cocirio

06 Mar 2025

Contributed by Lukas

Mariano, Staff Product Manager at Vercel, explains why serverless architectures are hitting unexpected limits—they’re too fast. The industry has ...

I Want My Shoes Fast! Observability, SRE Burnout, and OTel with Dynatrace’s Adriana Villela

27 Feb 2025

Contributed by Lukas

In this episode, we sit down with Adriana Villela, Principal DevRel at Dynatrace and OpenTelemetry contributor, to break down how observability impacts...

AI in Production with GitHub’s Sean Goedecke

18 Feb 2025

Contributed by Lukas

In this episode, we sit down with Sean Goedecke, Staff Software Engineer at GitHub, to discuss where LLMs fit into real-world development. Sean share...

The Reliability Diagnosis: Google’s Steve McGhee on Debugging and Incident Response

10 Feb 2025

Contributed by Lukas

In this episode of Humans of Reliability, we sit down with Steve McGhee, Reliability Advocate at Google, to discuss his journey from early SRE work to...

No CS Degree, No Problem: Building a Career in Tech Leadership

05 Feb 2025

Contributed by Lukas

What does it take to lead service delivery at a company experiencing massive growth? Hannah Hammonds, Service Delivery Lead at Prolific, shares her j...

Beyond SLOs: How an ex-Google SRE scaled reliability at the largest e-commerce in the Nordics

03 Feb 2025

Contributed by Lukas

What happens when a Google-trained SRE joins a fast-moving e-commerce company? Gastón Rial Saibene, SRE Lead at Boozt.com, joins Humans of Reliabili...

The Domino Effect of Outages with Nuno Tomás, Founder of isDown.app

24 Jan 2025

Contributed by Lukas

🎙️ Humans of Reliability: Keeping systems up and the lights on isn’t just about technology—it’s about the people behind it. In this episode...