Reliability Enablers
Episodes
You (and AI) can't automate reliability away
02 Dec 2025
Contributed by Lukas
What if the hardest part of reliability has nothing to do with tooling or automation? Jennifer Petoff explains why real reliability comes from the hum...
#67 Why the SRE Book Fails Most Orgs — Lessons from a Google Veteran
15 Jul 2025
Contributed by Lukas
A new or growing SRE team. A copy of the book. A company that says it cares about reliability. What happens next? Usually… not much. In this episode,...
#66 - Unpacking 2025 SRE Report’s Damning Findings
01 Jul 2025
Contributed by Lukas
I know it’s already six months into 2025, but we recorded this almost three months ago. I’ve been busy with my foray into the world of tech consul...
#65 - In Critical Systems, 99.9% Isn’t Reliable — It’s a Liability
17 Jun 2025
Contributed by Lukas
Most teams talk about reliability with a margin for error. “What’s our SLO? What’s our budget for failure?” But in the energy sector? There is...
#64 - Using AI to Reduce Observability Costs
28 Jan 2025
Contributed by Lukas
Exploring how to manage observability tool sprawl, reduce costs, and leverage AI to make smarter, data-driven decisions. It's been a hot minute since t...
#63 - Does "Big Observability" Neglect Mobile?
12 Nov 2024
Contributed by Lukas
Andrew Tunall is a product engineering leader focused on pushing the boundaries of reliability with a current focus on mobile observability. Using his...
#62 - Early YouTube SRE shares Modern Reliability Strategy
05 Nov 2024
Contributed by Lukas
Andrew Fong’s take on engineering cuts through the usual role labels, urging teams to start with the problem they’re solving instead of locking in...
#61 Scott Moore on SRE, Performance Engineering, and More
22 Oct 2024
Contributed by Lukas
This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
#60 How to NOT fail in Platform Engineering
01 Oct 2024
Contributed by Lukas
Here’s what we covered: Defining Platform Engineering * Platform engineering: Building compelling internal products to help teams reuse capabilities w...
#59 Who handles monitoring in your team and how?
24 Sep 2024
Contributed by Lukas
Why many copy Google’s monitoring team setup * Google’s Influence. Google played a key role in defining the concept of software reliability. * Succe...
#58 Fixing Monitoring's Bad Signal-to-Noise Ratio
17 Sep 2024
Contributed by Lukas
Monitoring in the software engineering world continues to grapple with poor signal-to-noise ratios. It’s a challenge that’s been around since the ...
#57 How Technical Leads Support Software Reliability
10 Sep 2024
Contributed by Lukas
The question then condenses down to: Can technical leads support reliability work? Yes, they can! Anemari has been a technical lead for years — even...
#56 Resolving DORA Metrics Mistakes
04 Sep 2024
Contributed by Lukas
We're already well into 2024 and it’s sad that people still have enough fuel to complain about various aspects of their engineering life. DORA seem...
#55 3 Uses for Monitoring Data Other Than Alerts and Dashboards
27 Aug 2024
Contributed by Lukas
We’ll explore 3 use cases for monitoring data. They are: * Analyzing long-term trends * Comparing over time or experiment groups * Conducting ad hoc re...
#54 Becoming a Valuable Engineer Without Sacrificing Your Sanity
20 Aug 2024
Contributed by Lukas
Shlomo Bielak is the Head of Engineering (Operational Excellence and Cloud) at Penn Interactive, an interactive gaming company. He’s dedicated much ...
#53 What's Missing in Incident Response Processes?
15 Aug 2024
Contributed by Lukas
Incident response is an increasingly difficult area for organizations. Many teams end up paying a lot of money for incident management solutions. Howe...
Can ITIL Benefit from Site Reliability Engineering?
13 Aug 2024
Contributed by Lukas
According to Vlad Ukis, there are a lot of enterprises around whose IT functions are organized around ITIL. What you use SRE for is something complete...
#52 Navigating Complexity within Incidents
06 Aug 2024
Contributed by Lukas
Sonja Blignaut is a complexity expert. That might not sound relevant to incident response in reliability engineering. But it is! Our systems are becomi...
#51 Whitebox vs Blackbox Monitoring
30 Jul 2024
Contributed by Lukas
Have you got complete monitoring of your software in effect? Are you sure? Google's SREs break monitoring down to white box versus black box monitorin...
#50 Making Better Sense of Observability Data
09 Jul 2024
Contributed by Lukas
Jack Neely is a DevOps observability architect at Palo Alto Networks and has a few interesting ways of extracting value from o11y data. We crammed into...
#49 Alert Fatigue is Still an Issue - Here's How We Fix it
02 Jul 2024
Contributed by Lukas
Alert noise is no joke and neither is the fatigue that results from it. I spoke with Dan Ravenstone who gave a talk at Monitorama about this very topi...
#48 Cutting Down "Toil" aka Manual Work in Software
25 Jun 2024
Contributed by Lukas
Sebastian and I scoured Chapter 5 of the Site Reliability Engineering (2016) book to find nuggets of wisdom on how to reduce toil. We hit the jackpot w...
#47 How to Grow Team Impact Through Learning Culture
18 Jun 2024
Contributed by Lukas
The common refrain after an incident is “We could and should learn from this”. To me, that alludes to the need for a robust learning culture. We mi...
#46 Platform Team Design According to Team Topologies
11 Jun 2024
Contributed by Lukas
I continue my conversation with Manuel Pais, co-author of the seminal Team Topologies book, about team topologies suitable for reliability teams. In thi...
#45 How Team Topologies Can Guide Enabling Teams
04 Jun 2024
Contributed by Lukas
I got the inside word from Manuel Pais, co-author of the seminal Team Topologies book, who explains in a 2-part series 2 of the most relevant team...
#44 - Making SLOs Matter to Stakeholders
30 May 2024
Contributed by Lukas
Bonus episode on SLOs because Sebastian and I felt that we did not cover the why of SLOs or how to make them relevant to stakeholders. This is a public epis...
#43 - SLOs: a Deeper Dive into its Mechanics
28 May 2024
Contributed by Lukas
This episode continues our coverage of Chapter 4 of the Site Reliability Engineering book (2016). In this second part, we take a deeper dive into the ...
#42 - Hitting Software SLA Targets through SLOs and SLIs
21 May 2024
Contributed by Lukas
In this first part of a 2-part coverage, Sebastian Vietz and I work out how to meet SLAs through SLOs and SLIs. This episode covers Chapter 4 of the S...
#41 Curbing High Observability Costs
14 May 2024
Contributed by Lukas
No one wants to get Coinbase’s $65 million observability bill in the future. Sure, observability comes with a necessary cost. But that cost cannot e...
#40 How to Enable Observability for Success
07 May 2024
Contributed by Lukas
Observability is more than a set of technologies. It’s a practice. Timothy Mahoney is no stranger to this practice, enabling many developer teams to...
#39 How Chaos Engineering Helps Reduce Incident Risk
30 Apr 2024
Contributed by Lukas
Chaos Engineering is no longer a nice to have, as Ananth Movva explains in this episode of the SREpath podcast. His experiences with it drove a reduce...
#38 The Real Cost of Software Reliability & Downtime
23 Apr 2024
Contributed by Lukas
This episode covers Chapter 3 of the Site Reliability Engineering book (2016). In this second part, we talk about the costs behind reliability and cho...
#37 An SRE Approach to Managing Technology Risk
16 Apr 2024
Contributed by Lukas
This episode covers Chapter 3 of the Site Reliability Engineering book (2016). In this first part, we talk about embracing risk from the SRE perspecti...
#36 Avoiding Critical Platform Engineering Mistakes
09 Apr 2024
Contributed by Lukas
Platform engineering is replacing SRE and DevOps. Jokes aside, knowing the path to better platforms is key. Abby Bangser is the right person to tell u...
#35 Boosting Your Observability Data's Usability
02 Apr 2024
Contributed by Lukas
The observability (o11y) data revolution is well underway, but are we getting the most from the data that is being collected? Richard Benwell thinks we...
#34 From Cloud to Concrete: Should You Return to On-Prem?
26 Mar 2024
Contributed by Lukas
This episode continues our coverage of Chapter 2 of the Site Reliability Engineering book (2016). We talk about the age-old debate of cloud vs on-prem...
#33 Inside Google's Data Center Design
19 Mar 2024
Contributed by Lukas
This episode covers Chapter 2 of the Site Reliability Engineering book (2016). In this first part, we talk about the intricacies of data center design...
#32 Clarifying Platform Engineering's Role (with Ajay Chankramath) BONUS EP
14 Mar 2024
Contributed by Lukas
Will Platform Engineering replace DevOps or SRE or both? I don’t think this is the case at all. Neither does Ajay Chankramath. He is the Head of Plat...
#31 Introduction to FinOps (with Ajay Chankramath)
12 Mar 2024
Contributed by Lukas
FinOps is on the tip of many tongues in the software space right now, as we try to curb our cloud costs. Ajay Chankramath has given talks on FinOps at...
#30 Clearing Delusions in Observability (with David Caudill)
07 Mar 2024
Contributed by Lukas
Observability is going through interesting times. David Caudill believes that delusions are getting in the way of our success in this area. He's a...
#29 - Reacting to Google's SRE Book 2016 (Chapter 1 Part 2)
27 Feb 2024
Contributed by Lukas
Sebastian and I continue our breakdown of notable passages from Chapter 1 of Google's Site Reliability Engineering (2016) book by Betsy Beyer, Jen...
#28 - Reacting to Google's SRE Book 2016 (Chapter 1 Part 1)
20 Feb 2024
Contributed by Lukas
Sebastian and I got together to react to and discuss 5 passages from Chapter 1 of Google's Site Reliability Engineering book (2016) by Betsy Beyer...
#27 - Growing as a Site Reliability Engineer (Part 3)
13 Feb 2024
Contributed by Lukas
Third and final instalment of the Growing as an SRE series covering practical ideas for planning your career progression. This is a public episode. If ...
#26 - Growing as a Site Reliability Engineer (Part 2)
08 Feb 2024
Contributed by Lukas
In part 1, we covered the first truth - that you don't grow in your career merely through tenure. That was a simple one. Let's explore 2 mor...
#25 - DORA and the Pursuit of Engineering Excellence (with Tim Wheeler)
30 Jan 2024
Contributed by Lukas
DORA metrics are a hot topic among technology executives in all kinds of enterprises. But there's more to engineering culture than solely relying o...
#24 - Growing as a Site Reliability Engineer (Part 1)
23 Jan 2024
Contributed by Lukas
How can you grow as an SRE? You've probably thought about your career progression at some point. Ash put together his initial thoughts on this top...
#23 - The Danger of Unreliable Platforms (with Jade Rubick)
16 Jan 2024
Contributed by Lukas
Jade Rubick needs no introduction in the reliability and observability space. He was VP of Engineering at New Relic from 2010 to 2019. It was my pleas...
#22 - How Google does SRE Consulting (with Yury Niño Roa)
09 Jan 2024
Contributed by Lukas
I did not know that Google itself does consulting around its SRE practices. This is not a sponsored episode LOL! I wanted to talk with my SRE friend, ...
#21 - Better SRE in 2024 is all we can hope for
02 Jan 2024
Contributed by Lukas
Sebastian is back for this episode to help set out direction for 2024. We reflected during the holidays on the problems SREs faced in 2023 in terms of...
#20 Holiday Special with Stephen Townshend
19 Dec 2023
Contributed by Lukas
Join Ash Patel and Stephen Townshend for a friendly chat about what they've learned in SRE as 2023 comes toward a wrap! This is a public episode. ...
#19 How to Develop Early Career Engineers (with John Hyland)
12 Dec 2023
Contributed by Lukas
Ash Patel talks with John Hyland, who ran the Ignite Program at New Relic, which is dedicated to developing early career engineers. John shares insights...
#18 Winning at SRE in Banking and Telecom (with Troy Koss)
05 Dec 2023
Contributed by Lukas
Ash Patel talks with Troy Koss, who is the Director of SRE at Capital One, an early adopter of DevOps and SRE in the banking sector. He shares insights ...
#17 Lessons from SRE's Wild West Days (with Rick Boone)
27 Nov 2023
Contributed by Lukas
Ash Patel talks with Rick Boone who is a pioneer in SRE, having been an early AppOps engineer at Facebook and Uber's first SRE hire. He shares ama...
#16 Acing Cloud Infra in Digital Media Giant (with Sreejith Chelanchery)
21 Nov 2023
Contributed by Lukas
Ash Patel interviews Sreejith Chelanchery, who is SVP of Delivery and Infrastructure Engineering at Dotdash Meredith. Sreejith shares his journey from pr...
#15 Growing Reliability Engineering Across 5+ Companies (with Nash Seshan)
14 Nov 2023
Contributed by Lukas
Ash Patel talks with Nash Seshan, who has supported reliability work in over 5 organizations, including Cisco, eBay, Dropbox, Lyft, Netflix, and Wayfa...
#14 Faster Incident Resolution through Data-Driven Notebooks (with Ivan Merrill)
07 Nov 2023
Contributed by Lukas
Ash Patel talks with Ivan Merrill of Fiberplane about wrangling the big data that incidents and systems generate through collaborative notebooks. Ivan...
#13 Making Sense of OpenTelemetry and Observability (with Adriana Villela)
31 Oct 2023
Contributed by Lukas
Ash Patel talks with Adriana Villela (CNCF Ambassador, OpenTelemetry contributor, and senior developer advocate at Lightstep) about the promise of Ope...
#12 From Incident Firefighting to Reliability First (with Robert Ross)
24 Oct 2023
Contributed by Lukas
Ash Patel talks with Robert Ross of FireHydrant about his experience in offering incident management software to SREs and other software incident resp...
#11 Rising to Staff Engineer in DevOps and SRE (with Rajesh Reddy N)
17 Oct 2023
Contributed by Lukas
Ash Patel interviews Rajesh Reddy N about his experiences as a senior DevOps and SRE individual contributor. Rajesh shares his insights on having syst...
#10 Using AI for Kubernetes troubleshooting self-service (with Kyle Forster)
10 Oct 2023
Contributed by Lukas
Ash Patel interviews Kyle Forster of RunWhen about his experiences as an ex-Google director helping SREs and running an AI-based company that supports...
#9 Inside Booking.com's Site Reliability Engineering practice (with Samuele Tonon and Yoann Fouquet)
02 Oct 2023
Contributed by Lukas
In this episode of the SREpath Podcast, Ash Patel interviews two SRE managers from Booking.com, Samuele and Yoann, to gain insights into their experie...
#8 Software Reliability Ninja Who is NOT an SRE (with Pablo Bouzada)
11 Sep 2023
Contributed by Lukas
Ash Patel interviews Pablo Bouzada about his beliefs on software reliability as a non-SRE leader. They discuss the importance of effective leadership ...
What happened to the podcast?
05 Sep 2023
Contributed by Lukas
We haven't hit hard times, just doing other things for the last 2 months including making plans for more interesting episodes on this podcast! Thi...
#7 Bringing HR onboard with SRE hiring and onboarding
13 Jul 2023
Contributed by Lukas
In this episode, we highlight the importance of engaging with HR partners to establish an effective understanding of the SRE career model. This will a...
#6 Building a successful SRE practice through capabilities
29 Jun 2023
Contributed by Lukas
We discuss the need for a framework to guide the development of Site Reliability Engineers (SREs) and drive value for organizations. You will learn ab...
#5 Where does SRE fit into your organization's structure?
15 Jun 2023
Contributed by Lukas
We discuss throughout this episode the different engagement models for Site Reliability Engineering (SRE) and how to contextualize SRE into an organiz...
#4 Should organizations care about SRE?
01 Jun 2023
Contributed by Lukas
This episode discusses how Site Reliability Engineering (SRE) can be important to organizations. SRE can optimize software operations, reduce costs, s...
#3 SRE vs DevOps vs Platform Engineering
17 May 2023
Contributed by Lukas
In this episode of SREpath, Ash and Sebastian discuss the unnecessary debate surrounding Site Reliability Engineering (SRE), DevOps, and platform engi...
#2 What is Site Reliability Engineering (SRE) and what is not SRE?
04 May 2023
Contributed by Lukas
In this episode of the SREpath podcast, Ash and Sebastian explore what Site Reliability Engineering (SRE) is and how it manifests in a highly function...
#1 Introducing the SREpath podcast
20 Apr 2023
Contributed by Lukas
Welcome to the first episode of the SREpath podcast! In this episode, we'll introduce you to our podcast hosts and give you their broad-level view...