Platform Engineering Playbook Podcast
Episodes
AWS re:Invent 2025 Part 1/4 - The Agentic AI Revolution
08 Dec 2025
Contributed by Lukas
AWS announces autonomous AI agents that can work for days without human intervention. The DevOps Agent is an always-on incident responder. The Securit...
Developer Experience Metrics Beyond DORA
07 Dec 2025
Contributed by Lukas
DORA metrics revolutionized how we measure DevOps performance, but are we missing the bigger picture? This episode explains DORA from the ground up—...
Cloudflare's Trust Crisis - December 2025 Outage and the Human Cost
06 Dec 2025
Contributed by Lukas
Three weeks after their worst outage since 2019, Cloudflare went down again. On December 5, 2025, a Lua code bug took down 28% of HTTP traffic for 25 ...
Cloud Cost Quick Wins for Year-End
05 Dec 2025
Contributed by Lukas
Global cloud spend hits $720 billion in 2025—and organizations waste 20-30% on unused resources. Year-end is the perfect time to show savings before...
Platform Engineering vs DevOps vs SRE - The Identity Crisis
04 Dec 2025
Contributed by Lukas
Platform Engineer roles pay 20% more than DevOps Engineer roles, but job descriptions are 90% identical. Is Platform Engineering just DevOps with bett...
Platform Engineering Certification Tier List 2025
03 Dec 2025
Contributed by Lukas
Are certifications worth it? The answer is: it depends. And that's precisely the problem. In this episode, Jordan and Alex rank 25+ certifications usi...
Kubernetes AI Conformance - The End of AI Infrastructure Chaos
02 Dec 2025
Contributed by Lukas
The Wild West of AI infrastructure just ended. CNCF launched the Certified Kubernetes AI Conformance Program at KubeCon Atlanta on November 11, 2025. ...
Helm 4 - The Definitive Guide to the Biggest Update in 6 Years
01 Dec 2025
Contributed by Lukas
Helm 4.0 dropped at KubeCon Atlanta 2025, marking the biggest update in 6 years. Server-Side Apply finally ends the GitOps ownership wars. WASM plugin...
CNPE Certification Guide - The First Platform Engineering Credential
30 Nov 2025
Contributed by Lukas
CNCF just launched the first-ever hands-on platform engineering certification at KubeCon Atlanta 2025. But with beta testers reporting 29% scores, is ...
10 Platform Engineering Anti-Patterns That Kill Developer Productivity
29 Nov 2025
Contributed by Lukas
DORA 2024 found organizations with platform teams saw throughput decrease by 8% and stability decrease by 14%. Wait—isn't platform engineering suppo...
Black Friday War Stories: Lessons from E-Commerce's Worst Days
28 Nov 2025
Contributed by Lukas
Why do major retailers with unlimited budgets still crash on Black Friday? This episode dives into the graveyard of e-commerce outages—from J.Crew's...
Giving Thanks to Your Dependencies: A Platform Engineer's Gratitude Guide
27 Nov 2025
Contributed by Lukas
This Thanksgiving, let's talk about the people you've never thanked. 60% of open source maintainers are unpaid. 60% have left or considered leaving. Y...
KubeCon Atlanta 2025 Part 3: Community at 10 Years - The Sustainability Question
26 Nov 2025
Contributed by Lukas
CNCF celebrates 10 years with 300,000 contributors and 230+ projects—but the hallway track told a different story. 60% of maintainers unpaid. 60% ha...
KubeCon Atlanta 2025 Part 2: Platform Engineering Consensus and Community Reality Check
25 Nov 2025
Contributed by Lukas
After years of "what even IS platform engineering" debates, KubeCon 2025 delivered consensus: three non-negotiable principles, real-world adoption at ...
KubeCon 2025 Part 1: AI Goes Native and the 30K Core Lesson
24 Nov 2025
Contributed by Lukas
Google donates a GPU driver live on stage. OpenAI saves $2.16M/month with one line of code. Kubernetes rollback finally works after 10 years. What cha...
The $4,350/Month GPU Waste Problem: How Kubernetes Architecture Creates Massive Cost Inefficiency
23 Nov 2025
Contributed by Lukas
Your H100 costs $5,000 per month, but you're only using it at 13% capacity—wasting $4,350 monthly per GPU. Analysis of 4,000+ Kubernetes clusters re...
Service Mesh Showdown: Why User-Space Beat eBPF
22 Nov 2025
Contributed by Lukas
Kernel-level eBPF should beat user-space proxies—but Istio Ambient delivers 8% mTLS overhead while Cilium shows 99%. Academic benchmarks reveal why ...
The Terraform vs OpenTofu Debate - Why "Just Switch" Is Bad Advice
21 Nov 2025
Contributed by Lukas
HashiCorp's license change and IBM's $6.4B acquisition created the "you must migrate" narrative—but 70% of teams using Terraform in-house aren't leg...
Agentic DevOps: GitHub Agent HQ and the Autonomous Pipeline Revolution
20 Nov 2025
Contributed by Lukas
GitHub Universe 2025 announced Agent HQ—mission control for orchestrating AI agents from OpenAI, Anthropic, Google, and more. Azure SRE Agent saved ...
Cloudflare Outage November 2025: When a Rust Panic Took Down 20% of the Internet
19 Nov 2025
Contributed by Lukas
A routine database permissions change triggered Cloudflare's worst outage since 2019—taking down ChatGPT, X, Shopify, Discord, and 20% of the intern...
Ingress NGINX Retirement: The March 2026 Migration Deadline
19 Nov 2025
Contributed by Lukas
The de facto standard Kubernetes ingress controller will stop receiving security patches in March 2026—and only 1-2 people have been maintaining it ...
OpenTelemetry eBPF Instrumentation: Zero-Code Observability Under 2% Overhead
18 Nov 2025
Contributed by Lukas
What if you could achieve complete observability coverage—every HTTP request, database query, and gRPC call—without touching application code? Jor...
The Open Source Observability Showdown: When "Free" Costs $12K/Month
17 Nov 2025
Contributed by Lukas
Prometheus is free, Grafana is free, Loki is free—yet Datadog posted $2.3B in revenue and Shopify runs a 15-person team just to manage their observa...
The Kubernetes Complexity Backlash: When Simpler Infrastructure Wins
16 Nov 2025
Contributed by Lukas
Kubernetes commands 92% market share, yet 88% report year-over-year cost increases and 25% plan to shrink deployments. We unpack the 3-5x cost underes...
SRE Reliability Principles: The 26% Problem - Error Budgets, SLOs, Platform Engineering
16 Nov 2025
Contributed by Lukas
Only 26% of organizations actively use SLOs after a decade of Google's SRE principles being gospel. We explore why adoption is so low despite 49% sayi...
Internal Developer Portal Showdown 2025: Backstage vs Port vs Cortex vs OpsLevel
14 Nov 2025
Contributed by Lukas
Your team spent 6 months implementing Backstage. Adoption? 8%. The CFO asks: "Why didn't we buy a solution?" Here's the 2025 comparison with real pric...
DNS for Platform Engineering: The Silent Killer
13 Nov 2025
Contributed by Lukas
Why does a forty-year-old protocol keep taking down billion-dollar infrastructure? The October 2025 AWS outage lasted fifteen hours because of a DNS r...
eBPF in Kubernetes: Kernel-Level Superpowers Without the Risk
12 Nov 2025
Contributed by Lukas
Your Kubernetes cluster is a black box—Prometheus shows symptoms, not causes. eBPF turns the Linux kernel into a programmable platform for observabi...
Time Series Language Models
11 Nov 2025
Contributed by Lukas
AI models that can read your infrastructure metrics like language, explain anomalies in plain English, and predict failures without training on your d...
Kubernetes IaC & GitOps - The Workflow Paradox
11 Nov 2025
Contributed by Lukas
77% of organizations have adopted GitOps, 60% run ArgoCD—yet platform teams are still bottlenecks and deployments still take days. Jordan and Alex i...
The FinOps AI Paradox: Why Smart Tools Don't Cut Costs (And What Actually Does)
09 Nov 2025
Contributed by Lukas
Your company spent $500K on AI-powered FinOps tools. The AI identified $3M in potential savings. Ninety days later, you've implemented $180K—just 6%...
The DevOps Toolchain Crisis: Why Adding Tools Makes Teams Slower
08 Nov 2025
Contributed by Lukas
Your team spent $500K on productivity tools. So why are engineers slower than last year? Jordan and Alex unpack the hidden crisis: 75% of teams lose 1...
Kubernetes Production Mastery Lesson 4: Health Checks & Probes
07 Nov 2025
Contributed by Lukas
Learn how to configure Kubernetes health checks that prevent production outages. This episode covers the three types of probes (liveness, readiness, s...
Kubernetes Production Mastery Lesson 3: Security Foundations - RBAC & Secrets
06 Nov 2025
Contributed by Lukas
RBAC misconfiguration is the number one Kubernetes security vulnerability. Learn how to implement namespace-scoped RBAC roles, secure secrets manageme...
The Cloud Repatriation Debate: When AWS Costs 10-100x More Than It Should
05 Nov 2025
Contributed by Lukas
An in-depth analysis of cloud repatriation economics, examining real companies saving millions by leaving AWS. Jordan and Alex discuss 37signals' $2M ...
Kubernetes in 2025: The Maturity Paradox
04 Nov 2025
Contributed by Lukas
Kubernetes has 92% market share, but "do we actually need this?" is the loudest conversation in platform engineering. This episode explores the maturi...
Backstage in Production: The 10% Adoption Problem
03 Nov 2025
Contributed by Lukas
Your team spent 9 months implementing Backstage. The portal looks beautiful. But internal adoption? 8%. Spotify's VP of Engineering has publicly ackno...
Platform Engineering ROI Calculator: Prove Value to Executives
30 Oct 2025
Contributed by Lukas
45% of platform teams measure nothing and get disbanded when they can't prove ROI. Jordan and Alex break down the exact ROI calculation framework that...
Why 70% of Platform Engineering Teams Fail (And the 5 Metrics That Predict Success)
28 Oct 2025
Contributed by Lukas
60-70% of platform engineering teams fail to deliver impact, with 45% disbanded within 18 months. We investigate why technically excellent teams with ...
Lesson 02: Resource Management - Kubernetes Production Mastery
28 Oct 2025
Contributed by Lukas
Your pods keep getting OOMKilled at the worst possible times. In this lesson, you'll master the difference between requests and limits, understand the...
Kubernetes Production Mastery - Lesson 01: Production Mindset
27 Oct 2025
Contributed by Lukas
Transform from a Kubernetes user into a production engineer. Learn the mental shift from development to production, identify the 5 failure patterns th...
GCP State of the Union 2025 - When Depth Beats Breadth
26 Oct 2025
Contributed by Lukas
GCP grows at 32% while AWS manages 17%—nearly 2x faster despite having half the services. We break down why Google's depth-over-breadth strategy is ...
The $75 Million Per Hour Lesson: Inside the 2025 AWS us-east-1 Outage
25 Oct 2025
Contributed by Lukas
October 19, 2025. 11:48 PM: A DNS race condition in DynamoDB took down 70 AWS services for 14 hours, affecting 1,000+ companies and costing $75M/hour....
AWS State of the Union 2025 - Navigate 200+ Services with Strategic Clarity
24 Oct 2025
Contributed by Lukas
AWS has over 200 services, but which 20 actually matter for your platform? We cut through the documentation maze to give you strategic service selecti...
Platform Tools Tier List
23 Oct 2025
Contributed by Lukas
Which platform engineering skills command $24,000+ higher salaries? We analyze 220+ tools from the Dice 2025 Tech Salary Report, break down the commod...
Same App: $41 on Railway vs $1,010 on Vercel - The Real Cost of 'Simple' PaaS
22 Oct 2025
Contributed by Lukas
Everyone promises Heroku-like simplicity with cloud-scale performance, but which PaaS actually delivers? We break down real-world costs for identical ...
130 Tools, 20% Utilization, $71K/Year Lost Per Engineer - The Platform Sprawl Tax
21 Oct 2025
Contributed by Lukas
Enterprise teams manage 130+ tools but only use 10-20% of their capabilities. Engineers juggle 16 monitoring tools on average—40 when SLAs get stric...
Cloud Providers in 2025 - Platform Abstractions, GPU Dynamics, and the New Multi-Cloud Reality
20 Oct 2025
Contributed by Lukas
AWS still dominates at 32% market share, but new deployments tell a different story. Platform abstractions (Vercel, Fly.io, Railway) mean developers n...
75% of Your Team Uses Unauthorized AI - Why Your Blocking Strategy Backfires
19 Oct 2025
Contributed by Lukas
85% of organizations are facing a crisis: employees adopting AI tools 890% faster than IT can assess them. The "just block it" approach fails 100% of ...