
Oxide and Friends

Death by Uptime

08 Dec 2025

Description

We hit a new (and disturbing!) failure mode recently when a production rack that had been up for several months saw every (!) compute sled's service processor become simultaneously unresponsive. Bryan and Adam were joined by the members of the Oxide team who debugged the vexing issue -- and reached its surprising root cause.

In addition to Bryan Cantrill and Adam Leventhal, we were joined by Oxide colleagues Cliff Biffle, Matt Keeter, and Will Chandler.

Previously, on Oxide and Friends:
OxF s05e03 – Holistic Engineering with Robert Mustacchi
OxF s04e14 – Rebooting a datacenter: A decade later
OxF s01e26 – The Pragmatism of Hubris
OxF s05e20 – Debugger-Driven Development (omdb)
OxF s05e07 – Transparency in Hardware/Software Interfaces
OxF s05e31 – Futurelock
OxF s05e33 – A Grown-up ZFS Data Corruption Bug

Some of the topics we hit on, in the order that we hit them:
hubris #2304: STM32H7 Ethernet driver stops yielding CPU after many packets
gist: Summarizing the Hubris side of investigations
Matt's blog: Hunting a spooky ethernet driver bug

If we got something wrong or missed something, please file a PR! Our next show will likely be on Monday at 5p Pacific Time on our Discord server; stay tuned to our Mastodon feeds for details, or subscribe to this calendar. We'd love to have you join us, as we always love to hear from new speakers!
