We hit a new (and disturbing!) failure mode recently when a production rack that had been up for several months saw every (!) compute sled's service processor become simultaneously unresponsive. Bryan and Adam were joined by the members of the Oxide team who debugged the vexing issue -- and reached its surprising root cause.

In addition to Bryan Cantrill and Adam Leventhal, we were joined by Oxide colleagues Cliff Biffle, Matt Keeter, and Will Chandler.

Previously, on Oxide and Friends:
OxF s05e03 – Holistic Engineering with Robert Mustacchi
OxF s04e14 – Rebooting a datacenter: A decade later
OxF s01e26 – The Pragmatism of Hubris
OxF s05e20 – Debugger-Driven Development (omdb)
OxF s05e07 – Transparency in Hardware/Software Interfaces
OxF s05e31 – Futurelock
OxF s05e33 – A Grown-up ZFS Data Corruption Bug

Some of the topics we hit on, in the order that we hit them:
hubris #2304: STM32H7 Ethernet driver stops yielding CPU after many packets
gist: Summarizing the Hubris side of investigations
Matt's blog: Hunting a spooky ethernet driver bug

If we got something wrong or missed something, please file a PR! Our next show will likely be on Monday at 5p Pacific Time on our Discord server; stay tuned to our Mastodon feeds for details, or subscribe to this calendar. We'd love to have you join us, as we always love to hear from new speakers!