Marc Brooker
Podcast Appearances
A great post-mortem not only identifies fixes to the proximal cause, but also identifies broader fixes to technology, to organizations, to products, and so on.
And so that's a kind of multiple levels thing, right?
You can't get stuck on, you know, what is the most proximal cause of an incident, but you also can't get stuck on this "well, you know, things fail sometimes, and what are we gonna do about it?"
And you have to come up with a set of, you know, really concrete action items to fix things at different levels: fix this particular line in the software that caused something, fix the testing processes that didn't catch that, fix the, you know, maybe social or team processes that led to those technical processes.
And then, if you're seeing patterns across multiple post-mortems, sort of level those up and say, well, clearly there's a hard underlying problem here.
Can we build a service around that?
Can we build a library around that?
Can we build a community of practice around that?
Are there technical changes we can make to avoid whole classes of things?
So that's quite a long-winded answer, but I do think it all flows from understanding and understanding at multiple levels, like understanding immediately what happened, but also understanding broadly what happened technologically and organizationally and in context, and then the ability to connect that particular event or post-mortem with other ones and extract those patterns.
One of the things that we did in DSQL was we spent a lot of time, as we were designing it, looking at relational database-related post-mortems, both our own and our customers', and thinking about how we can design a database that helps people avoid falling into these traps.
And a really common kind of outage pattern for folks with relational databases is you have a client on a distributed system that starts a transaction and then goes out to lunch for whatever reason.
And that could be a GC pause, or it could be a lossy network, or it could be a loss of connectivity, and now it's holding locks.
And so if you look at relational databases, they don't tend to be resilient to clients misbehaving in that way.
And that's a really common cause of operational issues for systems built on relational databases.
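As an illustration of the failure mode described here, the following is a minimal sketch, not anything taken from the conversation or from a real database. It assumes a single in-process lock standing in for a row lock held by an open transaction, and the names `row_lock`, `stalled_client`, `healthy_client`, and `run_demo` are hypothetical. One client takes the lock and then stalls (a simulated GC pause or network loss), and a second, healthy client cannot make progress until the stall ends.

```python
import threading
import time

# Hypothetical stand-in for a row lock held by an open transaction.
row_lock = threading.Lock()

def stalled_client(stall_seconds, holding):
    """Begin a 'transaction', take the lock, then stall before committing."""
    with row_lock:
        holding.set()              # signal that the lock is now held
        time.sleep(stall_seconds)  # simulated GC pause or lost connectivity

def healthy_client(wait_seconds):
    """Try to take the same lock, giving up after wait_seconds."""
    acquired = row_lock.acquire(timeout=wait_seconds)
    if acquired:
        row_lock.release()
    return acquired

def run_demo(stall_seconds=0.5, wait_seconds=0.1):
    """Return (blocked_during_stall, succeeded_after_stall)."""
    holding = threading.Event()
    stalled = threading.Thread(target=stalled_client,
                               args=(stall_seconds, holding))
    stalled.start()
    holding.wait()                       # wait until the lock is really held
    blocked = not healthy_client(wait_seconds)
    stalled.join()                       # the stalled client finally finishes
    recovered = healthy_client(wait_seconds)
    return blocked, recovered

if __name__ == "__main__":
    print(run_demo())
```

In this sketch the healthy client's work is delayed for exactly as long as the misbehaving client stalls, which is the dependency the speaker is pointing at: the availability of one client ends up tied to the behavior of another.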
And so as we were designing DSQL, we were thinking, how do we avoid this?