Marc Brooker

Speaker
499 total appearances

Podcast Appearances

The Peterman Pod
AWS Distinguished Eng: Learning From 3000 Incidents And How Engineering Is Changing | Marc Brooker

A great post-mortem not only identifies fixes to the proximal cause, but also identifies broader fixes to technology, to organizations, to products and so on.

And so that's a kind of multiple levels thing, right?

You can't get stuck on, you know, what is the most proximal cause of an incident, but you also can't get stuck on this, "Well, you know, things fail sometimes, and what are we gonna do about it?"

And you have to come up with a set of, you know, really concrete action items to fix things at different levels: fix this particular line in the software that caused something, fix the testing processes that didn't catch that, fix the, you know, maybe social or team processes that led to those technical processes.

And then, if you're seeing patterns across multiple post-mortems, sort of level those up and say, well, clearly there's a hard underlying problem here.

Can we build a service around that?

Can we build a library around that?

Can we build a community of practice around that?

Are there technical changes we can make to avoid whole classes of things?

So that's quite a long-winded answer, but I do think it all flows from understanding and understanding at multiple levels, like understanding immediately what happened, but also understanding broadly what happened technologically and organizationally and in context, and then the ability to connect that particular event or post-mortem with other ones and extract those patterns.

One of the things that we did in DSQL was we spent a lot of time, as we were designing that, looking around at relational database-related post-mortems, both our own and our customers', and thinking about how we can design a database that helps people avoid falling into these traps.

And a really common kind of outage pattern for folks with relational databases is you have a client on a distributed system that starts a transaction and then goes out to lunch for whatever reason.

And that could be a GC pause, or it could be a lossy network, or it could be a loss of connectivity, and now it's holding locks.

And so if you look at relational databases, they don't tend to be resilient to clients misbehaving in that way.

And that's a really common cause of operational issues for systems built on relational databases.
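
The outage pattern described above can be illustrated with a small simulation. This is only a sketch of the failure mode, not of how any real database (DSQL included) behaves: a `threading.Lock` stands in for a row lock, `time.sleep` stands in for the GC pause or network loss, and the names `stalled_client`, `healthy_client`, and `run_demo` are hypothetical, chosen for this example.

```python
import threading
import time

# Stand-in for a lock a database would hold on behalf of an open transaction.
row_lock = threading.Lock()


def stalled_client(stall_seconds, started):
    """Start a 'transaction', take the lock, then stall before finishing."""
    with row_lock:
        started.set()              # signal that the lock is now held
        time.sleep(stall_seconds)  # stand-in for a GC pause or lost connectivity


def healthy_client(timeout_seconds):
    """Try to touch the same row; give up if the lock is not free in time."""
    acquired = row_lock.acquire(timeout=timeout_seconds)
    if acquired:
        row_lock.release()
    return acquired


def run_demo(stall_seconds=0.5, timeout_seconds=0.1):
    """Return (blocked_during_stall, succeeded_after_stall)."""
    started = threading.Event()
    worker = threading.Thread(target=stalled_client, args=(stall_seconds, started))
    worker.start()
    started.wait()  # make sure the stalled client really holds the lock

    blocked = not healthy_client(timeout_seconds)  # well-behaved client times out
    worker.join()                                  # the stall eventually ends
    recovered = healthy_client(timeout_seconds)    # now the same request succeeds
    return blocked, recovered


if __name__ == "__main__":
    print(run_demo())
```

The well-behaved client does nothing wrong, yet it fails for as long as the other client is stalled, which is why one misbehaving client can turn into an operational issue for everyone sharing the data.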

And so as we were designing DSQL, we were thinking, how do we avoid this?