Marc Brooker

AWS Distinguished Eng: Learning From 3000 Incidents And How Engineering Is Changing | Marc Brooker

A great post-mortem identifies not only fixes to the proximal cause, but also broader fixes to technology, to organizations, to products, and so on.

The Peterman Pod

And so that's a kind of multi-level thing, right?

You can't get stuck on what the most proximal cause of an incident is, but you also can't get stuck on this "well, things fail sometimes, and what are we going to do about it?"

And you have to come up with a set of really concrete action items to fix things at different levels: fix this particular line in the software that caused something, fix the testing processes that didn't catch that, fix the maybe social or team processes that led to those technical processes.

And then, if you're seeing patterns across multiple post-mortems, sort of level those up and say, well, clearly there's a hard underlying problem here.

Can we build a service around that?

Can we build a library around that?

Can we build a community of practice around that?

Are there technical changes we can make to avoid whole classes of things?

So that's quite a long-winded answer, but I do think it all flows from understanding at multiple levels: understanding immediately what happened, but also understanding broadly what happened technologically, organizationally, and in context, and then being able to connect that particular event or post-mortem with other ones and extract those patterns.
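One way to picture connecting post-mortems and extracting patterns is a small tally of contributing factors across records. This is only a sketch under stated assumptions: the record layout, the IDs, and the factor tags are all invented for illustration and do not come from the episode or from any real post-mortem tooling.

```python
from collections import Counter

# Hypothetical post-mortem records; IDs and factor tags are invented.
postmortems = [
    {"id": "PM-1", "factors": {"long-held-lock", "missing-timeout"}},
    {"id": "PM-2", "factors": {"bad-deploy", "missing-timeout"}},
    {"id": "PM-3", "factors": {"long-held-lock", "retry-storm"}},
    {"id": "PM-4", "factors": {"missing-timeout"}},
]


def recurring_factors(records, min_count=2):
    """Return (factor, count) pairs seen in at least min_count post-mortems."""
    counts = Counter(f for r in records for f in r["factors"])
    # Most frequent first; ties broken alphabetically for a stable order.
    return sorted(
        ((f, n) for f, n in counts.items() if n >= min_count),
        key=lambda item: (-item[1], item[0]),
    )


patterns = recurring_factors(postmortems)
for factor, count in patterns:
    print(f"{factor}: appears in {count} post-mortems")
```

A factor that keeps recurring is a candidate for a fix at the broader level (a shared library, a service, or a change in practice) rather than another one-line patch.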

One of the things that we did in DSQL was we spent a lot of time, as we were designing it, looking at relational database-related post-mortems, both our own and our customers', and thinking about how we can design a database that helps people avoid falling into these traps.

And a really common kind of outage pattern for folks with relational databases is you have a client on a distributed system that starts a transaction and then goes out to lunch for whatever reason.

And that could be a GC pause, or it could be a lossy network, or it could be a loss of connectivity, and now it's holding locks.

And so if you look at relational databases, they don't tend to be resilient to clients misbehaving in that way.

And that's a really common cause of operational issues for systems built on relational databases.
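The failure mode described here can be shown with a toy simulation: client A takes a lock and stalls, and client B retries behind it. This is a minimal sketch, not a real database. The `LockTable` class, the logical clock, and the lease-based expiry are all assumptions chosen for illustration; the lease is a generic mitigation and is not presented as how DSQL or any particular database handles this.

```python
class LockTable:
    """Toy in-memory lock table driven by a logical clock (no real database)."""

    def __init__(self, lease_seconds=None):
        # lease_seconds=None models locks held until explicitly released.
        self.lease_seconds = lease_seconds
        self.holders = {}  # row key -> (client, time acquired)

    def try_lock(self, key, client, now):
        holder = self.holders.get(key)
        if holder is not None:
            owner, acquired_at = holder
            expired = (
                self.lease_seconds is not None
                and now - acquired_at >= self.lease_seconds
            )
            if owner != client and not expired:
                return False  # still blocked behind the current holder
        self.holders[key] = (client, now)
        return True


def attempts_until_acquired(table, max_ticks=60):
    """Client A locks a row at t=0 and stalls; client B retries once per tick."""
    assert table.try_lock("row-1", "A", now=0)
    for t in range(1, max_ticks + 1):
        if table.try_lock("row-1", "B", now=t):
            return t
    return None  # B never got the lock within the window


no_lease = attempts_until_acquired(LockTable())
with_lease = attempts_until_acquired(LockTable(lease_seconds=5))
print("without lease:", no_lease)
print("with 5s lease:", with_lease)
```

Without a lease, B stays blocked for the whole window because A never releases. With a lease, the stalled client's lock stops counting once it expires, so one misbehaving client cannot block everyone else indefinitely.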

And so as we were designing DSQL, we were thinking, how do we avoid this?
