Marc Brooker

Speaker
499 total appearances

Podcast Appearances

The Peterman Pod
AWS Distinguished Eng: Learning From 3000 Incidents And How Engineering Is Changing | Marc Brooker

how are you supposed to identify what to fix, right?

You can come up with some theories about those, but they're probably not gonna be right.

And again, I don't think there's a huge amount of value in the rote ticket-closing work of on-call.

I think automation should be doing those kinds of work, but I think there's fantastic value in deep understanding, deep investigations, and deep reflection on what you learn from postmortems and COEs.

I tried to estimate a couple of months ago, for a talk, how many industry postmortems and Amazon COEs I'd read over my career.

The best estimate I could come up with, and this was about a year ago, was between 3,000 and 4,000.

And so, you know, even a little bit of lesson from each one, and it tends to stick.

So I think what makes a really great post-mortem is first really getting into the details and making sure that you deeply understand what happened rather than just assuming what happened based on the biases you bring in.

And so there's a kind of lesson one there: if you can't understand what happened, well, that teaches you something about your logging and metrics and observability and simulations and all of these other things.

And then once you deeply understand what happened, then the ability, then a great post-mortem steps through the whys behind that at multiple levels, right?

Like why, well, yeah, there was a code bug.

Okay, sure, code bugs, yes, we can fix that.

But we can't stop there, right?

Like why was that missed in testing and validation, right?

you know, for these reasons, you know, what can we improve?

What can we build around those?

Okay, next step, you know, why, you know, why was our testing and validation where it was?

Or, you know, why did we assume a certain thing about the behavior of the system that we wouldn't have assumed before?

And so as you sort of get through these deeper and deeper layers,