Martin Kleppmann
Podcast Appearances
Distributed system theory just doesn't make any assumptions about that sort of timing if we can avoid it.
Or rather, some theory does make those assumptions, but it's a dangerous assumption to make because occasionally the network delay does become much higher than what is typical.
Another thing is about crashes, for example.
Distributed system theory just says that nodes can crash. But what does that actually mean? What, in practice, does it mean for a node to become unavailable? It might be a software crash, but it might also be a hardware failure, it might be somebody unplugging the power cable, or it might be that the node is actually still running but has just become disconnected from the network.
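To make the point about crashes concrete, here is a minimal sketch of a timeout-based failure detector. It is an illustration added to the transcript, not something from the book: the `HeartbeatMonitor` class, the node names, and the 5-second timeout are all hypothetical. It shows that, from the observer's side, a crashed node, a disconnected node, and a merely slow node can look the same, because all the observer sees is missing heartbeats.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class HeartbeatMonitor:
    """Hypothetical timeout-based failure detector (illustration only)."""

    timeout_s: float
    last_seen: Dict[str, float] = field(default_factory=dict)

    def record_heartbeat(self, node: str, now: float) -> None:
        # Remember when we last heard from this node.
        self.last_seen[node] = now

    def suspected(self, node: str, now: float) -> bool:
        # We can only say "no heartbeat within the timeout". We cannot say
        # whether the node crashed, lost power, or is cut off by the network.
        last = self.last_seen.get(node)
        return last is None or (now - last) > self.timeout_s


monitor = HeartbeatMonitor(timeout_s=5.0)
for node in ("crashed", "partitioned", "slow"):
    monitor.record_heartbeat(node, now=0.0)

# At t=6 no node has been heard from for 6 seconds, so all three are
# suspected, even though only one of them has actually stopped running.
status_at_6 = {n: monitor.suspected(n, now=6.0)
               for n in ("crashed", "partitioned", "slow")}

# The slow node's delayed heartbeat finally arrives at t=7.
monitor.record_heartbeat("slow", now=7.0)
slow_at_8 = monitor.suspected("slow", now=8.0)
```

In this sketch the slow node is wrongly suspected at t=6 and cleared again once its late heartbeat arrives, which is the kind of occasional long delay mentioned earlier. A longer timeout reduces such false suspicions at the cost of slower detection of real failures.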
The point of this book chapter really is to defend and justify those theoretical models that we use for analyzing distributed systems, and to give a lot of stories and case studies that show that actually tons of stuff does go wrong.
And don't believe anyone who says, "Oh, failures are rare, don't worry about it, it's fine." The moral of this chapter is really that if you want to make things reliable, you do have to worry about a whole bunch of weird, unusual, but certainly possible edge cases.
Timing is another one of those things. It's very easy to assume that your clocks are correct, and most of the time the clocks are pretty correct. But we just can't rely on it, because on the whole they're just not precise enough.
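One small, common instance of not trusting the clock is measuring elapsed time. The sketch below is an illustration added to the transcript and is not taken from the book; the `measure` helper is a hypothetical name. It uses Python's `time.monotonic()`, which is documented never to go backwards, instead of `time.time()`, which follows the wall clock and can jump when the system clock is adjusted.

```python
import time
from typing import Callable


def measure(fn: Callable[[], None]) -> float:
    """Return how long fn() took, in seconds, using a monotonic clock.

    Hypothetical helper for illustration. A monotonic clock is only
    meaningful on the machine that read it, so these values must not be
    compared across nodes.
    """
    start = time.monotonic()
    fn()
    return time.monotonic() - start


def measure_with_wall_clock(fn: Callable[[], None]) -> float:
    """Same measurement using the wall clock, shown for contrast.

    If the system clock is stepped backwards while fn() runs, this can
    return a negative duration.
    """
    start = time.time()
    fn()
    return time.time() - start


duration = measure(lambda: sum(range(1000)))
```

This only addresses elapsed time on a single machine. Comparing timestamps taken on different machines is a separate problem that a monotonic clock does not solve.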
And so a lot of it is about how tempting it is to make certain assumptions that things are well behaved. In distributed systems, we just have to try to get away from those assumptions if we want the systems to work reliably, even in the face of things going wrong.
But it was a really fun chapter to write because, you know, it's essentially a big collection of stuff that has gone wrong.
And so I went through a bunch of postmortems published by various tech companies, for example, in order to see, OK, what was the root cause of how things went wrong and what kind of lessons can we draw from this that apply to the book in general?
And, you know, there's some fun stuff like the sharks biting undersea cables and damaging them.
That just, you know, makes for a great story.
And then I hear that in recent years, the shielding of undersea cables has got better and therefore the sharks are not biting them anymore.
But instead, the cows on land are stepping on cables and occasionally causing network interruptions that way.
And, you know, that sort of thing just makes it a bit more fun.
Yeah, but I think there's no right answer. It's a trade-off between risk and cost, broadly speaking, and that means a business decision has to be made in terms of where the business wants to lie on that trade-off. And so the goal of this chapter is really just to give people the information to make an educated decision. But I don't want to make that decision for people; that's for businesses themselves to decide. That's very clear.
Yeah, so there are some things that we've been able to take out of the book compared to the first edition.
In particular, for example, coverage of MapReduce was quite detailed in the first edition, but basically MapReduce is dead.
Nobody uses it anymore.