The Peterman Pod

Turing Award Winner: Postgres, Disagreeing with Google, Future Problems | Mike Stonebraker

20 Apr 2026

Transcription

Chapter 1: What is the main topic discussed in this episode?

0.031 - 3.776 Mike Stonebraker

Computer science may well not be a growth industry going forward.

Chapter 2: How did Mike Stonebraker get into databases?

4.497 - 12.569 Mike Stonebraker

This is Mike Stonebraker. He's a Turing Award winner famous for his fundamental contributions to database systems, like creating Postgres and more.

12.91 - 15.453 Ryan Peterman

What was the hardest part of that implementation?

15.614 - 18.718 Mike Stonebraker

Query optimizer. It's just algorithmically difficult.

19.339 - 22.123 Ryan Peterman

How do you identify the people who aren't smart?

22.205 - 27.41 Mike Stonebraker

Well, I mean, it's very easy. He shared interesting technical takes from his experience.

Chapter 3: What made Postgres special compared to other databases?

27.81 - 31.354 Mike Stonebraker

On our benchmarks, large language models get 0%.

32.435 - 35.638 Ryan Peterman

Why did you disagree so much with MapReduce?

35.658 - 38.34 Mike Stonebraker

That wasn't the only thing Google was stupid about.

39.401 - 62.686 Ryan Peterman

I'm curious your thoughts on unsolved problems in databases and what you think the future might look like. Here's the full episode. The first thing I want to go over is the story of how Postgres got started. But for that, I kind of want to start at the beginning. How did you get into building database systems?

63.167 - 98.314 Mike Stonebraker

When I graduated, I had the good fortune of being hired at Berkeley. And it was clear I had to, you know, continuing what I did for my PhD was not going to go anywhere. Then, as well as today, you're way ahead if you get adopted by a mentor who knows the ropes. So Gene Wong, who is still alive and still kicking, took me under his wing and said, well, let's do something together.

Chapter 4: Why does Mike Stonebraker disagree with Google on database solutions?

100.319 - 136.519 Mike Stonebraker

And this was 1971, which was the year after Ted Codd wrote his pioneering paper in CACM. Gene Wong said, well, let's take a look at database stuff. At the time, the competitors were a thing called the CODASYL proposal, which you're probably too young to have ever heard of. And so it was a low-level spaghetti network proposal where you executed queries by following pointers.

136.499 - 174.673 Mike Stonebraker

And then the alternative was the IBM proposal, which was a thing called IMS, which is still available. And it's hierarchical data. It organized your data as trees. And even at the time, IBM realized that trees were not general enough to solve many people's problems. So they hacked on a way to make it a limited network structure. So it was clear that was a horrible hack.

174.653 - 201.552 Mike Stonebraker

The CODASYL proposal had all kinds of bad properties besides being low level and really hard to debug. It also had the property that if anything changed in what's now called your schema, you basically had to throw away everything and do it all again because it was absolutely rooted at the physical level. Whereas Ted Codd's stuff made perfect sense.

202.833 - 234.462 Mike Stonebraker

And so Gene said, well, let's build one of these puppies. That's clearly the next thing to try. So we started building Ingres in 1972. I was an assistant professor at Berkeley. As you know, if you're an assistant professor, you get five years to prove that you're a big shit. And they fire you or they give you tenure. So Ingres was my ticket to getting tenure, which happened in 1976.

235.201 - 237.646 Mike Stonebraker

That was where it started.

Chapter 5: What are the future problems facing databases today?

237.666 - 267.885 Mike Stonebraker

And then again, happenstance. At the time, a lot of people would build prototypes, which were sort of studenty-like code, which means you could get it to run, but if you gave it to anybody else, they couldn't. So we put in the first 90% to get something we could run. And then for whatever reason, we put in the next 90% to get it to where it really worked.

269.107 - 297.675 Mike Stonebraker

So the University of California version of Ingres really worked. And so over the next couple of years, about 100 universities started running it because Unix became the big thing. And so this was a free database system that ran on Unix. And so it was quite popular in the academic world.

297.655 - 305.807 Mike Stonebraker

And so we started getting lots of visitors at Berkeley who would say, gee, this is really nifty-looking stuff.

Chapter 6: What advice does Mike Stonebraker have for aspiring database professionals?

306.769 - 339.137 Mike Stonebraker

What's the biggest Ingres application you have? And we'd be forced to say not very big. And so this was brought home in spades when Arizona State University considered running Ingres on their student records data, all 40,000 students' worth. And they could get over that they had to get an unsupported operating system from Bell Labs.

340.13 - 371.882 Mike Stonebraker

They could also get over that they had to run an unsupported database system from these guys at Berkeley. But the project went down in flames when they realized there was no COBOL available for Unix, and they were a COBOL shop. So unsupported operating system, unsupported database system, no COBOL, doomed us to, you know, irrelevance. And it was clear the only way out of that was to start a company.

373.104 - 402.402 Mike Stonebraker

And so in 1980, we got venture capital as it existed then and started Ingres Corporation to move Ingres to DEC VMS, a real operating system. And we had a real company that would support Ingres. And that was the start of the commercial journey.

403.343 - 421.453 Ryan Peterman

I saw that Ingres was competing with Larry Ellison's offering at Oracle. Yes. I saw that Ingres was certainly better than what they were offering, but they were still competing somehow. How did they compete?

422.108 - 460.62 Mike Stonebraker

Larry Ellison is a fabulous salesman. And he, at the time, he made present tense and future tense indistinguishable. And so he basically lied to customers. He would ship stuff that didn't work and have his initial customers help him debug it. So I think he engaged in what I consider very shady business practices. But lying to customers, I think, is unconscionable. So for instance,

462.524 - 488.602 Mike Stonebraker

There was a thing called referential integrity, which is if you fire an employee and he's the last person in a given department, do you want to delete the department or do you want to have it be a ghost department? It's all that kind of stuff. And so Ingres Corporation implemented referential integrity.

489.683 - 503.584 Mike Stonebraker

Oracle Corporation wrote two manual pages that said, here's the definition of referential integrity, which everybody agreed to. And then down at the bottom it said, not yet implemented.
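
For readers who have not met the term, the referential integrity question above can be sketched with Python's built-in `sqlite3` module. This is only an illustration, not how Ingres or Oracle implemented it; the table names, column names, and the `demo_referential_integrity` function are assumptions made up for the example.

```python
import sqlite3


def demo_referential_integrity():
    """Show what a foreign key constraint allows and rejects (illustrative schema)."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE department (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE employee ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " dept_id INTEGER NOT NULL REFERENCES department(id))"
    )
    conn.execute("INSERT INTO department VALUES (1, 'Shoe')")
    conn.execute("INSERT INTO employee VALUES (1, 'Ryan', 1)")

    # An employee that points at a department which does not exist is rejected.
    try:
        conn.execute("INSERT INTO employee VALUES (2, 'Mike', 99)")
        orphan_rejected = False
    except sqlite3.IntegrityError:
        orphan_rejected = True

    # Deleting a department that still has an employee is rejected too.
    try:
        conn.execute("DELETE FROM department WHERE id = 1")
        delete_rejected = False
    except sqlite3.IntegrityError:
        delete_rejected = True

    # Firing the last employee is allowed; the department stays behind as an
    # empty "ghost" department unless something else removes it.
    conn.execute("DELETE FROM employee WHERE id = 1")
    departments_left = conn.execute("SELECT COUNT(*) FROM department").fetchone()[0]
    conn.close()
    return orphan_rejected, delete_rejected, departments_left
```

Whether the empty department should be deleted automatically is exactly the policy choice described in the conversation; the sketch leaves it in place.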

505.454 - 535.461 Ryan Peterman

Interesting. Yeah, I had interviewed someone who worked at Sun Microsystems and they had a similar opinion that Larry Ellison was a little bit shady. So it seems to be a commonality. I also saw somewhere else and something that you had said was that when Oracle acquired MySQL, that everyone kind of got afraid of that and moved to Postgres.

535.441 - 546.075 Mike Stonebraker

That was the genesis of Postgres replacing MySQL as the preferred open source relational database system.

Chapter 7: What is DBOS and how does it relate to modern database technology?

1209.513 - 1216.742 Ryan Peterman

Can you explain why indexing would be not as effective when there's SIMD?

1217.843 - 1254.482 Mike Stonebraker

So let's say I'm looking for Ryan's salary and I have a B-tree. So you go to the root of the B-tree. You find the divider that has both sides of Ryan. You follow the pointer. That's a memory access for sure. Then you do it all again, and you do this like three or four times. So that doesn't parallelize well. So the answer is indexing doesn't parallelize well.
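
The lookup just described can be sketched in a few lines of Python. This is a toy two-level tree, not code from Ingres or any real system; the `Node` class, the `lookup` function, and the salary figures are assumptions for illustration. The point is that each step needs the pointer produced by the previous step, so the steps cannot run side by side.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class Node:
    keys: list
    children: list = field(default_factory=list)  # empty for leaf nodes
    values: list = field(default_factory=list)    # filled only in leaf nodes


def lookup(root, key):
    """Return (value, nodes_visited); value is None if the key is absent."""
    node, visited = root, 1
    while node.children:
        # Find the divider around the key, then follow the pointer.
        # The next node is unknown until this step finishes.
        node = node.children[bisect_right(node.keys, key)]
        visited += 1
    if key in node.keys:
        return node.values[node.keys.index(key)], visited
    return None, visited


# A tiny tree: one root with a single divider and two leaves.
left = Node(keys=["Alice", "Mike"], values=[100, 200])
right = Node(keys=["Ryan", "Zoe"], values=[300, 400])
root = Node(keys=["Ryan"], children=[left, right])
```

Calling `lookup(root, "Ryan")` walks root, then leaf, one dependent memory access per level.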

1255.203 - 1274.814 Ryan Peterman

You mentioned B-trees. When you first implemented that first version of Ingres, did you write all of that by hand? Because I imagine there's probably not some existing B-tree library or something. Yeah, the original version of Ingres was all written from scratch. What was the hardest part of that implementation?

1276.998 - 1277.879 Mike Stonebraker

Query optimizer.

1278.64 - 1279.923 Ryan Peterman

And why was that hard?

1280.343 - 1296 Mike Stonebraker

It's tough. It's just algorithmically difficult. Still, if you ask most any senior database programmer what's the hardest part, they'll still say the optimizer.

1296.706 - 1321.16 Ryan Peterman

MapReduce came out at some point in the early 2000s, and it kind of took the data world by storm. People were really impressed by it. They thought Google really knows what they're doing. This is the best thing since sliced bread. But it seems like when I look at the literature and what you thought at the time, you kind of disagreed heavily. Why did you disagree so much with MapReduce?

1322.302 - 1324.485 Ryan Peterman

Well, I think...

1324.465 - 1358.286 Mike Stonebraker

there were a lot of not very enlightened people who said, Google is really smart. They must know what they're doing. And so we'll do whatever they say. And so they would engage in Hadoop or engage with Hadoop. But Hadoop is ridiculously inefficient. And so at the time, others, Dave DeWitt and others who were involved in our 2011 paper.

Chapter 8: How did Ingres compete with Oracle's database offerings?

2674.386 - 2711.966 Mike Stonebraker

And if you enhance them with RAG and all the tricks, it goes to 10%. And if you give as a prompt the from clause, in other words, all the actual tables that need to be accessed, and all the actual join clauses that need to be joined, then accuracy goes to about 35%. So the definition of this stuff is not ready for prime time and not going to be for a while, if ever. So what's the difference?

2713.167 - 2750.541 Mike Stonebraker

Number one, LLMs are trained on the pile. Data warehouse data is not in the pile. And there's an adage that if you haven't seen the data a couple of times before, you have no chance of regurgitating it. That's number one. Number two, query complexity on Spider and BIRD is maybe 10 to 20 lines of SQL. In real-world data warehouses, it's 100 lines of SQL. Complexity is bigger.

2751.601 - 2791.547 Mike Stonebraker

Number three, the schema in Spider and BIRD is clean. The table names are mnemonic, the column names are mnemonic, and there's no duplication. In data warehouses, people have materialized views all the time. It means there's redundancy. And column names are often underscore, Z, underscore, blah. And so they're not mnemonic. That makes it a lot harder. And then they also have idiosyncratic data.

2791.662 - 2834.391 Mike Stonebraker

So J term is a popular thing at MIT. It's a one-month term in January. Not unique to MIT, but not very popular. So not in the pile. Idiosyncratic data, simple queries, schema is a mess, make it not work. And those are true of every data warehouse I know of. And so I think the technology simply doesn't work and isn't going to work anytime soon. So what do you do?

2836.415 - 2868.514 Mike Stonebraker

Well, first of all, we published our benchmark. It's a thing called BEAVER, which is an anonymized and abstracted version of these four data warehouses. And so if you think you're really good at doing text-to-SQL, try a real benchmark, not a fake one. So number two, borrowing from what I just said, if you don't have all the join terms and you don't have the from clause, you're toast.
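
One way to picture "giving the from clause and the join terms as a prompt" is the small Python helper below. It is a hypothetical sketch, not the setup used in the benchmark discussed here; the function name, prompt wording, and example tables are all assumptions, and no particular LLM API is implied.

```python
def build_text_to_sql_prompt(question, from_tables, join_terms):
    """Assemble a prompt that hands the model the tables and joins up front.

    All names and wording here are illustrative assumptions.
    """
    lines = [
        "Write one SQL query that answers the question.",
        f"Question: {question}",
        "Use exactly this FROM clause:",
        "FROM " + ", ".join(from_tables),
        "Use these join conditions:",
    ]
    lines += [f"- {term}" for term in join_terms]
    return "\n".join(lines)


prompt = build_text_to_sql_prompt(
    question="How many students enrolled in J term courses?",
    from_tables=["students", "enrollments"],
    join_terms=["students.id = enrollments.student_id"],
)
```

The model still has to write the select list, filters, and aggregation, but it no longer has to guess which tables exist or how they connect.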

2870.657 - 2897.708 Mike Stonebraker

What's more, if you don't break down the query into simpler pieces, you're toast. So that says to me that you want to give your retrieval system simpler pieces, which include the from clause and include join terms. That's number one. Number two,

2898.312 - 2930.842 Mike Stonebraker

The minute you want to talk to two different structured databases, like your data warehouse and your CRM system, then it's pretty clear to me that doing a structured data join using an LLM is a bad idea. It's just you're much better off leaving them as tables and doing a join in SQL. So our point of view is we are trying out turning everything into tables.
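
The idea of leaving both sources as tables and joining them in SQL can be sketched with Python's built-in `sqlite3` module. The two tables stand in for a data warehouse and a CRM system; their names, columns, and rows are invented for the example and do not come from any system mentioned in the episode.

```python
import sqlite3


def join_warehouse_and_crm():
    """Join a stand-in warehouse table with a stand-in CRM table in plain SQL."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE warehouse_orders (customer_id INTEGER, amount REAL);
        CREATE TABLE crm_contacts (customer_id INTEGER, name TEXT);
        INSERT INTO warehouse_orders VALUES (1, 120.0), (1, 30.0), (2, 75.0);
        INSERT INTO crm_contacts VALUES (1, 'Acme'), (2, 'Globex');
        """
    )
    # The database engine does the join and the aggregation;
    # nothing here is left to a language model.
    rows = conn.execute(
        """
        SELECT c.name, SUM(o.amount)
        FROM crm_contacts AS c
        JOIN warehouse_orders AS o ON o.customer_id = c.customer_id
        GROUP BY c.name
        ORDER BY c.name
        """
    ).fetchall()
    conn.close()
    return rows
```

In a setup like this, a language model would at most help produce the query text, while the join itself stays deterministic.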

2931.924 - 2965.99 Mike Stonebraker

We're working with the Department of Transportation in the city of Munich, Germany. And they have six people full time who are answering citizens' complaints and queries, which are of the form, how come I don't have enough time to cross this intersection next to my house before the light turns? All kinds of stuff. How come the trolley doesn't stop for enough time for me to get on the trolley?

2966.911 - 3005.808 Mike Stonebraker

How come the trolley doesn't come more than once an hour? I mean, it's all this stuff. Their database is the trolley schedule, that's SQL. The light sequencing, that's SQL. The intersections, that's CAD. The German federal regulations for this stuff, that's text. City of Munich regulations for this stuff, which is text. So you've got to join SQL, SQL, CAD, text, and text.
