Marco Arment
He says: "Overall, we found no evidence of formal reasoning in language models, including open-source models like Llama, Phi, Gemma, and Mistral, and the leading closed-source models like the recent OpenAI GPT-4o and o1 series. Their behavior is better explained by sophisticated pattern matching. So fragile, in fact, that changing names can alter results by up to 10%."
This is quoting from the paper: "The performance of all models declines when only the numerical values in the question are altered." They're asking math problems, like word math problems, and if you change the names of the kids in the word problem, it gets the answers wrong. And sometimes changing the numbers does it: change a number from, like, one to five or whatever, and it will get the answer right with one and wrong with five.
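The kind of perturbation test being described can be sketched in a few lines. This is a hypothetical toy, not the paper's actual benchmark harness: one word-problem template where only surface details (names, numbers) vary, a symbolic ground truth that depends only on the numbers, and a stand-in "pattern matcher" that has memorized exactly one phrasing. A real reasoner would be correct on every variant; the memorizer is correct only on the variant it has seen.

```python
# Toy sketch of a GSM-Symbolic-style perturbation test (illustrative,
# not the paper's actual evaluation code). We vary only the name and
# the numbers in one word-problem template and check each solver's
# answer against the symbolic ground truth.
import itertools

TEMPLATE = ("{name} has {a} apples and buys {b} more. "
            "How many apples does {name} have now?")

def ground_truth(a, b):
    # Actual reasoning is invariant to the name: the answer
    # depends only on the numbers.
    return a + b

# A toy "pattern matcher" that has memorized one exact phrasing.
MEMORIZED = {TEMPLATE.format(name="Billy", a=1, b=2): 3}

def pattern_matcher(question):
    # Answers correctly only on a verbatim string match;
    # any perturbed variant falls through to None (no answer).
    return MEMORIZED.get(question)

names = ["Billy", "Timmy"]
numbers = [(1, 2), (4, 5)]
results = []
for name, (a, b) in itertools.product(names, numbers):
    q = TEMPLATE.format(name=name, a=a, b=b)
    results.append(pattern_matcher(q) == ground_truth(a, b))

print(results)  # only the memorized Billy/1/2 variant is correct
```

The accuracy gap between the original template and its perturbed variants is the signal: a solver whose correctness collapses when only names or numbers change is matching surface patterns, not doing the arithmetic.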
Back to the thread on X: "We can scale data, parameters, and compute, or use better training data for Phi-4, Llama-4, and GPT-5. But we believe this will result in better pattern-matchers, not necessarily better reasoners." And there's more on the same paper from Gary Marcus; we'll link to his blog post in the show notes as well. I think anybody who knows anything about how these work would have said, "Oh, of course it's not doing reasoning, it's just, you know..."
Speaking of spicy, Marco loves the spicy memes. "Spicy autocomplete" is one of the things people call LLMs: it's much more like compressing textual data and searching it than it is any kind of reasoning. That's how they work on the inside. But you can't just assume, because that's how... well, "everyone knows that LLMs don't think."
Look at how they work on the inside. That's not thinking, right? In scientific endeavors, even if it's something that you think, quote-unquote, everybody knows, okay, then prove it. And how do you prove it? Devise a way to test for the thing that you think may or may not be true, run the test, and publish a scientific paper about it. That's how this works. And even for things that are like...
boring, like, you know, making a paper about something where "oh, everybody knows that, it's common sense." Well, common sense is not proof. You have to actually test the idea. You need an idea that is falsifiable, and then you need to test it, and then people can argue: did they test what they think they were testing? Can I do a better paper? This is the scientific process.
So I love seeing this, because it is something that people talk about, that I've certainly talked about: like, "Oh, well, everyone knows that they're not really thinking, they don't have any kind of reasoning," or whatever. But you can't just make that assumption.
You have to actually test it, and you have to come up with a way that you think correctly tests for the thing you think you're testing for. And I'm sure there will be follow-up papers saying, "Well, actually, this paper didn't quite get at the heart of the problem," or whatever. So you can look at it.
It's very readable if you just look at the examples and the things that they did: like, you know, giving it word problems and saying, "OK, but if I change the boy's name from Billy to Timmy, now it gets it wrong." That is probably a good sign that it is not logically reasoning about this math problem, but is instead, you know, spicy autocomplete.
And because it is just pattern matching and doesn't understand the significance of any of these things, changing the name means a different pattern match, and it gets the wrong answer, because these things have no idea what math is. That's just not the way they work internally.