Andy Halliday
Podcast Appearances
And also, we now have very competent tools seamlessly interacting with the model and bringing in real-time web search data to augment the pre-training data, which ends at the cutoff.
So you don't really need to worry that you won't have as much recent information in the model if you use Gemini with its older cutoff, while maybe some other model has a more recent cutoff date.
That's receding in importance in my view.
So we got Gemini 3 today, and we'll be hearing much more about it in the coming days as the people who really have the time and attention to make direct comparisons with all the other major models start doing that work for us.
But yesterday, or maybe the day before, Grok 4.1 was released by xAI.
And it had already gone under the name Quasar Flux on LM Arena.
It had already gone to the number one ranking overall for user preference.
So now the question is, in LM Arena, will Gemini 3.0 bump past Grok 4.1?
Now, I don't use Grok at all.
And I've never understood why I would.
But here are some interesting points about Grok 4.1's release.
So it achieved the highest emotional intelligence score among tested systems, optimizing for personality traits like empathy and conversational tone.
So it has that going for it.
It reduced the hallucination rate from the prior model's 12% to 4% and cut factual errors by 66% compared to the prior version of Grok.
And it also saw a significant upgrade in creative writing tasks, ranking just behind ChatGPT 5.1 on the Creative Writing v3 benchmark.
So it's really a state-of-the-art model in many respects, and it was at the top of the LM Arena leaderboard for user preference, based on the responses people were getting in the kind of blind comparison that you get on the LM Arena.
Yeah, it was there at the top.
Somebody took a screenshot of it, I'm sure.