Transcription

Transcript generated automatically by AI and may contain errors.

Chapter 1: What are the first impressions of GPT-5.4?

0.031 - 25.965 Nathaniel Whittemore

Today on the AI Daily Brief, GPT-5.4 is here and these are both the first impressions from the broader world as well as my first test results. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. All right, friends, quick announcements before we dive in. First of all, thank you to today's sponsors, KPMG, AIUC, Blitzy, and PromptQL.

26.485 - 45.009 Nathaniel Whittemore

To get an ad-free version of the show, go to patreon.com slash aiDailyBrief, or you can subscribe on Apple Podcasts. If you are interested in sponsoring the show, send us a note at sponsors at aiDailyBrief.ai. Lastly, as often happens, when we have a big new model release, no headlines today. We are just going to spend all of our time on this exciting new model.

45.429 - 47.672 Nathaniel Whittemore

So without any further ado, let's dive in.

49.002 - 63.08 Unknown

After a couple of weeks where mostly we've been talking about big macro issues like the Pentagon and Anthropic and all of that sort of thing, we finally have the cool fresh breeze of a new exciting model to test. And this one indeed is pretty exciting.

63.881 - 77.558 Unknown

Ethan Mollick tweeted, I think we've been through enough release cycles for models at this point to say that the latest model from OpenAI or Anthropic or Google is generally going to be the best model in the world upon release, with some jagged edges, until the next release by one of the big three.

77.842 - 90.36 Unknown

Now with that background, a different way to look at where we've been is that it's simply been OpenAI's turn. However, the expectations coming into GPT-5.4 were a little bit higher than they might have been for some of OpenAI's more recent releases.

91.161 - 109.569 Unknown

Ever since the release of GPT-5, all the big model providers got the memo that trying to promise too much in each update, rather than just being very incremental, was a pretty scary proposition. That's what got us the 5.1, 5.2, 5.3, now 5.4 kind of paradigm. But of course, it's not just OpenAI doing that. Google and Anthropic are both on that same plan as well.

110.31 - 123.897 Unknown

And yet, even with that, 5.4 has had a little bit more hype and anticipation around it than some of the previous iterative models that we've gotten more recently. This was theoretically supposed to be the big outcome of OpenAI's Code Red, which was launched back in December.

Chapter 2: How does GPT-5.4 compare to previous models like GPT-5.3?

123.877 - 137.532 Unknown

And what's more, the buzz for the last week or week and a half or so has been that this one was really meaningful. Enough so that it almost felt to me like some of the more recent leaks to publications like The Information were almost trying to tamp down on expectations.

137.68 - 155.445 Unknown

To take one example, rumors had been flying that there was a 2 million token context window, whereas The Information's reporting from the last couple of days suggested it was just 1 million. Seemed to me to be a little bit of expectation setting through leaks. In any case, on Thursday afternoon we actually got the model, and the initial buzz was strong.

155.987 - 169.999 Unknown

Ben Hylak wrote, I've been using GPT-5.4 for the past few weeks. In a sea of endless model drops and benchmark maxing, this model is the first in a long time to be worth your time to try. Honestly, didn't expect OpenAI to pull this off.

170.536 - 185.203 Unknown

So let's talk first about how OpenAI frames things, look at some of the early reactions in the community, and then we'll walk through a more comprehensive case study with a project that I recently did to put the new capabilities through the wringer. Now, one of the interesting things about this week is that this was not the first new OpenAI model we got.

185.644 - 199.447 Unknown

Just a couple of days ago, we got GPT-5.3 Instant, although OpenAI started promising almost immediately that 5.4 was coming. 5.3 Instant was, as we've talked about, a speed and personality play. The announcement tweet called it more accurate, less cringe.

200.209 - 214.752 Unknown

This actually was part of the inspiration for our episode about what's going to actually matter in consumer AI, as this was so clearly aimed at that default sort of experience that the average ChatGPT user is going to have because they're not optimizing the model selector for what they think they want.

214.732 - 228.992 Unknown

GPT-5.4, although still having a lot of offshoots, does feel like they're trying to bring together their models under a coherent banner. They write, GPT-5.4 brings together the best of our recent advances in reasoning, coding, and agentic workflows into a single frontier model.

229.633 - 247.184 Unknown

It incorporates industry-leading coding capabilities of GPT-5.3 Codex, while improving how the model works across tools, software environments, and professional tasks involving spreadsheets, presentations, and documents. The result is a model that gets complex real work done accurately, effectively, and efficiently, delivering what you ask for with less back and forth.

247.925 - 255.623 Unknown

In other words, if 5.3 Instant was for the personal use cases, GPT-5.4 was, as the subheader says, designed for professional work.

Chapter 3: What are the significant features of GPT-5.4 for professional tasks?

255.738 - 275.232 Unknown

As The Information had it in that leak, we do indeed get a 1 million token context window, which, among other things, improves 5.4's ability on tasks that require longer thinking. The quotes from early testers that they included were raving, even more than the ones that they usually include. Brendan Foody, the CEO at Mercor, writes, "GPT-5.4 is the best model we've ever tried.

275.773 - 290.333 Unknown

It's now top of the leaderboard on our Apex Agents benchmark, which measures model performance for professional services work." It excels at creating long-horizon deliverables such as slide decks, financial models, and legal analysis, delivering top performance while running faster and at a lower cost than competitive frontier models.

291.155 - 310.468 Unknown

Indeed, these professional tasks are a big focus of OpenAI's announcement blog. They compare GPT-5.4's outputs to GPT-5.2's on spreadsheets, documents, and presentations, all with significant upgrades. And if overall knowledge work is what OpenAI chose to focus on, there was also a big emphasis on computer use capabilities.

311.289 - 323.106 Unknown

We are, of course, living in an OpenClaw world, and so it makes sense that this is becoming increasingly important. Computer use is actually listed above coding, perhaps because the jump from 5.3 Codex to 5.4 isn't all that big.

323.086 - 337.95 Unknown

Indeed, they basically said that 5.4 is the integration of 5.3 Codex with other improved aspects of the model, meaning it seems to me that their latest round of big coding innovations was embedded in 5.3 Codex itself. There's a sub-theme running throughout the announcement about efficiency.

338.31 - 357.362 Unknown

In the first part of the announcement post, they write that GPT-5.4 is our most token-efficient reasoning model, using significantly fewer tokens to solve problems when compared to GPT-5.2, translating to reduced token usage and faster speeds. In the coding section, they also talk about fast mode in Codex, which they say delivers up to 1.5x faster token velocity.

357.342 - 373.179 Unknown

They write, It's the same model and same intelligence, just faster. That means users can move through coding tasks, iteration, and debugging while staying in flow. The new efficiency gains also show up in tool search. They write, Previously, when a model was given tools, all tool definitions were included in the prompt up front.

373.599 - 389.968 Unknown

For systems with many tools, this could add thousands or even tens of thousands of tokens to every request, increasing cost, slowing responses, and crowding the context with information the model might never use. With tool search, GPT-5.4 instead receives a lightweight list of available tools along with a tool search capability.

Chapter 4: How does GPT-5.4 improve coding efficiency?

390.449 - 407.211 Unknown

When the model needs to use a tool, it can look up that tool's definition and append it to the conversation at that moment. This approach, they say, dramatically reduces the number of tokens required. They evaluated 250 tasks from Scale's MCP Atlas and found that this new configuration had the same accuracy but reduced total token usage by 47%.
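The tool search idea described above can be sketched in a few lines. This is only an illustration under stated assumptions, not OpenAI's implementation: the `Tool` and `ToolRegistry` classes, the 50 made-up tools, and the word-count stand-in for tokens are all hypothetical. It compares sending every full definition up front with sending a lightweight list plus the one definition that gets looked up.

```python
from dataclasses import dataclass, field


@dataclass
class Tool:
    name: str
    summary: str      # one-line description shown in the lightweight list
    definition: str   # full definition, only loaded when the tool is needed


def approx_tokens(text: str) -> int:
    # Crude stand-in for a tokenizer: count whitespace-separated words.
    return len(text.split())


@dataclass
class ToolRegistry:
    tools: dict[str, Tool] = field(default_factory=dict)

    def register(self, tool: Tool) -> None:
        self.tools[tool.name] = tool

    def upfront_prompt(self) -> str:
        # Old approach: every full definition goes into every request.
        return "\n".join(t.definition for t in self.tools.values())

    def lightweight_list(self) -> str:
        # Tool-search approach: only names and one-line summaries up front.
        return "\n".join(f"{t.name}: {t.summary}" for t in self.tools.values())

    def search(self, name: str) -> str:
        # Look up a single full definition at the moment it is needed.
        return self.tools[name].definition


# Hypothetical registry of 50 tools with long definitions.
registry = ToolRegistry()
for i in range(50):
    registry.register(Tool(
        name=f"tool_{i}",
        summary=f"hypothetical tool number {i}",
        definition=f"tool_{i}: " + "parameter description " * 40,
    ))

upfront_cost = approx_tokens(registry.upfront_prompt())
lazy_cost = (approx_tokens(registry.lightweight_list())
             + approx_tokens(registry.search("tool_7")))
print(f"all definitions up front: {upfront_cost} words")
print(f"lightweight list + one lookup: {lazy_cost} words")
```

The gap in this toy setup depends entirely on the invented tool sizes, so it says nothing about the 47% figure quoted in the announcement; it only shows why deferring definitions shrinks the prompt.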

407.812 - 425.357 Unknown

That, combined with improved accuracy on tool calling, obviously makes it really appealing for agentic use cases. There are a few other parts of the announcement, including improved web search and improved steerability, but those are the big hits. And as people dug in, there were a few things that stood out. First of all, that efficiency is showing up in people's tests.

425.858 - 440.033 Unknown

Greg Kamradt, the president of ARC Prize, said that on ARC-AGI-2, they were seeing a consistent 20 percentage point lift versus 5.2 at the same price. Still, the three benchmarks that people were most discussing were around coding, computer use, and GDPval.

440.468 - 453.49 Unknown

On coding, people flagged that GPT-5.4 was only nominally better than 5.3 Codex on benchmarks like SWE-Bench Pro, but most people understood that that wasn't the core value proposition of this updated model. The computer use improvement got more of a discussion.

453.991 - 461.143 Unknown

This is maybe the most concerned I've ever seen people with computer use, and I think that's the difference between the pre- and post-OpenClaw world.

461.123 - 481.542 Unknown

Now that people have got all these Mac minis running around with their OpenClaw agents having pretty much unfettered access to them, how good the models are at using the computer becomes much more relevant in a day-to-day way, not just a theoretical way. Rahul Agrawal writes, GPT-5.4 is here and it can use a computer better than a human? OpenAI shipped GPT-5.4 on March 5th.

482.023 - 496.007 Unknown

The headline isn't the reasoning improvements. It's that this is their first general purpose model with native state-of-the-art computer use. It can operate websites and software autonomously, issue keyboard and mouse commands, write and execute code, and navigate full desktop environments.

495.987 - 518.738 Unknown

On OSWorld-Verified, it hit 75%, which is above human-level performance at 72.4%, and a massive jump from GPT-5.2's 47.3%. That's not incremental, that's a step change. When agents can reliably navigate desktops, the bottleneck on automation shifts from can the model do it to do you trust it enough to let it? That's the question nobody has a good answer to yet.

519.308 - 541.461 Unknown

Jamie Cuffe from Pace wrote an X article about this specifically. He called it, We stress-tested GPT-5.4 on the hardest UI on the internet, and writes, People don't realize how good AI computer use has actually gotten, until they see it tackle the hardest UIs in existence, legacy insurance portals. At Pace, we build AI agents for insurance workflows like submission intake and first notice of loss.

Chapter 5: What are the community reactions to GPT-5.4's performance?

619.235 - 635.209 Unknown

That's just wins. When it comes to ties, that number rises to 82-83%. Ethan Mollick pointed out what this means in terms of time savings. He wrote, given the GDPval benchmark for GPT-5.4, the new model ties or beats humans as judged by other experts at professional tasks 82% of the time.

635.77 - 650.732 Unknown

If you give a seven-hour task to AI, even with failure rates and the need to check results, you'd save four hours and 38 minutes on average. And in addition to the general performance increases across the GDPval set, it's very clear that OpenAI is aggressively going after certain industries even more.
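To put the seven-hour example in proportion, here is a small arithmetic check. It uses only the two figures quoted above (a seven-hour task and four hours 38 minutes saved); how that saving was derived from the 82% figure is not stated in the episode, so this sketch does not try to reproduce that calculation. The variable names are illustrative.

```python
# Figures quoted in the episode.
task_minutes = 7 * 60          # seven-hour task
saved_minutes = 4 * 60 + 38    # four hours and 38 minutes saved on average

# What is left for the human: checking results, redoing failures, and so on.
remaining_minutes = task_minutes - saved_minutes
fraction_saved = saved_minutes / task_minutes

print(f"time saved: {saved_minutes} of {task_minutes} minutes "
      f"({fraction_saved:.1%})")
print(f"human time remaining: {remaining_minutes // 60}h "
      f"{remaining_minutes % 60}m")
```

In other words, the quoted saving is roughly two thirds of the task, leaving a little under two and a half hours of human time.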

651.493 - 672.84 Unknown

COO Brad Lightcap, for example, tweeted, The team worked extremely hard to make GPT-5.4 great for finance. It's much improved for financial modeling and analysis, integrates directly into Excel, and connects to Factiva, Daloopa, S&P Global, and many more. It does feel like a Codex moment is coming here. But what about the overall impressions outside of the individual benchmarks?

673.242 - 692.821 Unknown

Latent Space wrote, We've learned to take for granted that OpenAI is the smartest kid in the room, always reporting state-of-the-art evals, but this set of updates feels much more substantial and confident than any OpenAI launch in recent history. The Every vibe check summed it up: OpenAI is back. Three months ago, writes the team at Every, OpenAI was losing the agentic coding race.

693.402 - 709.25 Unknown

Claude Code had captured developers' hearts and Opus 4.5 was shipping at a level other models couldn't touch. Meanwhile, OpenAI's coding agent Codex felt like it was built for an older era of coding with AI: precise but rigid, powerful but personality-less, and not good with tools or able to run for long periods of time autonomously.

709.771 - 729.735 Unknown

OpenAI's latest model release, GPT-5.4, along with their other recent releases, GPT-5.3 Codex, GPT-5.3 Codex Spark, and the Codex Desktop app, shifts the competitive balance back towards OpenAI on the coding front. The new model produces plans that are thorough and technically precise and have a user focus and human feel that has been missing from OpenAI's previous coding models.

730.576 - 752.085 Unknown

In our testing, GPT-5.4 reviews code with more depth than GPT-5.3 Codex and has a noticeably more conversational voice. With a few tweaks, it became our preferred model to use in our OpenClaws, especially given that it is half the price of Opus 4.6. Even Kieran Klaassen, our diehard Claude Code devotee, is now reaching for GPT-5.4 daily since we started testing it a week ago.

752.251 - 767.289 Unknown

As ever, they say, there are trade-offs. GPT-5.4 has a tendency to expand the task well beyond what you ask for and to call tasks done before they're finished. It sometimes completed tasks in obviously wrong ways, then lied about it. The bigger story here is OpenAI's trajectory.

768.03 - 787.113 Unknown

From the Codex desktop app to GPT-5.3 Codex and to GPT-5.4, the company is iterating fast and many members of the team now use its tools and models daily for coding, a significant change from a few months ago. Couple of things that they said they liked about it. It did proactive research without being asked. It had a more human voice than previous codexes.

Chapter 6: How does GPT-5.4 enhance computer use capabilities?

854.514 - 874.959 Unknown

First model where I've experienced the true go-investigate-and-fix experience that feels robust. What needs work, basic front-end and UX taste, still loves a bullet point, latency, and stability. The next day she updated it and said she's, quote, starting to have loving feelings towards the model. Basically, it did some really difficult tech and data work really well without a lot of support.

875.741 - 892.946 Unknown

Matt Shumer, who you might remember from Something Big is Happening, calls GPT-5.4, in short, the best model in the world by far. On coding, he writes, coding capabilities are ridiculous. It's essentially flawless. Inside Codex, it's insanely reliable. Coding is essentially solved. There's not much more to say on this. It's just that good.

893.767 - 911.807 Unknown

Now, he did find weaknesses, including front-end taste, which is something that I'm going to get into in just a moment as well. Mark Tenenholtz from Perplexity also pointed out that while the model itself is great, the updates to the actual Codex CLI experience are really good as well. He called them the real hero. So much less friction than the previous approval system, he says.

916.765 - 927.658 Unknown

Agentic AI is powering a $3 trillion productivity revolution, and leaders are hitting a real decision point. Do you build your own AI agents, buy off the shelf, or borrow by partnering to scale faster?

928.219 - 943.617 Nathaniel Whittemore

KPMG's latest thought leadership paper, Agentic AI Untangled, Navigating the Build, Buy, or Borrow Decision, does a great job cutting through the noise with a practical framework to help you choose based on value, risk, and readiness, and how to scale agents with the right trust, governance, and orchestration foundation.

943.597 - 964.592 Unknown

Again, that's www.kpmg.us slash navigate. There's a new standard that I think is going to matter a lot for the enterprise AI agent space. It's called AIUC-1, and it bills itself as the world's first AI agent standard.

965.073 - 974.768 Unknown

It's designed to cover all the core enterprise risks, things like data and privacy, security, safety, reliability, accountability, and societal impact, all verified by a trusted third party.

974.748 - 987.205 Unknown

One of the reasons it's on my radar is that ElevenLabs, who you've heard me talk about before and is just an absolute juggernaut right now, just became the first voice agent to be certified against AIUC-1 and is launching a first-of-its-kind insurable AI agent.

987.666 - 1004.569 Unknown

What that means in practice is real-time guardrails that block unsafe responses and protect against manipulation, plus a full safety stack. This is the kind of thing that unlocks enterprise adoption. When a company building on ElevenLabs can point to a third-party certification and say our agents are secure, safe, and verified, that changes the conversation.

Chapter 7: What challenges were faced during testing GPT-5.4?

1067.96 - 1077.351 Nathaniel Whittemore

But the bottleneck is always the same. The data isn't ready. It's scattered. It's messy. Definitions aren't clear. You're waiting on your data team or waiting on domain experts for clarification and confirmation.

1077.772 - 1085.081 Unknown

That's the bottleneck today's sponsor, PromptQL, is built to break. PromptQL is a trusted AI analyst for high-frequency decision-making.

1085.462 - 1091.171 Nathaniel Whittemore

It connects across warehouses, databases, SaaS, and internal APIs. No massive data prep or centralization required.

1091.752 - 1102.508 Unknown

It's built for multiplayer input. Teammates can jump into a thread, correct assumptions and nuance, flag edge cases. PromptQL turns everyday conversations into a shared context. And if something is ambiguous, it doesn't guess.

1102.989 - 1113.806 Nathaniel Whittemore

It escalates to the right expert, captures the correct logic, and gets it right next time. That's how it delivers trust and accuracy. Over time, PromptQL specializes to your business, like that veteran employee who just knows things.

1114.387 - 1134.306 Unknown

From simple what-is questions to complex what-if scenarios, you can model impact and stress test decisions before you commit, all through a simple natural language prompt. PromptQL, the trusted AI analyst for teams with shared context and messy data. So as we round the corner, let's actually shift into the test that I did.

1135.108 - 1151.398 Unknown

I wanted to do something that was actually kind of difficult, and that was also real. I've been building a bunch of agents recently with Claude Code, and the specific thing that I was interested in building with Codex and 5.4 was some type of experience to help people show off their agent building and agent orchestration skills.

1151.648 - 1168.594 Unknown

Right now, you know that we have a couple of different self-directed experiences for people to learn how to use tools like OpenClaw. And one of the things that I've been recognizing around that is that this new skillset of agent building and agent orchestration is going to be more and more in demand from all sorts of different types of organizations.

1168.574 - 1186.887 Unknown

However, it is really a new type of skill set. It's not easy to describe necessarily. It's not easy to show necessarily. And the people who are going to be hiring or contracting for it aren't necessarily going to be easily able to describe exactly what they're even looking for. They're just going to want people who can help them agentify things.

Chapter 8: What are the overall recommendations for using GPT-5.4?

1470.822 - 1488.819 Unknown

Now, I will say that once we homed in on the core idea, it was pretty smart and was able to stay focused on the things that made what we were trying to build different. But I was still having real problems trying to get it to go into "do the actual work" mode. Even when I tried to shift it into visualization mode, it kept trying to plan the visualization rather than just visualize it.

1489.42 - 1498.669 Unknown

Again, I finally had to say, why aren't you just designing it? Claude would have showed me five versions by now. It responded, fair hit. I stayed too long in abstraction when I should have started showing.

1498.649 - 1515.662 Unknown

And then you would think that at that point it would start doing the showing, but instead it said, here are five concrete interface directions for builder rep, all built around blah, blah, blah, blah, blah, blah, blah. I had to literally stop it and say, no, I'm not saying describe it. I'm saying go build the clickable prototype, which it finally did, but then had its own problem.

1516.303 - 1533.752 Unknown

It was just awful visually. And this is something that lots of other folks pointed out. Ben Davis, who works with Theo on his YouTube channel about AI, said, it is hilariously bad at UI stuff. Matt Shumer said front-end taste is far behind Opus 4.6 and Gemini 3.1 Pro. Why is this so hard to fix?

1534.222 - 1555.353 Unknown

I'm not trying to rip on it too much here, but it is just honestly staggering how bad and tasteless the UI design is in my little experience. I had to bring it back over to Claude and do the front end design there. Claude pulled no punches in critiquing what we had gotten from 5.4. The card backgrounds are muddy gradient blobs. The colors are dull and washed out. The typography has no hierarchy.

1555.533 - 1575.301 Unknown

The tags look cheap. The cards have no breathing room. The whole thing looks like a dark mode template from 2023. Brutal, but all very true. Still, after a while, we used Claude to get the design under control and we were off to the races inside the Codex CLI. And this is where things start to turn around and where you start to see why many folks are so excited about this new model.

1576.223 - 1595.528 Unknown

There were some not perfect things about the experience with Codex. It had a couple weird foibles, like it pushed me to use GPT 4.1, but I found similar weirdness like that with Claude Code as well. I confirmed that it wasn't just 5.4 in ChatGPT, but also in Codex that was not good at design. But there were some parts of the experience that were great, including compared to Claude Code.

1596.189 - 1612.831 Unknown

The thing that Mark Tenenholtz was talking about, with much less friction in the approval system, was absolutely true. There are so many fewer confirmations with Codex right now than with Claude Code, in ways that make the experience just massively, massively better. What's more, Codex did a much better job of letting you know what was going on as it was building.

1613.433 - 1628.843 Unknown

It actually has not just reasoning traces, but almost interstitial updates around what it's doing when it's doing a long-running task. This means that when it's doing something that takes 5 or 6 or 8 or 10 minutes, it's not just a total black box, you actually have a sense of where it is in its process as it's happening.
