The Neuron: AI Explained
AI Inference: Why Speed Matters More Than You Think (with SambaNova's Kwasi Ankomah)
07 Oct 2025
Full Episode
All right. Hello, and welcome to the Neuron podcast. Today, we're talking to Kwasi Ankomah. Kwasi is the lead AI architect at SambaNova Systems, where he specializes in agentic AI and solving the critical challenge of making AI models run fast enough for real-world production applications using SambaNova's RDU chip architecture.
So we thought he would be the perfect guest for the Neuron podcast. Just a quick FYI about SambaNova Systems: SambaNova builds custom chips, systems, and platforms that let organizations train and run large AI models more efficiently than with standard hardware.
Hi, Kwasi. Welcome to the show. How's it going? Hi, folks. How are you doing? Hi, Grant. How are you? I'm doing really well and super excited to talk to you folks about AI inference and agents. So, yeah, super excited. Awesome.
We're excited to have you here. It's an interesting time and sounds like you guys are doing some neat work.
Yeah, definitely. We've been seeing a big shift in the market. There was this huge focus on training, everyone was focused on how to train these large language models, and now the biggest bottleneck is inference, right? So how do we make inference fast? How do we make it scalable?
So we've been really focusing our architecture on speeding that up, making it more efficient, and delivering these solutions to our customers. And my team really focuses on the agentic side of things, which is what I'm super excited to get into, because it shows why inference matters: the number of model calls and the number of tokens keeps going up (see the sketch below).
And that's a really interesting area as well. So, yeah, that's where we're focusing at the moment.
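To make that concrete, here's a minimal sketch (not from the episode, and not SambaNova's API) of why agentic workloads multiply inference cost: each step of a plan-act loop is another model call, so per-call latency compounds across the whole task. The `call_llm` helper below is a hypothetical stand-in for any inference endpoint.

```python
import time

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM inference API (assumption, not a real endpoint)."""
    time.sleep(1.0)  # pretend each call costs ~1 second of generation latency
    return f"thought about: {prompt[:40]}"

def run_agent(task: str, max_steps: int = 8) -> None:
    """Toy plan-act loop: one inference call per agent step."""
    start = time.time()
    context = task
    for _ in range(max_steps):
        context = call_llm(context)  # latency compounds with every step
    elapsed = time.time() - start
    print(f"{max_steps} steps -> {elapsed:.1f}s end-to-end")

run_agent("book a flight and summarize the itinerary")
# With ~1s per call, 8 steps take ~8s end-to-end; halving per-call
# latency roughly halves the whole agent run.
```

That compounding is the point: a chatbot makes one call per user message, but an agent may make dozens per task, so per-call inference speed dominates the user-visible latency.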
Well, I've got to ask. Let's just clarify, very simply, before we get to agents, for our readers and listeners who use ChatGPT daily but maybe don't think about what's happening under the hood. When you type a prompt into ChatGPT or any other AI and hit enter, what actually happens? What is inference in plain English?
Yeah, so inference comes from the word "to infer." It's the model making a prediction of some sort. It takes your input and then does the thing that large language models do, which is predict the next token. That is the actual process of inference: the input goes in, it runs through the model, and we get an output.
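In code, that next-token process is just a loop of forward passes. Here's a minimal sketch using the Hugging Face `transformers` library; the `gpt2` checkpoint is an illustrative assumption, and any causal language model works the same way:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: the small public "gpt2" checkpoint, purely for illustration.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("Inference is", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):                      # generate 20 tokens, one at a time
        logits = model(input_ids).logits     # forward pass = one inference step
        next_id = logits[0, -1].argmax()     # greedy: pick the most likely next token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
```

Each loop iteration is one inference step: the model scores every token in its vocabulary and the most likely one gets appended. That per-token cost, repeated for every generated token, is exactly what fast inference hardware is trying to drive down.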