So these are features that APIs are shipping, like prompt caching and pre-filling, because you can drive prices down and you can make APIs much faster. If you run a business and you know you're going to keep passing the same initial content to Claude's API, you can load that into the Anthropic API and always keep it there.
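A minimal sketch of that prompt-caching pattern with Anthropic's Python SDK (an editorial illustration, not from the episode); the model id, the shared document, and max_tokens are placeholder assumptions to check against Anthropic's current docs:

```python
# Sketch: mark a long, reused prefix as cacheable so repeated requests that
# share it can reuse the provider-side cache instead of re-prefilling it.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SHARED_CONTEXT = "...your company's handbook, codebase, or other reused context..."

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model id
    max_tokens=512,
    system=[
        {"type": "text", "text": "You answer questions about the attached material."},
        {
            "type": "text",
            "text": LONG_SHARED_CONTEXT,
            # Marks this block as cacheable across requests with the same prefix.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "What does section 3 say about refunds?"}],
)
print(response.content[0].text)
```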
But it's very different when we get to the reasoning models, which we talked about, we showed this example earlier and read some of that kind of mumbling stuff. And what happens is that the output context length is so much higher.
And I mean, I learned a lot about this from Dylan's work, which is essentially that as the output length gets higher, you're getting this quadratic scaling in terms of memory used. And then on the GPUs that we have, effectively, you're going to run out of memory, and they're all trying to serve multiple requests at once.
So they're doing this batch processing where not all of the prompts are exactly the same, really complex handling. And then as context length gets longer, there's this, I think you call it critical batch size, where your ability to serve, so how much you can parallelize your inference, plummets because of this long context.
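To make that batch-size point concrete, here is a back-of-envelope sketch (not from the episode) of how a fixed KV-cache memory budget caps how many requests you can serve in parallel as context grows; the Llama-3-70B-like shape (80 layers, 8 KV heads via GQA, head dim 128), 16-bit KV, and the ~200 GiB free-HBM budget are all assumptions, and real serving stacks change the constants but not the trend:

```python
# How many concurrent sequences fit in a fixed KV-cache budget as context grows.
def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    # Each layer stores one K and one V vector per KV head for every token.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

KV_PER_TOKEN = kv_bytes_per_token()     # ~0.31 MiB per token per sequence
FREE_HBM_BYTES = 200 * 1024**3          # assumed HBM left over after weights

for context_len in (2_000, 8_000, 32_000, 128_000):
    per_sequence = KV_PER_TOKEN * context_len
    max_batch = FREE_HBM_BYTES // per_sequence   # concurrent sequences that fit
    print(f"{context_len:>7} tokens -> {per_sequence / 1024**3:6.1f} GiB/seq, "
          f"max batch ~{max_batch}")
```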
So your memory usage is going way up with these reasoning models, and you still have a lot of users. So effectively, the cost to serve multiplies by a ton.
So much memory goes to your KV cache that you end up not being able to run a certain number of... you know, your sequence length is capped, or the number of users, let's say, that the model can serve. So this is showing, for a 405B model at batch size 64, Llama 3.1 405B. Yeah, and batch size is crucial, too. Essentially, you want to have a higher batch size to parallelize your throughput.
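And roughly the same arithmetic for the Llama 3.1 405B, batch-64 setup being described here (a sketch: the layer and head counts are the published architecture numbers, while the 16-bit KV and the single 8x H100 node comparison are assumptions):

```python
# Rough KV-cache totals for a Llama-3.1-405B-shaped model at batch size 64.
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 126, 8, 128, 2
BATCH = 64

kv_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES   # ~0.49 MiB
for seq_len in (4_096, 16_384, 65_536):
    total = kv_per_token * seq_len * BATCH
    print(f"seq {seq_len:>6}: {total / 1024**3:7.1f} GiB of KV cache "
          f"(vs ~640 GiB HBM on one 8x H100 node, before weights)")
```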
Let's go into DeepSeek again. So we're in the post-DeepSeek R1 time, I think. And there are two sides to this market watching how hard it is to serve it. On one side, we're going to talk about DeepSeek themselves. They now have a chat app that got to number one on the App Store. Disclaimer: number one on the App Store is measured by velocity.