Andy Halliday
with a new technology that they've put forward called sparse attention.
So as you know, in transformer models, the attention mechanism is what makes it possible to throw in just about any kind of text, all these words, many of which are,
you know, in human language, not that important.
All the ands, thes, and ises, all those things, you know, those can be ignored to a large degree.
And the attention mechanism determines which tokens are actually important to the reasoning task at hand.
Well, sparse attention uses this concept of, you know, cutting back the total number of tokens that are necessary by selectively using the ones that have the most import to the inference process.
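To make that idea concrete, here's a minimal sketch of top-k sparse attention in NumPy. This is an illustrative toy, not DeepSeek's actual mechanism: each query attends only to its `top_k` highest-scoring keys, and everything else is masked out before the softmax, so the weighted sum runs over far fewer tokens.

```python
import numpy as np

def sparse_attention(q, k, v, top_k=4):
    """Toy top-k sparse attention: each query attends only to its
    top_k highest-scoring keys instead of the full sequence."""
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (n_q, n_k) raw scores
    # Keep only the top_k scores per query row; mask the rest to -inf
    kth = np.sort(scores, axis=-1)[:, -top_k][:, None]
    masked = np.where(scores >= kth, scores, -np.inf)
    # Softmax over the surviving entries (masked ones become weight 0)
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n, d = 8, 16
q, k, v = rng.standard_normal((3, n, d))
out = sparse_attention(q, k, v, top_k=4)
print(out.shape)  # (8, 16)
```

With `top_k=4` on a sequence of 8 tokens, each query only mixes 4 values instead of 8; at real context lengths that gap is where the computational savings come from.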
So the new sparse attention, DeepSeek Sparse Attention, as they call it, is an advanced attention mechanism that selectively prioritizes relevant input data in your prompt.
So it reduces computational overhead and maintains high accuracy across longer contexts.
Because obviously, if you can compress to the most relevant tokens using selective sparse attention, then you can put a much larger prompt through, which gives you a larger context delivered by the user. And then, if you can maintain accuracy and reasoning across that longer context, that's an improvement in your performance. So it allows the model to process large data sets efficiently, and it allows for larger contexts. But they're claiming that they beat Gemini 3.0 Pro, and they didn't beat Gemini 3.0 Pro Thinking.
Okay, so they're saying, oh, here, our thinking model, the speciality model, beats Gemini 3.0 Pro.
Well...
Yeah, okay, it does beat that marginally, but still, the state of the art is currently Gemini 3.0 Pro Thinking. Okay, so you give Gemini 3.0 Pro just a little more budget, not necessarily using this sparse attention mechanism that we know of, but you give it a little more budget for thinking, and it is clearly the winner that way.
But anyway, I thought it was very interesting, this sparse attention approach.
It's a concept that's being used in many ways, especially in the context of mixture-of-experts models, where the sparsity means you're going to activate only those areas of the model
that are relevant experts for the process of inference on the prompt that's been given.
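The mixture-of-experts flavor of sparsity can be sketched the same way. Again, this is a hypothetical toy, not any real model's routing code: a gating network scores every expert, but only the `top_k` experts actually run for a given input, so most of the model's parameters stay inactive.

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Toy mixture-of-experts layer: a gate scores all experts,
    but only the top_k experts are actually executed."""
    logits = x @ gate_w                      # (n_experts,) gating scores
    chosen = np.argsort(logits)[-top_k:]     # indices of the top_k experts
    gates = np.exp(logits[chosen])
    gates /= gates.sum()                     # renormalize over chosen experts
    # Only the selected experts compute; the rest stay inactive (sparse)
    return sum(g * experts[i](x) for g, i in zip(gates, chosen))

rng = np.random.default_rng(1)
d, n_experts = 8, 4
gate_w = rng.standard_normal((d, n_experts))
# Each "expert" is just a small linear map in this sketch
expert_ws = [rng.standard_normal((d, d)) for _ in range(n_experts)]
experts = [lambda x, w=w: x @ w for w in expert_ws]
x = rng.standard_normal(d)
y = moe_forward(x, gate_w, experts, top_k=2)
print(y.shape)  # (8,)
```

Here only 2 of the 4 experts do any work per input, which is the same efficiency idea as sparse attention, applied to model weights instead of tokens.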
So it's making things more efficient, generally speaking, and DeepSeek has moved that forward.
Not to cut down DeepSeek, they're really an amazing open-source model, being used widely in enterprise in the United States and around the world as the foundation model for any kind of fine-tuning and creation of a customized LLM operating in the enterprise context.