Dwarkesh Patel
I think last time we were discussing on the order of 10,000 people.
So basically everybody in the world, or almost everybody in the world, or the variance we see between different humans today, was latent in this group, which sort of seems... And I get your point that if you just stack up different things across the genome, the stacking really has a big effect.
But it's interesting that we have so many different groups in the world today, and all that diversity comes from a very small population size.
Crusoe has an amazing ML infra team that keeps finding clever ways to squeeze more performance out of their hardware.
For example, tokenization has become a real bottleneck for agentic workloads.
Agentic prompts are often extremely long.
They tend to have high KV cache hit rates, which shrinks the GPU's prefill work.
This means that the tokenization step, which is traditionally sequential, is a much larger fraction of time-to-first-token.
To solve this, Crusoe built FastTokens, an open-source Rust-based tokenizer which parallelizes things in order to take advantage of all the cores on modern CPUs.
Crusoe had to get creative here because the naive approach doesn't work.
For example, for pre-tokenization, you can't just split your text into chunks and run regex, because you'd end up with issues whenever a word straddled the split.
Crusoe solved this by giving each thread an authority zone, plus the ability to read one kilobyte past its own edges.
This one-kilobyte buffer guarantees that you won't misprocess a token, and the authority zone guarantees that you won't end up with duplicates.
No cross-thread coordination required.
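To make the authority-zone idea concrete, here is a minimal sketch in Python. It is not FastTokens' actual code (which is written in Rust); the names `PATTERN`, `MARGIN`, `scan_zone`, and `parallel_pretokenize` are hypothetical, the regex is a simplified stand-in for a real pre-tokenization pattern, and the margin is counted in characters rather than bytes. It also assumes no single token is longer than the margin. Each worker scans its zone plus the overlap on both sides, but emits only the tokens that start inside its own zone.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Simplified stand-in pattern (an assumption, not a real tokenizer's regex):
# runs of word characters, runs of whitespace, or runs of punctuation.
PATTERN = re.compile(r"\w+|\s+|[^\w\s]+")
MARGIN = 1024  # overlap read past each zone edge (characters here, not bytes)


def scan_zone(text, start, end, margin=MARGIN):
    """Return the tokens whose start offset lies in [start, end)."""
    # Start scanning before the zone so the matcher is already aligned with
    # true token boundaries by the time it reaches `start`.
    scan_from = max(0, start - margin)
    # Read past the zone so a token straddling `end` is captured whole.
    scan_to = min(len(text), end + margin)
    return [
        m.group()
        for m in PATTERN.finditer(text, scan_from, scan_to)
        # Ownership rule: a token belongs to the zone it starts in, so
        # neighbouring workers never emit the same token twice.
        if start <= m.start() < end
    ]


def parallel_pretokenize(text, workers=4, margin=MARGIN):
    """Split `text` into zones, scan them independently, and concatenate."""
    if not text:
        return []
    size = -(-len(text) // workers)  # ceiling division
    zones = [(s, min(s + size, len(text))) for s in range(0, len(text), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves zone order, so no merging or locking is needed.
        parts = pool.map(lambda z: scan_zone(text, z[0], z[1], margin), zones)
    return [tok for part in parts for tok in part]
```

Under those assumptions, `parallel_pretokenize(text)` should return the same list as `PATTERN.findall(text)` for any worker count. Python threads will not actually speed up regex work because of the global interpreter lock; the sketch is only meant to show why the overlap and the ownership rule remove the need for coordination.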
Crusoe combined this optimization with a handful of other smart tweaks in order to get up to 40% faster time-to-first-token on real production workloads.