
Casey Liss

👤 Person
4566 total appearances

Podcast Appearances

Accidental Tech Podcast
624: Do Less Math in Computers

MoE splits the model into multiple quote-unquote experts and only activates the ones that are necessary. GPT-4 was an MoE model that was believed to have 16 experts with approximately 110 billion parameters each. DeepSeek's MLA (multi-head latent attention, that's the MLA) was an even bigger breakthrough. One of the biggest limitations on inference is the sheer amount of memory required.
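
A rough sketch of the routing idea in Python, with toy numbers; the expert count, dimensions, and gating function here are illustrative, not GPT-4's or DeepSeek's actual architecture:

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route one token through only the top-k of many 'experts'.

    x: (d,) token activation; gate_w: (d, n_experts) router weights;
    experts: list of callables, each standing in for a feed-forward net.
    All names and shapes are illustrative.
    """
    logits = x @ gate_w                      # router score per expert
    top_k = np.argsort(logits)[-k:]          # pick only the k best experts
    weights = np.exp(logits[top_k])
    weights /= weights.sum()                 # softmax over the chosen experts
    # Only k experts actually run; the rest are skipped, saving compute.
    return sum(w * experts[i](x) for w, i in zip(weights, top_k))

# Toy demo: 8 experts, each a random linear map; only 2 ever execute.
rng = np.random.default_rng(0)
d, n = 16, 8
experts = [(lambda W: (lambda x: W @ x))(rng.normal(size=(d, d))) for _ in range(n)]
gate_w = rng.normal(size=(d, n))
y = moe_forward(rng.normal(size=d), gate_w, experts, k=2)
```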

Accidental Tech Podcast
624: Do Less Math in Computers

You need to both load the model into memory and load the entire context window. Context windows are particularly expensive in terms of memory, as every token requires both a key and a corresponding value. DeepSeek's MLA makes it possible to compress the key-value store, dramatically decreasing memory usage during inference.
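
Back-of-the-envelope, the KV cache math looks like this; every configuration number below is made up for illustration, and the 16x compression ratio is just a stand-in, not DeepSeek's actual figure:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context, bytes_per=2):
    # Every token stores one key and one value vector in every layer,
    # hence the factor of 2; bytes_per=2 assumes fp16/bf16 storage.
    return 2 * layers * kv_heads * head_dim * context * bytes_per

# Hypothetical full-size cache: 60 layers, 32 KV heads, 128k-token context.
full = kv_cache_bytes(layers=60, kv_heads=32, head_dim=128, context=128_000)

# MLA-style compression stores a small shared latent per token per layer
# instead of full per-head K/V; modeled here as 1/16 the heads as a stand-in.
compressed = kv_cache_bytes(layers=60, kv_heads=2, head_dim=128, context=128_000)

print(f"full: {full/1e9:.1f} GB, compressed: {compressed/1e9:.1f} GB")
```

The point is that the cache scales linearly with context length, so at long contexts it can dwarf the weights themselves, and any compression of the per-token key/value storage pays off directly.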

Accidental Tech Podcast
624: Do Less Math in Computers

So with regard to costs, from the DeepSeek V3 paper, which we will link: note that the aforementioned costs include only the official training of DeepSeek V3, excluding the costs associated with prior research and ablation, is that right? Experiments on architectures, algorithms, or data.

Accidental Tech Podcast
624: Do Less Math in Computers

Indeed. So then there's distillation. This is models training models. Again, reading from Ben Thompson: distillation is a means of extracting understanding from another model. You can send inputs to the teacher model, record the outputs, and use that to train the student model. This is how you get models like GPT-4 Turbo from GPT-4.
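
The recipe in miniature might look like the following sketch, with tiny linear models standing in for the teacher and student; nothing here reflects OpenAI's or DeepSeek's actual pipeline, just the teacher-output-as-training-target idea:

```python
import numpy as np

# Frozen "teacher": a pre-trained model we can only query.
# "Student": a fresh model trained purely on the teacher's outputs.
rng = np.random.default_rng(1)
teacher_W = rng.normal(size=(4, 8))   # stands in for the big model's weights
student_W = np.zeros((4, 8))          # student starts from scratch

lr = 0.05
for step in range(2000):
    x = rng.normal(size=8)            # any input we can send to the teacher
    target = teacher_W @ x            # record the teacher's output
    pred = student_W @ x
    err = pred - target
    student_W -= lr * np.outer(err, x)  # nudge the student toward the teacher

print(np.abs(student_W - teacher_W).max())  # gap shrinks toward zero
```

With real LLMs the "outputs" are token distributions or sampled text rather than vectors, but the shape of the loop is the same: query, record, imitate.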

Accidental Tech Podcast
624: Do Less Math in Computers

Distillation is easier for a company to do on its own models, because they have full access, but you can still do distillation in a somewhat more unwieldy way via API, or even, if you get creative, via chat clients. Distillation obviously violates the terms of service of various models, but the only way to stop it is to actually cut off access via IP banning, rate limiting, etc.

Accidental Tech Podcast
624: Do Less Math in Computers

It's assumed to be widespread in terms of model training, and it's why there's an ever-increasing number of models converging on GPT-4o quality. This doesn't mean that we know for a fact that DeepSeek distilled GPT-4o or Claude, but frankly, it would be odd if they didn't.

Accidental Tech Podcast
624: Do Less Math in Computers

Yeah, so it turns out OpenAI, who by most measures stole the entirety of the world's knowledge in order to train their model, seems to be a little grumpy that somebody's stealing their knowledge to train their model. And I don't really have a lot of sympathy for them on this one, to be honest with you. Like, sorry, them's the breaks. If you're going to be a turd.

Accidental Tech Podcast
624: Do Less Math in Computers

Yep, pretty much. All right. So R1, R1-Zero, and reinforcement learning. R1 is a reasoning model, like OpenAI's o1. It has the ability to think through a problem, producing much higher quality results, particularly in areas like coding, math, and logic. Reinforcement learning is a technique where a machine learning model is given a bunch of data and a reward function.

Accidental Tech Podcast
624: Do Less Math in Computers

The classic example is AlphaGo, where DeepMind gave the model the rules of Go, with the reward function of winning the game, and then let the model figure everything else out on its own. This famously ended up working better than other, more human-guided techniques. LLMs to date, however, have relied on reinforcement learning with human feedback.
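
The "reward function plus trial and error" loop can be shown at toy scale. Here a five-armed bandit stands in for Go, and every number is made up; the agent is told nothing except the reward it receives, and figures out the best arm on its own:

```python
import random

random.seed(0)
true_payout = [0.1, 0.3, 0.9, 0.2, 0.5]   # hidden from the agent
estimates = [0.0] * 5                      # agent's learned value per arm
counts = [0] * 5

for step in range(5000):
    # Mostly exploit the best-looking arm, sometimes explore at random.
    if random.random() < 0.1:
        arm = random.randrange(5)
    else:
        arm = max(range(5), key=lambda a: estimates[a])
    # The only feedback: a reward drawn from the hidden payout rate.
    reward = 1.0 if random.random() < true_payout[arm] else 0.0
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean

print(max(range(5), key=lambda a: estimates[a]))  # the agent settles on arm 2
```

AlphaGo's search and networks are vastly more sophisticated, but the contrast the quote is drawing is exactly this: reward-only learning versus learning that leans on human-supplied guidance.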

Accidental Tech Podcast
624: Do Less Math in Computers

Humans are in the loop to help guide the model, navigate difficult choices where rewards weren't obvious, etc. RLHF, or reinforcement learning from human feedback, was the key innovation in transforming GPT-3 into ChatGPT, with well-formed paragraphs and answers that were concise and didn't trail off into gibberish. R1-Zero, however, drops the HF, the human feedback part.
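
Dropping the human feedback means the reward has to be computable by a program, which is why it works best on checkable domains like math and code. A hypothetical rule-based reward in that spirit might look like the sketch below; the tag names and point values are invented, and DeepSeek's actual reward design is richer:

```python
import re

def rule_based_reward(response: str, expected_answer: str) -> float:
    """Score a model response with no human in the loop.

    Two machine-verifiable checks, in the spirit of R1-Zero-style
    training: did the model follow the think/answer format, and is
    the final answer exactly correct?
    """
    reward = 0.0
    # Format reward: reasoning must appear inside <think>...</think>.
    if re.search(r"<think>.+</think>", response, re.DOTALL):
        reward += 0.5
    # Accuracy reward: the tagged final answer must match exactly.
    m = re.search(r"<answer>(.+?)</answer>", response, re.DOTALL)
    if m and m.group(1).strip() == expected_answer:
        reward += 1.0
    return reward

r = rule_based_reward("<think>2+2 is 4</think><answer>4</answer>", "4")
# 1.5: format reward plus accuracy reward
```

Because both checks are automatic, the training loop can run millions of times with no human rater anywhere, which is the whole point of dropping the HF.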
