Trenton Bricken
Podcast Appearances
I'm pretty sure there's nothing in the envelope.
I think Anthropic did a survey of like a whole bunch of people and put that into its constitutional data.
But yeah, I mean, there's a lot more to be done here.
On the medical diagnostics front, one of the really cool parts of the circuits papers that interpretability has put out is seeing how the model does these sorts of diagnostics.
And so you present it with... there's this specific complication in pregnancy that I'm going to mispronounce.
It presents a number of symptoms that are hard to diagnose, and you basically are like, human, we're in the emergency room.
Sorry, sorry, like "Human:", as in the human prompt is: we're in the emergency room, and a woman 20 weeks into gestation is experiencing, like, these three symptoms.
Like, what is the... you can only ask about one symptom.
What is it?
And then you can see the circuit for the model and how it reasons.
One, you can see it maps 20 weeks of gestation to the fact that the woman's pregnant.
You never explicitly said that.
And then you can see it extract each of these different symptoms early on in the circuit, map all of them to this specific medical case, which is the correct answer here that we were going for, and then project that out to all of the different possible other symptoms that weren't mentioned, and then have it decide to ask about one of those.
And so it's pretty cool to see this clean medical understanding of cause and effect inside the circuit.
I think people are still sleeping on the circuits work that came out.
If anything, because it's just kind of hard to wrap your head around, or we're, like, still getting used to the fact that you can even get features for a single layer.
Yeah.
Like, in another case, there's this poetry example, and by the end of the first sentence the model already knows what it wants to write in the poem at the end of the second sentence, and it will, like, backfill and then plan out the whole thing.
Yeah.