Dario Amodei
Podcast Appearances
This kind of idea that we came up with, our brilliant research team came up with, is this idea of giving Claude, which is our generative AI tool, something we call a constitution.
It's called constitutional AI.
The way you used to train these models to make sure, hey, they're not sort of saying nasty things or they're not replicating bad inherent biases, was you would do this technique called reinforcement learning from human feedback, which we also kind of co-worked on when we were at OpenAI.
And reinforcement learning from human feedback is just essentially a way of giving the model grades.
It's like A plus, you said something nice.
D minus, you said something mean.
And that was reasonably effective in changing some behavior of the models.
But what we found was that a lot of subtler things about how models respond, react, things that they're sort of deeply believing are much harder to train out in those sort of individual cases.
And so we had this idea of giving the model kind of a broader constitution in the way that you would in a society to basically say, what are ways of behaving and engaging that are good for humans, right, for people, and that don't perpetuate some of the problematic things that might be in training data.
I want people to feel like Claude is someone who is, you know, interested in their well-being.
So not just trying to say things that please them, but like genuinely like cares about their life going well insofar as whatever their conception of that is, and isn't going to deceive them or manipulate them.
Anthropic is rejecting an ultimatum from the Pentagon to lift the company's AI safeguards or risk being blacklisted.
Early on, it was clear to us that this model was going to be meaningfully better at cybersecurity capabilities.
There's a kind of accelerating exponential, but along that exponential, there are points of significance.
Claude Mythos Preview is a particularly big jump along that point.
We haven't trained it specifically to be good at cyber.
We trained it to be good at code, but as a side effect of being good at code, it's also good at cyber.
The model that we're experimenting with.