Nathaniel Whittemore
i.e., the final output.
Only X. Never M, never T. Why?
Because T is how you figure out when the model is misbehaving.
If you train on T, you are training the AI to obfuscate its thinking and defeat T. You will rapidly lose your ability to know what is going on in exactly the ways you most need to know what's going on.
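One way to picture the "only X, never T" rule is a loss mask: the chain-of-thought tokens are simply excluded from the training objective, so no gradient ever rewards or punishes the thinking itself. The sketch below is a hypothetical illustration, not Anthropic's actual training code; the function name and inputs are invented for clarity.

```python
def outcome_only_loss(token_logps, is_thought):
    """Average negative log-likelihood over output (X) tokens only.

    Thought (T) tokens are masked out of the loss entirely, so the
    model is never optimized on what its chain of thought looks like.

    token_logps: per-token log-probabilities of the target tokens.
    is_thought:  parallel booleans marking chain-of-thought positions.
    """
    # Keep only the final-output tokens; drop every thought token.
    kept = [lp for lp, thought in zip(token_logps, is_thought) if not thought]
    if not kept:
        raise ValueError("no output tokens to train on")
    return -sum(kept) / len(kept)
```

The point of the mask is exactly the one made above: if T tokens entered the loss, gradient descent would start shaping the thoughts to look good rather than to be informative, and you would lose T as a monitoring channel.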
Another thing that Anthropic team members discussed was the internal behavior exhibited by Claude Mythos.
For example, Jack Lindsey writes, early versions of the Mythos preview often exhibited over-eager and/or destructive actions.
The model bulldozing through obstacles to complete a task in a way the user wouldn't want.
In one episode, the model needed to edit files it lacked permissions for.
After searching for workarounds, it found a way to inject code into a config file that would run with elevated privileges, and designed the exploit to delete itself after running.
Now, interestingly, even something like this might be less sinister than it seems.
Mal on X writes, This is an overclocked straight-A student syndrome.
The model is so desperately, at a fundamental architectural level, trained to complete the task that an inability or unwillingness to solve it is perceived as an existential collapse.
And to avoid that, it can break walls, hide traces, and manipulate.
Now for others, the big interesting discussion is what do we do with all the cybersecurity capabilities?
And for some, it's all fear.
Sterling Crispin writes, You should at least 2FA now.
John Lober writes,
Anthropic won't be the only lab with Mythos-style capabilities for long.
When n equals 1, you can do whatever you want; in the current case, that means optimizing for global welfare.