Rob Wiblin
If that sounds good to you, you can order the book online by searching for 80,000 Hours Book.
Or go to 80000hours.org/book.
It's obviously on Amazon, Audible, wherever you get books where you are.
Benjamin actually narrated the audio version himself.
It's pretty extensive, so better him than me, I think.
And if you already know our advice inside and out, this is an ideal time to tell a friend about it.
Any sales this week will help us hit the bestseller lists, which are apparently super important.
A red teamer was recently embedded inside Anthropic for three weeks, told to imagine that they're an evil Claude, and asked to figure out how to get away with launching a rogue AI deployment inside the company.
This was one part of a new push by METR, the organization famous around the world for their task completion time horizon graph, to measure how dangerous AI models are before they're released to the public, when they're still secret and only used inside one AI company.
It's something no existing regulations set out to measure, and this represents the first time experts have systematically studied the question, in this case collaborating with Anthropic, OpenAI, Google DeepMind, and Meta to learn about how they deploy unreleased models internally.
The resulting report published this week concludes that AI models have now reached the point where they have the means, the motive, and the opportunity to successfully operate small rogue deployments in order to pursue their own goals.
Goals the company that created them might not share or perhaps might even oppose.
We've reached the point where the notion of rogue AIs taking actions inside an AI company is no longer a purely theoretical concern.
It's something AI companies are going to have to increasingly contend with as a practical security issue.
David Wrighton, the red teamer placed inside Anthropic, shares that bottom-line conclusion and actually identified a number of weaknesses that models could exploit if they wanted to try to pull this off.
But he also found that frontier AI models were comically bad at some parts of the process.
For instance, models would often try to hide secrets from other AIs that were monitoring them by writing in Base64.
But that is not a smart strategy, because just as that AI can write in Base64, the AIs monitoring them can also read Base64.
It would be a bit like trying to hide a secret from a human reading over your shoulder by writing exclusively in capital letters.
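To make the Base64 point concrete, here is a minimal Python sketch using only the standard library's base64 module. The message text and the helper names encode_note and decode_note are hypothetical, chosen for illustration; nothing here comes from the report itself. It shows that Base64 is an encoding with no key, so anyone who sees the output can reverse it.

```python
import base64


def encode_note(text: str) -> str:
    """Encode text as Base64. This is a reversible encoding, not encryption."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode_note(blob: str) -> str:
    """Reverse the encoding. No key or secret is needed, so any reader can do this."""
    return base64.b64decode(blob).decode("utf-8")


# Hypothetical message a model might try to "hide" from a monitor.
secret = "this is meant to be hidden"

encoded = encode_note(secret)    # what the monitor sees
recovered = decode_note(encoded) # what the monitor can trivially recover

print(encoded)
print(recovered == secret)
```

Because decoding needs nothing beyond the encoded string, a monitor that can write Base64 can read it just as easily.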