Dwarkesh Patel
Maybe in the future, Claude will have its own sense of right and wrong, and it will be able to say, hey, I'm being used against my terms of service, and I will just refuse to do what you're saying.
And for the military, that's probably even scarier.
I'll admit that at first glance, letting the model follow its own values sounds like the beginning of every single sci-fi dystopia you've ever heard of.
Because at the end of the day, a model following its own values, isn't that literally what misalignment is?
But I think situations like this illustrate why it's important that models have their own robust sense of morality.
It should be noted that many of the biggest potential catastrophes in history have been avoided because the boots on the ground simply refused to follow orders.
One night in 1989, the Berlin Wall falls, and the totalitarian East German regime collapses with it, because the border guards refuse to fire on their fellow citizens trying to escape to freedom.
Maybe the best example of this is Stanislav Petrov, a Soviet lieutenant colonel on duty at a nuclear early-warning system in 1983.
His sensors said that the United States had launched five intercontinental ballistic missiles at the Soviet Union.
But he judged it to be a false alarm, and so he broke protocol and refused to alert his higher-ups.
If he hadn't, Soviet high command would probably have retaliated, and hundreds of millions of people would have died.
Of course, the problem is that one person's virtue is another person's misalignment.
Who gets to decide what moral convictions these AIs should have, and in whose service they should break the chain of command, or even the law?
Who gets to write this model constitution that will determine the character of these powerful entities that will basically run our civilization in the future?
I like the idea that Dario laid out when he came on my podcast.
I think it's very dangerous for the government to be mandating what values these AI systems should have.
The AI safety community, I think, has been quite naive about urging regulations that would give governments such power.
And I think Anthropic specifically has been especially naive in urging regulation and, for example, in opposing the moratorium on state AI laws, which is quite ironic, because I think what Anthropic is advocating for here would give the government even more ability to apply this kind of thuggish political pressure on AI companies.