Menu
Sign In Search Podcasts Charts People & Topics Add Podcast API Blog Pricing

Dwarkesh

๐Ÿ‘ค Speaker
1735 total appearances

Appearances Over Time

Podcast Appearances

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

Like, we have seen a lot of these failure modes in the AI agents that exist already.

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

I would also cite, I can't remember exactly which paper this is, I think it's a Dan Hendricks one, where they looked at the hallucinate.

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

They found a vector for...

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

AI dishonesty.

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

They asked it a bunch of questions.

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

They told it, be dishonest a bunch of times until they figured out which weights were activated when it was dishonest.

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

And then they ran it through a bunch of things like this.

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

I think it was the source hallucination in particular.

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

And they found that it did activate the dishonesty vector.

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

I kind of want to call you out on the claim that groups of humans don't plot against other groups of humans.

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

I do think we are all descended from the groups of humans who successfully exterminated the other groups of humans, most of whom throughout history have been wiped out.

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

I think even with questions of class, race, gender, things like that,

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

there are many examples of the working class rising up and killing everybody else.

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

And if you look at why this happens, why this doesn't happen, it tends to happen in cases where one group has an overwhelming advantage.

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

This is relatively easy for them.

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

you tend to get more of a diffusion of power democracy where there are many different groups and none of them can really act on their own, and so they all have to form a coalition with each other.

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

I think we are expecting... There's also cases where it's very obvious who's part of what group.

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

So, for example, with class, it's hard to tell whether the middle class should support the working class versus the aristocrats.

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

I think with race, it's very easy to know whether you're black or white.

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

And so there have been many cases of one race kind of conspiring against another for a long time, like apartheid or any of the racial genocides that have happened.