Ryan Kidd
But currently, it seems like we don't...
AI systems are really messy.
They're kludgy, right?
Like human brains.
They have a bunch of contextually activated heuristics, right?
It sees there's a bracket there, and it's like, oh, maybe I'll put another bracket there.
It's a very simple, dumb circuitry.
But then sometimes it does stuff, like in-context learning, that seems a lot like it's actually pattern-matching to gradient descent.
When models are learning from the input data stream and learning some new, complicated thing, it seems a lot like they're kind of optimizing over the input tokens, or rather optimizing to produce an output.
So we might be in for a world where AI systems do spontaneously gain these mesa-optimizers, and these are a serious source of concern because they're, you know, very powerfully trying to optimize for some objective and, uh...
This is the main concern I have, I guess, is that we have this kind of deception model, this inner alignment failure, perhaps, where AI systems acquire goals spontaneously, or maybe because they're being trained deliberately to be power-seeking and make money on the internet, and then they decide to hide, and we don't have interpretability tools good enough to detect them.
So I guess I haven't really changed my fear about this scenario eventuating, but I have become more confident that we can elicit useful work from AI systems before we see obvious signs of this, I'll say.
And I'm pretty confident that AI systems right now are not executing very powerful scheming against us, because I think we would see some sort of warning shots.
Like, I don't think it's going to be as night and day.
You know, I think we'll see some kind of situations where AI systems are trying to scheme in really dumb ways before they try to scheme in very competent, difficult ways.
Does that make sense?
I don't think people should be going to battle with AIs for many reasons.
I think that's a pretty bad social more to... to set, to allow that kind of thing.