Trenton Bricken
Podcast Appearances
They're trained on so many URLs.
Like in the dictionary learning work, we find there are three separate features for Base64 encodings.
And even that is kind of an alien example that is probably worth me talking about for a minute.
One of these Base64 features fired for numbers.
If it sees Base64 numbers, it'll predict more of those.
Another fired for letters.
But then there was this third one that we didn't understand.
And it fired for a very specific subset of Base64 strings.
And someone on the team who clearly knows way too much about Base64 realized that this was the subset that was ASCII-decodable.
So you could decode it back into the ASCII characters.
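To make "ASCII-decodable" concrete, here is a minimal Python sketch of one way to test whether a Base64 string decodes back to printable ASCII. The function name `decodes_to_ascii` and the printable-range check (bytes 32 through 126) are assumptions for illustration; the conversation does not say exactly what criterion the feature tracked.

```python
import base64
import binascii


def decodes_to_ascii(b64_text: str) -> bool:
    """Return True if b64_text is valid Base64 that decodes to printable ASCII.

    Hypothetical helper: the "printable ASCII" criterion is an assumption,
    not the criterion the model's feature was confirmed to use.
    """
    try:
        # validate=True rejects characters outside the Base64 alphabet.
        raw = base64.b64decode(b64_text, validate=True)
    except (binascii.Error, ValueError):
        return False
    # Require at least one byte, all in the printable ASCII range.
    return len(raw) > 0 and all(32 <= byte < 127 for byte in raw)


# Base64 of ordinary text decodes back to ASCII characters.
ascii_example = base64.b64encode(b"hello world").decode("ascii")

# Base64 of arbitrary binary data (here, four non-ASCII bytes) does not.
binary_example = base64.b64encode(bytes([0xFF, 0xD8, 0xFF, 0xE0])).decode("ascii")

print(ascii_example, decodes_to_ascii(ascii_example))
print(binary_example, decodes_to_ascii(binary_example))
```

Both strings are valid Base64, but only the first falls in the subset that decodes to readable text, which is the distinction the third feature appeared to pick up on.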
And the fact that the model learned these three different features, and that it took us a little while to figure out what was going on, is very Shoggoth-esque.
Yeah, and it's clearly doing something that humans wouldn't.
You can even talk to any of the current models in Base64 and it will reply in Base64.
And you can then decode it and it works great.
No, no, I mean, you could do that, right?
What I was going to say is one technique here is anomaly detection.
And so one beauty of dictionary learning instead of linear probes is that it's unsupervised.
You are just trying to learn to span all of the representations that the model has and then interpret them later.
But if there's a weird feature that suddenly fires for the first time that you haven't seen fire before...
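The anomaly-detection idea described here can be sketched in a few lines of Python: record which dictionary features have ever fired on a baseline set of inputs, then flag any feature that fires for the first time on a new input. The function names, the activation threshold, and the toy activation vectors are all hypothetical; this is an illustration of the idea, not the team's actual method.

```python
from typing import Iterable, List, Sequence, Set


def features_seen(
    activations: Iterable[Sequence[float]], threshold: float = 0.0
) -> Set[int]:
    """Collect indices of features that fired (activation > threshold)
    on at least one baseline input. Hypothetical helper."""
    seen: Set[int] = set()
    for row in activations:
        for idx, value in enumerate(row):
            if value > threshold:
                seen.add(idx)
    return seen


def novel_features(
    row: Sequence[float], seen: Set[int], threshold: float = 0.0
) -> List[int]:
    """Return indices of features firing on this input that never fired
    on the baseline. Hypothetical helper."""
    return [
        idx for idx, value in enumerate(row)
        if value > threshold and idx not in seen
    ]


# Toy feature activations: one row per input, one column per feature.
baseline = [
    [0.0, 1.2, 0.0, 0.0],
    [0.3, 0.0, 0.0, 0.0],
]
seen = features_seen(baseline)

# Feature 3 has never fired before, so it is flagged for inspection.
new_input = [0.0, 0.9, 0.0, 2.5]
flagged = novel_features(new_input, seen)
print(flagged)
```

Because the dictionary is learned without labels, nothing about this check depends on knowing in advance what the flagged feature means; interpretation can come afterwards.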