Dr. Jeff Beck
The only thing I don't like about test-time training is the way the vast majority of the training is done.
So in a traditional energy-based model, you always find the minimum with respect to the latent variables, these extra weights, which, in the case of test-time training, are the subset of weights that you're allowed to change at test time.
When you do the training for a traditional energy-based model, you're allowed to make those changes throughout the entire course of training.
The way that we're often doing test-time training these days is we just do regular old neural network learning.
And then finally, when we get to the deployment phase, then we suddenly turn on these additional latents, which are basically some of the weights of the network, and we do an additional bit of learning at that point.
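That pipeline can be sketched concretely. The following is a purely illustrative toy in NumPy, not anything from the source: ordinary training is assumed to have produced a frozen weight block `W_base`, and at deployment only a separate block `W_adapt` (standing in for the "additional latents") takes gradient steps on a per-example objective. The names, the linear model, and the target-matching loss (a stand-in for whatever self-supervised objective is available at test time) are all made up for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: predictions come from (W_base + W_adapt) @ x.
# W_base is what ordinary pre-deployment training produced; it stays frozen.
# W_adapt plays the role of the "latents" that test-time training updates.
d = 4
W_base = rng.normal(size=(d, d))
W_adapt = np.zeros((d, d))

def predict(x, W_base, W_adapt):
    return (W_base + W_adapt) @ x

def tt_step(x, target, W_base, W_adapt, lr=0.5):
    """One test-time gradient step on 0.5 * ||pred - target||^2,
    taken with respect to W_adapt only (W_base is frozen)."""
    residual = predict(x, W_base, W_adapt) - target
    grad = np.outer(residual, x)            # d(loss)/d(W_adapt)
    return W_adapt - (lr / (x @ x)) * grad  # normalized step for stable decay

# At "deployment", adapt on a single test example's objective.
x = rng.normal(size=d)
target = rng.normal(size=d)
for _ in range(100):
    W_adapt = tt_step(x, target, W_base, W_adapt)

loss = 0.5 * np.sum((predict(x, W_base, W_adapt) - target) ** 2)
```

The point of the sketch is the split: `W_adapt` never moves during "training" here, which is exactly the mismatch being complained about, since the base weights were learned as if the adaptation mechanism did not exist.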
Now, again, not an expert here, but this seems unwise to me.
And the reason it seems unwise is because you didn't train the original network with that on.
You trained it in a completely supervised way.
Now, I'm sure that people are aware of this and it's been addressed in the literature, but I'm not personally aware of that.
I don't think that's how it's used in practice.
My take on it is that an energy-based model and a Bayesian model have a lot in common, right?
In many ways, well, literally in physics, energy is negative log probability, right?
Now, of course, there's the normalization factor that you don't need to worry about if you're just minimizing energy.
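Written out in the usual Boltzmann-form conventions (standard notation, not from the transcript), the correspondence being gestured at is:

```latex
p(x) = \frac{e^{-E(x)}}{Z}, \qquad Z = \int e^{-E(x)}\,dx,
\qquad \log p(x) = -E(x) - \log Z .
% Since \log Z is constant in x, minimizing energy equals maximizing probability:
\arg\min_x E(x) \;=\; \arg\max_x p(x).
```

So the normalizer $Z$ only matters when you need actual probabilities; for pure energy minimization it drops out, which is the point being made here.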
In a Bayesian framework, that's like saying, well, I'm not actually going to treat some of these latent variables in a probabilistic way. I'm just going to do maximum a posteriori (MAP) estimation on some of my variables and just be okay with that.
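In standard notation (again, notation assumed for illustration, with $z$ the latents and $E$ the energy), that move is:

```latex
% MAP / energy minimization: a point estimate instead of a posterior
\hat z_{\mathrm{MAP}} = \arg\max_z \, p(z \mid x) = \arg\min_z E(z, x),
% versus the fully Bayesian treatment, which keeps the whole posterior
p(z \mid x) \propto e^{-E(z, x)} .
```

Test-time training's inner optimization over the adaptable weights is exactly the first line, applied to a subset of the variables.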
That's one way to interpret the relationship between an energy-based model and a properly Bayesian model.
There's a happy medium here, though, right?
And the happy medium is that you don't have to just minimize the energy function.
You can calculate the curvature down there, too, do a Laplace approximation and call yourself a Bayesian again, right?
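Here is a minimal sketch of that Laplace step on a toy one-dimensional energy (the energy function, step sizes, and finite-difference derivatives are all illustrative choices, not from the source): find the minimum, measure the curvature there, and read off a Gaussian approximation to the posterior.

```python
import numpy as np

# Toy 1-D energy, E(z) = -log p(z) up to a constant (illustrative choice).
def energy(z):
    return (z - 2.0) ** 2 + 0.1 * z ** 4

# Finite-difference first derivative, for simple gradient descent.
def grad(z, eps=1e-5):
    return (energy(z + eps) - energy(z - eps)) / (2 * eps)

# Step 1: find the MAP / minimum-energy point.
z = 0.0
for _ in range(2000):
    z -= 0.01 * grad(z)
z_map = z

# Step 2: Laplace approximation. The curvature (second derivative) of the
# energy at the minimum gives a Gaussian q(z) = N(z_map, 1 / E''(z_map)).
def curvature(z, eps=1e-4):
    return (energy(z + eps) - 2 * energy(z) + energy(z - eps)) / eps ** 2

var = 1.0 / curvature(z_map)  # approximate posterior variance at the mode
```

The design point is that step 1 is exactly the energy minimization you were already doing; step 2 costs one extra curvature evaluation and upgrades the point estimate to a Gaussian posterior approximation.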