Zach Furman
Here $w \in \mathbb{R}^d$ is the parameter vector, and $f_w$ is the input-output map of the model at parameter $w$.
SGD minimizes a loss $L(w) = \frac{1}{n}\sum_{i=1}^{n} \ell(f_w(x_i), y_i)$ via the update $w_{t+1} = w_t - \tau \nabla_w \ell(f_{w_t}(x_i), y_i)$, where $(x_i, y_i)$ are training examples and labels, and $\tau$ is the learning rate.
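As a concrete sketch of the gradient-descent update just described (a minimal illustration assuming squared loss on a single example; the function names and the linear model are assumptions, not from the talk):

```python
import numpy as np

def sgd_step(w, x_i, y_i, f, grad_f, tau):
    """One SGD step on squared loss for a single example (x_i, y_i).

    w: parameter vector; f(w, x) is the model's output f_w(x);
    grad_f(w, x) is the gradient of f(w, x) with respect to w;
    tau: learning rate.
    """
    residual = f(w, x_i) - y_i
    # Gradient of 0.5 * (f_w(x_i) - y_i)^2 with respect to w
    grad_loss = residual * grad_f(w, x_i)
    return w - tau * grad_loss

# Illustrative model: linear map f_w(x) = w . x
f = lambda w, x: w @ x
grad_f = lambda w, x: x
w = np.zeros(2)
w = sgd_step(w, np.array([1.0, 2.0]), 3.0, f, grad_f, tau=0.1)
# After one step, w has moved toward fitting the example: [0.3, 0.6]
```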
In the most common versions of supervised learning, we can focus even further.
The loss function itself can be decomposed into two pieces: the parameter-function map $w \mapsto f_w$, and a statistical distance to the target distribution.
The overall loss function can be written as a composition of the parameter-function map and some statistical distance to the target distribution; for example, for mean squared error, $L(w) = D(f_w)$, where $D(g) = \mathbb{E}_x\big[(g(x) - f^*(x))^2\big]$ and $f^*$ is the target function.
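This composition reads off directly in code (a minimal numpy sketch; the linear model and the sample-average in place of the expectation are assumptions for illustration):

```python
import numpy as np

def parameter_function_map(w):
    """w -> f_w: the function realized by parameters w.
    Illustrative model: f_w(x) = w[0] * x + w[1]."""
    return lambda x: w[0] * x + w[1]

def D(g, f_star, xs):
    """Statistical distance on function space: mean squared error
    between g and the target f_star, averaged over sample points xs."""
    return np.mean((g(xs) - f_star(xs)) ** 2)

def L(w, f_star, xs):
    """Overall loss = statistical distance composed with the
    parameter-function map."""
    return D(parameter_function_map(w), f_star, xs)

xs = np.linspace(-1, 1, 100)
f_star = lambda x: 2.0 * x + 1.0
loss_at_target = L(np.array([2.0, 1.0]), f_star, xs)  # w realizes f_star exactly
```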
Note that the statistical distance $D$ here is a fairly simple object: almost always it is a distance on function space, convex, and of relatively simple functional form.
Further, it is the same distance one would use across many different architectures, including ones which do not achieve the remarkable performance of neural networks, for example polynomial approximation.
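To illustrate this architecture-independence (an illustrative sketch, not from the talk): the very same function-space distance can score a polynomial model and a small neural network; only the parameter-function map differs between the two.

```python
import numpy as np

def D(g, f_star, xs):
    """The same function-space distance (MSE) for any architecture."""
    return np.mean((g(xs) - f_star(xs)) ** 2)

# Two different parameter-function maps, one shared distance D.
def poly_map(w):
    """Polynomial approximation: f_w(x) = sum_k w[k] * x**k."""
    return lambda x: sum(c * x ** k for k, c in enumerate(w))

def mlp_map(w):
    """Tiny one-hidden-unit network: f_w(x) = w[2] * tanh(w[0] * x + w[1])."""
    return lambda x: w[2] * np.tanh(w[0] * x + w[1])

xs = np.linspace(-1, 1, 50)
f_star = np.tanh

poly_loss = D(poly_map([0.0, 1.0]), f_star, xs)      # linear polynomial: nonzero
mlp_loss = D(mlp_map([1.0, 0.0, 1.0]), f_star, xs)   # realizes tanh exactly: 0.0
```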
Therefore one expects the question of learnability and inductive biases to largely come down to the parameter-function map $f_w$ rather than the function-space loss.
If the above reasoning is correct, then in order to understand how SGD is able to potentially perform some kind of program synthesis, we merely need to understand the properties of the parameter-function map.
This would be a substantial simplification. Further, it relates learning dynamics to our earlier representation problem.