Jacob Hilton
It has 10 parameters and almost perfect accuracy, with an error rate of roughly 1 in 13,000.
This means that the difference between the model's logits, complex formula omitted from the narration, is almost always negative when complex formula omitted from the narration, and positive when complex formula omitted from the narration.
We'd like to mechanistically explain why the model has this property.
To do this, note first that because the model uses ReLU activations and there are no biases, the logit difference delta is a piecewise linear function of x subscript 0 and x subscript 1, in which the pieces are bounded by rays through the origin in the x subscript 0, x subscript 1 plane.
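The reason the pieces are bounded by rays through the origin is that a bias-free ReLU network is positively homogeneous: scaling the input by any c > 0 scales every pre-activation by c, so which neurons are active depends only on the direction of the input. The sketch below checks this numerically. It assumes a hypothetical 10-parameter recurrent architecture with random weights; the names `W_in`, `W_hh`, `W_out` and `delta` are illustrative and are not taken from the model discussed here.

```python
import numpy as np

# Hypothetical random weights (assumption): 2 + 4 + 4 = 10 parameters, no biases.
rng = np.random.default_rng(0)
W_in = rng.normal(size=2)        # scalar input -> 2 hidden neurons
W_hh = rng.normal(size=(2, 2))   # hidden -> hidden
W_out = rng.normal(size=(2, 2))  # hidden -> 2 logits


def relu(z):
    return np.maximum(z, 0.0)


def delta(x0, x1):
    """Difference between the two logits for the input sequence (x0, x1)."""
    h = relu(W_in * x0)               # first step, starting from a zero hidden state
    h = relu(W_hh @ h + W_in * x1)    # second step
    logits = W_out @ h
    return logits[0] - logits[1]


# Positive homogeneity: delta(c * x) == c * delta(x) for every c > 0.
x0, x1 = 0.3, -1.2
for c in (0.5, 3.0, 17.0):
    print(c, np.isclose(delta(c * x0, c * x1), c * delta(x0, x1)))
```

Because the sign of delta is unchanged along any ray from the origin, the decision boundary must itself be made up of such rays.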
Now, we can standardize the model to obtain an exactly equivalent model for which the entries of complex formula omitted from the narration lie in complex formula omitted from the narration, by rescaling the neurons of the hidden state.
Once we do this, we see that complex formula omitted from the narration.
From these observations, we can prove that on each linear piece of delta, complex formula omitted from the narration, with complex formula omitted from the narration.
Moreover, the pieces of delta are arranged in the x subscript 0, x subscript 1 plane according to the following diagram.
There's an image here.
Here, a double arrow indicates that a boundary lies somewhere between its neighboring axis and the dashed line, complex formula omitted from the narration, but we don't need to worry about exactly where it lies within this range.
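The standardization step mentioned earlier, rescaling hidden neurons to get an exactly equivalent model, relies on the fact that ReLU commutes with multiplication by a positive number. A minimal sketch follows, again assuming a hypothetical 10-parameter recurrent architecture with random weights. The narration omits which matrix is normalized and to which set of values, so the choice here (scaling each neuron so that its input weight becomes plus or minus 1) is purely an illustrative assumption.

```python
import numpy as np

# Hypothetical random weights (assumption), no biases.
rng = np.random.default_rng(0)
W_in = rng.normal(size=2)
W_hh = rng.normal(size=(2, 2))
W_out = rng.normal(size=(2, 2))


def logits(x0, x1, W_in, W_hh, W_out):
    h = np.maximum(W_in * x0, 0.0)
    h = np.maximum(W_hh @ h + W_in * x1, 0.0)
    return W_out @ h


# One positive scale per hidden neuron; illustrative choice: make |W_in| == 1.
scale = 1.0 / np.abs(W_in)
D = np.diag(scale)
D_inv = np.diag(1.0 / scale)

# Scale incoming weights by D and outgoing weights by D^-1.
W_in_std = scale * W_in
W_hh_std = D @ W_hh @ D_inv
W_out_std = W_out @ D_inv

x0, x1 = 0.7, -0.4
original = logits(x0, x1, W_in, W_hh, W_out)
standardized = logits(x0, x1, W_in_std, W_hh_std, W_out_std)
print(np.allclose(original, standardized))
```

The rescaled hidden state is just the original hidden state multiplied by the positive scales, which cancel against the inverse scales on the outgoing weights, so the logits are unchanged.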
Looking at the coefficients of each linear piece, we observe that:
In the two blue regions, we have underscore or underscore zero.
In the two green regions, we have complex formula omitted from the narration.
In the two yellow regions, we have complex formula omitted from the narration, to within around one part in complex formula omitted from the narration.
This implies that:
Complex formula omitted from the narration in the blue and green regions above the line, complex formula omitted from the narration.
Complex formula omitted from the narration in the blue and green regions below the line, complex formula omitted from the narration.
Complex formula omitted from the narration is approximately proportional to complex formula omitted from the narration in the two yellow regions.
Together, these imply that the model has almost 100% accuracy.
More precisely, the error rate is the fraction of the unit disc lying between the model's decision boundary and the line, complex formula omitted from the narration, which is around 1 in complex formula omitted from the narration.
This is very close to the model's empirically measured error rate.
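Because the sign of a positively homogeneous function depends only on the direction of its input, the fraction of the unit disc on which the model is wrong equals the fraction of angles on which it is wrong. The sketch below estimates that fraction on an angular grid. It does not use the real model: `toy_delta` is a hypothetical stand-in whose decision boundary is tilted slightly away from the diagonal, and the convention that a positive delta should coincide with x1 > x0 is an assumption made for illustration.

```python
import numpy as np


def angular_error_fraction(delta, n_angles=1_000_000):
    """Fraction of directions where sign(delta) disagrees with the target."""
    # Midpoints of equal-width angular bins covering the full circle.
    theta = (np.arange(n_angles) + 0.5) * (2 * np.pi / n_angles)
    x0, x1 = np.cos(theta), np.sin(theta)
    predicted = delta(x0, x1) > 0
    target = x1 > x0  # assumed sign convention
    return np.mean(predicted != target)


eps = 1e-3  # hypothetical tilt of the decision boundary


def toy_delta(x0, x1):
    # Linear stand-in: the boundary is the line x1 = (1 + eps) * x0.
    return x1 - (1 + eps) * x0


estimate = angular_error_fraction(toy_delta)
# Two wedges, each of angle arctan(1 + eps) - pi/4, out of a full turn of 2*pi.
exact = (np.arctan(1 + eps) - np.pi / 4) / np.pi
print(estimate, exact)
```

For a real model, the same function could be called with the model's own logit difference in place of `toy_delta`, provided it accepts arrays of inputs.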
Mean squared error versus compute.