Let S be our set of training examples that are related using the following equation:
$$y^{(i)} = h^*(x^{(i)}) + \xi^{(i)}$$
where $h^*$ is the best possible classifier that maps the relationship between $x$ and $y$, and $\xi^{(i)} \sim \mathcal{N}(0, \sigma^2)$ is the noise in the $i$th example.
Also, let $h_S$ be our best-fit model for the dataset $S$. The Mean Squared Error (MSE) can be written as:
$$\mathbb{E}\left[(y - h_S(x))^2\right]$$
Also, let $h_{avg}(x) = \mathbb{E}[h_S(x)]$ be the "average model": the model obtained by drawing an infinite number of datasets, training on them, and averaging their predictions on $x$.
Then, the Mean Squared Error can be further broken down into three components.
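Written out explicitly (a standard form of this decomposition, using $h_{avg}$ as defined above and writing $\mathrm{Var}(h_S(x)) = \mathbb{E}[(h_S(x) - h_{avg}(x))^2]$):
$$\mathbb{E}\left[(y - h_S(x))^2\right] = \underbrace{\sigma^2}_{\text{unavoidable error}} + \underbrace{\left(h^*(x) - h_{avg}(x)\right)^2}_{\text{bias}} + \underbrace{\mathrm{Var}(h_S(x))}_{\text{variance}}$$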
The first part, $\sigma^2$, is the unavoidable error: the noise in the data that cannot be explained by any model, regardless of its complexity.
The bias, $(h^*(x) - h_{avg}(x))^2$, is the error introduced by the "expressivity handicap" of our classifier. This error occurs because of underfitting.
The variance, $\mathrm{Var}(h_S(x))$, measures how much the model's predictions would change if it were trained on a different dataset.
See derivation
The Bias-Variance tradeoff tells us that as we increase the number of parameters in our neural network, the test error initially decreases because the bias is decreasing. After a certain point, however, the variance starts increasing faster than the bias is decreasing, and the test error starts to increase.
In reality, however, we often see a double descent phenomenon, wherein the test error starts to decrease again around the point where the number of parameters $d$ is approximately equal to the number of training examples $n$. This is called the over-parameterization regime.
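As an illustration, here is a minimal numerical sketch of double descent. All choices here (the random ReLU feature map, the synthetic "teacher" function, and the minimum-norm least-squares fit via `np.linalg.pinv`) are assumptions made for this example, not something from the notes. Sweeping the number of random features $d$ past the number of training examples $n$ typically produces a spike in test error near $d \approx n$, followed by a second descent in the over-parameterized regime:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative assumptions only): data from a noisy nonlinear
# "teacher", fit with a random-ReLU-feature linear model using the
# minimum-norm least-squares solution.
n_train, n_test, x_dim, noise = 30, 2000, 5, 0.1
w_true = rng.normal(size=x_dim)

def make_data(n):
    X = rng.normal(size=(n, x_dim))
    y = np.tanh(X @ w_true) + noise * rng.normal(size=n)
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

for d in [5, 10, 20, 25, 28, 30, 32, 40, 80, 200, 1000]:
    # Random feature map phi(x) = ReLU(W^T x), redrawn for each d.
    W = rng.normal(size=(x_dim, d)) / np.sqrt(x_dim)
    Phi_tr, Phi_te = np.maximum(X_tr @ W, 0), np.maximum(X_te @ W, 0)
    # pinv gives the minimum-norm solution both when d < n and when d > n.
    theta = np.linalg.pinv(Phi_tr) @ y_tr
    test_mse = np.mean((Phi_te @ theta - y_te) ** 2)
    print(f"d = {d:4d}   test MSE = {test_mse:.3f}")
```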
Complexity Bounds
For any hypothesis $h$, the true error is given by:
$$\epsilon(h) = P_{(x,y)\sim \mathcal{D}}\left(h(x) \neq y\right)$$
However, since we have no way of determining the underlying probability distribution $\mathcal{D}$, we cannot determine the true error of the hypothesis. Instead, we estimate the empirical error of the hypothesis over our $n$ training examples.
$$\hat{\epsilon}(h) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\{h(x_i) \neq y_i\}$$
Let $\mathcal{H}$ be the set of all possible hypotheses that we are considering. To find the hypothesis that minimizes the empirical error over our training set, we find:
$$\hat{h} = \arg\min_{h \in \mathcal{H}} \hat{\epsilon}(h)$$
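As a concrete (made-up) example of this empirical risk minimization over a finite hypothesis class, take $\mathcal{H}$ to be a small set of threshold classifiers $h_t(x) = \mathbb{1}\{x > t\}$ on noisy 1-D data; the data-generating process and the thresholds below are purely illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data: true decision boundary at 0.6, with 10% label noise.
n = 100
x = rng.uniform(0, 1, size=n)
y = (x > 0.6).astype(int)
y = np.where(rng.random(n) < 0.1, 1 - y, y)

# Finite hypothesis class H = {h_t : h_t(x) = 1{x > t}} for 21 thresholds t.
thresholds = np.linspace(0, 1, 21)

def empirical_error(t):
    # \hat{epsilon}(h_t) = (1/n) * sum_i 1{h_t(x_i) != y_i}
    return np.mean((x > t).astype(int) != y)

errors = [empirical_error(t) for t in thresholds]
t_hat = thresholds[int(np.argmin(errors))]   # \hat{h} = argmin over H
print(f"selected threshold t = {t_hat:.2f}, empirical error = {min(errors):.2f}")
```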
Hoeffding Inequality
The Hoeffding inequality states that for $n$ independent random variables $Z_1, \ldots, Z_n$ drawn from a Bernoulli distribution, i.e. $P(Z_i = 1) = \phi$ and $P(Z_i = 0) = 1 - \phi$, the following inequality holds:
$$P\left(|\phi - \hat{\phi}| > \gamma\right) \leq 2\exp(-2\gamma^2 n)$$
where $\hat{\phi} = \frac{1}{n}\sum_{i=1}^{n} Z_i$ and $\gamma$ is some constant greater than 0.
Imagine a biased coin that comes up heads with probability $\phi$. We toss this coin $n$ times and record the fraction of tosses that came up heads. We denote this fraction by $\hat{\phi}$.
The probability that the true probability of getting heads differs from our estimate by more than $\gamma$ is denoted by $P(|\phi - \hat{\phi}| > \gamma)$. This is always less than or equal to $2\exp(-2\gamma^2 n)$.
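A quick Monte-Carlo check of this coin example (the specific values of $\phi$, $n$, $\gamma$ and the number of simulated runs below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate P(|phi - phi_hat| > gamma) by simulation and compare it with the
# Hoeffding bound 2 * exp(-2 * gamma^2 * n).
phi, n, gamma, trials = 0.3, 200, 0.05, 20_000

flips = rng.random((trials, n)) < phi        # each row is one run of n tosses
phi_hat = flips.mean(axis=1)                 # empirical head frequency per run
empirical_prob = np.mean(np.abs(phi - phi_hat) > gamma)
bound = 2 * np.exp(-2 * gamma**2 * n)

print(f"simulated  P(|phi - phi_hat| > gamma) ~ {empirical_prob:.4f}")
print(f"Hoeffding bound                       = {bound:.4f}")
```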
Using the Hoeffding inequality, we note that the true error and the empirical error of any single, fixed hypothesis $h_i \in \mathcal{H}$ satisfy the following inequality:
$$P\left(|\epsilon(h_i) - \hat{\epsilon}(h_i)| > \gamma\right) \leq 2\exp(-2\gamma^2 n)$$
Now we want to find an inequality for our entire set of hypotheses $\mathcal{H} = \{h_1, \ldots, h_k\}$. We see that the following inequality holds:
$$P\left(\forall h \in \mathcal{H}.\ |\epsilon(h) - \hat{\epsilon}(h)| \leq \gamma\right) \geq 1 - 2k\exp(-2\gamma^2 n)$$
We call this the uniform convergence result.
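In brief, this follows from a union bound over the $k$ hypotheses combined with the Hoeffding bound above:
$$P\left(\exists h \in \mathcal{H}.\ |\epsilon(h) - \hat{\epsilon}(h)| > \gamma\right) \leq \sum_{i=1}^{k} P\left(|\epsilon(h_i) - \hat{\epsilon}(h_i)| > \gamma\right) \leq 2k\exp(-2\gamma^2 n)$$
Taking the complement of the event on the left gives the stated result.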
This means that as $n$ increases, the lower bound on the probability that the true error is close to the empirical error (for every hypothesis) gets larger. Whereas, as we increase the number of hypotheses $k$ in our set, this lower bound gets smaller.
In the context of learning, we can say that the more complex our model (the larger $k$), the weaker our guarantee that the empirical error is a good proxy for the true error. And the more training examples we have, the stronger that guarantee becomes.
See derivation
Now let $\delta = 2k\exp(-2\gamma^2 n)$.
We see that if we want the probability that the true error is within $\gamma$ of the empirical error for all hypotheses under our consideration to be at least $1 - \delta$, then $n$ needs to be at least as large as:
$$n \geq \frac{1}{2\gamma^2}\log\frac{2k}{\delta}$$
See derivation
Similarly, we can also see that given $k$ and $n$, with probability at least $1 - \delta$, the difference between the true error and the empirical error (for all hypotheses in our set) will be at most:
$$|\epsilon(h) - \hat{\epsilon}(h)| \leq \sqrt{\frac{1}{2n}\log\frac{2k}{\delta}}$$
See derivation
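As a quick sanity check with made-up numbers: for $k = 10{,}000$ hypotheses, $\gamma = 0.1$ and $\delta = 0.05$, the sample-complexity bound asks for
$$n \geq \frac{1}{2(0.1)^2}\log\frac{2 \cdot 10000}{0.05} \approx 50 \times 12.9 \approx 645$$
training examples. Conversely, plugging $n = 645$ (with the same $k$ and $\delta$) back into the gap bound gives $\sqrt{\frac{1}{2 \cdot 645}\log\frac{2 \cdot 10000}{0.05}} \approx 0.1$, as expected.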
Next, in our set of hypotheses $\mathcal{H}$, let $h^*$ be the hypothesis that minimizes the true error and $\hat{h}$ be the hypothesis that minimizes our empirical error.
Using our uniform convergence assumption, we can see that:
$$\epsilon(\hat{h}) \leq \epsilon(h^*) + 2\sqrt{\frac{1}{2n}\log\frac{2k}{\delta}}$$
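In brief, this follows in three steps, writing $\gamma = \sqrt{\frac{1}{2n}\log\frac{2k}{\delta}}$ for the uniform convergence gap:
$$\epsilon(\hat{h}) \leq \hat{\epsilon}(\hat{h}) + \gamma \leq \hat{\epsilon}(h^*) + \gamma \leq \epsilon(h^*) + 2\gamma$$
The first and third steps use uniform convergence, and the middle step uses the fact that $\hat{h}$ minimizes the empirical error over $\mathcal{H}$.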
With probability $1 - \delta$, the true error of our selected hypothesis is at most the true error of the best hypothesis in our set plus a term that depends on the number of hypotheses and the number of training examples.
The first term on the right, $\epsilon(h^*)$, can be thought of as the bias, and the second term as the variance. As we increase $k$, the first term either stays the same or potentially decreases, whereas the second term increases.
This is similar to what we saw in the Bias-Variance tradeoff. As we increase the complexity of our model, variance increases and the potential for our model to overfit also increases.