Approx Estimation Error & ERM
Setup / Assumptions
- There exists a data distribution $D$ such that $(x, y) \sim D$.
- Samples are drawn independently (i.i.d.) from $D$.
Bias & Variance
Bias: The difference between the center (mean) of the estimator's sampling distribution and the true parameter value.
Variance: The spread of the estimator's sampling distribution.
To reduce the variance, we can use the following techniques (see the sketch after this list):
- Increase the number of training samples.
- Apply regularization.
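Below is a minimal sketch (my own toy setup, not tied to any particular dataset) showing both effects: refitting a slope on more samples shrinks the variance of the estimate, and an L2 (ridge) penalty also shrinks it at the cost of some bias. The constants (true slope, noise level, the penalty $\lambda$) are made up for illustration.

```python
# A toy bias/variance simulation: fit the slope of y = true_w * x + noise
# many times on fresh samples, and measure the bias and variance of the fit.
import numpy as np

rng = np.random.default_rng(0)
true_w = 2.0  # assumed "true" slope, chosen only for illustration

def fit_slope(m, lam):
    """Fit a 1-D slope by ridge regression (lam = 0 gives plain least squares)."""
    x = rng.uniform(-1.0, 1.0, size=m)
    y = true_w * x + rng.normal(0.0, 1.0, size=m)
    # Closed-form single-feature ridge solution: w = <x, y> / (<x, x> + lam)
    return (x @ y) / (x @ x + lam)

def bias_and_variance(m, lam, trials=2000):
    """Bias and variance of the fitted slope across repeated training sets."""
    ws = np.array([fit_slope(m, lam) for _ in range(trials)])
    return ws.mean() - true_w, ws.var()

for m, lam in [(10, 0.0), (100, 0.0), (10, 5.0)]:
    bias, var = bias_and_variance(m, lam)
    print(f"m={m:3d}  lambda={lam:3.1f}  bias={bias:+.3f}  variance={var:.4f}")
```

With these settings, more samples shrink the variance without adding bias, while the ridge penalty shrinks the variance at the cost of a noticeable (negative) bias.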
Approximation & Estimation Error
Suppose $\mathbb{H}$ is a hypothesis class, $h^*$ is the best hypothesis in $\mathbb{H}$ and $g$ is the best hypothesis globally. Given that $\hat{h}$ is the hypothesis learned from the training data, we have the following error decomposition:
\[\begin{aligned} \epsilon(h) &: \text{Risk / Generalization Error} \\ &= E_{(x, y) \sim D} [ \mathbb{I}\{h(x) \neq y\}] \\ \\ \hat{\epsilon}(h) &: \text{Empirical Risk} \\ &= \frac{1}{m} \sum_{i=1}^{m} \mathbb{I}\{h(x^{(i)}) \neq y^{(i)}\} \\ \\ \epsilon(g) &: \text{Bayes Error / Irreducible Error} \\ \\ \epsilon(h^*) - \epsilon(g) &: \text{Approximation Error} \\ \\ \epsilon(\hat{h}) - \epsilon(h^*) &: \text{Estimation Error} \\ \\ \epsilon(\hat{h}) &= \text{Estimation Error } + \text{Approximation Error } + \text{Irreducible Error} \end{aligned}\]We can further decompose the estimation error into estimation bias and estimation variance:
\[\begin{aligned} \epsilon(\hat{h}) &= \text{Estimation Bias} + \text{Estimation Variance} + \text{Approximation Error} + \text{Irreducible Error} \\ &= \text{Bias} + \text{Variance} + \text{Irreducible Error} \\ \end{aligned}\]where:
\[\begin{aligned} \text{Variance} &= \text{Estimation Variance} \\ \text{Bias} &= \text{Estimation Bias} + \text{Approximation Error} \\ \end{aligned}\]
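As a quick numeric illustration of the first decomposition (the numbers are made up): suppose $\epsilon(g) = 0.02$, $\epsilon(h^*) = 0.07$, and $\epsilon(\hat{h}) = 0.12$. Then:
\[\begin{aligned} \text{Approximation Error} &= \epsilon(h^*) - \epsilon(g) = 0.05 \\ \text{Estimation Error} &= \epsilon(\hat{h}) - \epsilon(h^*) = 0.05 \\ \epsilon(\hat{h}) &= 0.05 + 0.05 + 0.02 = 0.12 \end{aligned}\]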
Empirical Risk Minimization (ERM)
\[\hat{h}_{ERM} = \arg \min_{h \in \mathbb{H}} \frac{1}{m} \sum_{i=1}^{m} \mathbb{I}\{h(x^{(i)}) \neq y^{(i)}\}\]
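A minimal ERM sketch under a toy setup of my own choosing: the hypothesis class is a finite grid of threshold classifiers $h_t(x) = \mathbb{I}\{x \geq t\}$, and $\hat{h}_{ERM}$ is the threshold with the lowest empirical 0-1 risk on the sample.

```python
# A toy ERM example over a finite hypothesis class of threshold classifiers
# h_t(x) = 1{x >= t}: pick the threshold with the lowest empirical 0-1 risk.
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical training data: the true rule is y = 1{x >= 0.3}, plus 10% label noise.
m = 200
x = rng.uniform(0.0, 1.0, size=m)
y = (x >= 0.3).astype(int)
flip = rng.random(m) < 0.1
y[flip] = 1 - y[flip]

thresholds = np.linspace(0.0, 1.0, 21)  # finite class H with k = 21 hypotheses

def empirical_risk(t):
    """Fraction of training points misclassified by h_t(x) = 1{x >= t}."""
    return np.mean((x >= t).astype(int) != y)

risks = np.array([empirical_risk(t) for t in thresholds])
t_hat = thresholds[risks.argmin()]  # the ERM hypothesis
print(f"ERM threshold: {t_hat:.2f}  empirical risk: {risks.min():.3f}")
```

Because $\mathbb{H}$ here is finite ($k = 21$), the uniform convergence bound below applies to it directly.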
Uniform Convergence
Uniform convergence relates two pairs of quantities:
- $\hat{\epsilon}(h)$ vs. $\epsilon(h)$
- $\epsilon(\hat{h})$ vs. $\epsilon(h^*)$
For a fixed hypothesis $h_i \in \mathbb{H}$, Hoeffding's inequality gives:
\[P(\vert \hat{\epsilon}(h_i) - \epsilon(h_i) \vert > \gamma) \leq 2 \exp(-2m\gamma^2)\]Now consider a finite hypothesis class $\mathbb{H}$ with $\vert \mathbb{H} \vert = k$. Applying the union bound over all $k$ hypotheses, we have:
\[\begin{aligned} &P(\exists h \in \mathbb{H}, \vert \hat{\epsilon}_{S}(h) - \epsilon(h) \vert > \gamma) \leq 2k \exp(-2m\gamma^2) \\ \implies &P(\forall h \in \mathbb{H}, \vert \hat{\epsilon}_{S}(h) - \epsilon(h) \vert \leq \gamma) \geq 1 - 2k \exp(-2m\gamma^2) \\ \end{aligned}\]
- $\delta = 2k \exp(-2m\gamma^2)$ is the failure probability.
- $\gamma$ is the margin of error.
- $m$ is the number of samples.
- $\hat{\epsilon}_{S}(h)$ is the empirical risk of $h$ on the sample $S$.
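Setting $\delta = 2k \exp(-2m\gamma^2)$ and solving for $m$ gives the sample complexity $m \geq \frac{1}{2\gamma^2} \log\frac{2k}{\delta}$. A tiny helper (the function name and numbers are mine, for illustration only) evaluates this:

```python
# Invert delta = 2k * exp(-2 m gamma^2) to get the number of samples m that
# guarantees uniform convergence with margin gamma and failure probability delta.
import math

def sample_size(k, gamma, delta):
    """Smallest m satisfying 2k * exp(-2 m gamma^2) <= delta."""
    return math.ceil(math.log(2 * k / delta) / (2 * gamma ** 2))

# Hypothetical example: k = 1000 hypotheses, gamma = 0.05, delta = 0.01
print(sample_size(1000, 0.05, 0.01))  # -> 2442
```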
Tools
Union Bound
\[P(A \cup B) \leq P(A) + P(B)\]More generally, for events $A_1, \ldots, A_k$: $P(A_1 \cup \cdots \cup A_k) \leq \sum_{i=1}^{k} P(A_i)$, which is what the finite-class bound above uses.
Hoeffding’s Inequality
Let $Z_1, Z_2, \ldots, Z_m \sim \text{Bern}(\phi)$ be i.i.d. and $\hat{\phi} = \frac{1}{m} \sum_{i=1}^{m} Z_i$. Then, for any $\gamma > 0$, we have:
\[P(\vert \hat{\phi} - \phi \vert > \gamma) \leq 2 \exp(-2m\gamma^2)\]
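A quick Monte Carlo sanity check of this bound, with assumed values $\phi = 0.3$, $m = 50$, $\gamma = 0.1$ chosen only for illustration:

```python
# Monte Carlo check of Hoeffding's inequality for Bernoulli(phi) samples.
import numpy as np

rng = np.random.default_rng(2)
phi, m, gamma, trials = 0.3, 50, 0.1, 100_000

# Draw `trials` independent datasets of size m and record how often the
# empirical mean deviates from phi by more than gamma.
z = rng.random((trials, m)) < phi
phi_hat = z.mean(axis=1)
freq = np.mean(np.abs(phi_hat - phi) > gamma)
bound = 2 * np.exp(-2 * m * gamma ** 2)

print(f"empirical P(|phi_hat - phi| > gamma) = {freq:.4f}")
print(f"Hoeffding bound                      = {bound:.4f}")
```

The observed frequency typically lands well below the bound, which is expected since Hoeffding's inequality is distribution-free and therefore loose.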
VC Dimension
To handle infinite hypothesis classes, we use the VC dimension: the VC dimension $VC(\mathbb{H})$ of a hypothesis class $\mathbb{H}$ is the size of the largest set of points that $\mathbb{H}$ can shatter (label in every possible way). For example, linear classifiers in $\mathbb{R}^2$ have VC dimension 3. With probability at least $1 - \delta$:
\[\epsilon(\hat{h}) \leq \epsilon(h^*) + O\left(\sqrt{\frac{VC(\mathbb{H})}{m} \log\frac{m}{VC(\mathbb{H})} + \frac{1}{m} \log\frac{1}{\delta}}\right)\]
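To get a feel for how this gap scales, here is a rough sketch (my own; the constant hidden in $O(\cdot)$ is set to 1, so the numbers only show the trend, not an actual guarantee):

```python
# Evaluate the VC generalization-gap term with the hidden O(.) constant set to 1,
# just to see how it shrinks as the sample size m grows.
import math

def vc_gap(m, vc_dim, delta):
    """sqrt( (d/m) * log(m/d) + (1/m) * log(1/delta) ), constant factor ignored."""
    d = vc_dim
    return math.sqrt((d / m) * math.log(m / d) + (1.0 / m) * math.log(1.0 / delta))

for m in (100, 1_000, 10_000, 100_000):
    print(f"m={m:6d}  gap term ~ {vc_gap(m, vc_dim=10, delta=0.05):.3f}")
```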