Approx Estimation Error & ERM
Setup / Assumptions
- There exists a data distribution $D$ such that $(x, y) \sim D$.
- Samples are drawn independently (i.i.d.) from $D$.
Bias & Variance
Bias: The difference between the center (mean) of the estimator's sampling distribution and the true parameter value.
Variance: The spread of the estimator's sampling distribution.
To reduce the variance, we can use the following techniques (see the sketch after this list):
- Increase the number of training samples.
- Apply regularization.
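Below is a minimal sketch (my own toy setup, not tied to any particular dataset) showing both effects: refitting a slope on more samples shrinks the variance of the estimate, and an L2 (ridge) penalty also shrinks it at the cost of some bias. The constants (true slope, noise level, the penalty $\lambda$) are made up for illustration.

```python
# A toy bias/variance simulation: fit the slope of y = true_w * x + noise
# many times on fresh samples, and measure the bias and variance of the fit.
import numpy as np

rng = np.random.default_rng(0)
true_w = 2.0  # assumed "true" slope, chosen only for illustration

def fit_slope(m, lam):
    """Fit a 1-D slope by ridge regression (lam = 0 gives plain least squares)."""
    x = rng.uniform(-1.0, 1.0, size=m)
    y = true_w * x + rng.normal(0.0, 1.0, size=m)
    # Closed-form single-feature ridge solution: w = <x, y> / (<x, x> + lam)
    return (x @ y) / (x @ x + lam)

def bias_and_variance(m, lam, trials=2000):
    """Bias and variance of the fitted slope across repeated training sets."""
    ws = np.array([fit_slope(m, lam) for _ in range(trials)])
    return ws.mean() - true_w, ws.var()

for m, lam in [(10, 0.0), (100, 0.0), (10, 5.0)]:
    bias, var = bias_and_variance(m, lam)
    print(f"m={m:3d}  lambda={lam:3.1f}  bias={bias:+.3f}  variance={var:.4f}")
```

With these settings, more samples shrink the variance without adding bias, while the ridge penalty shrinks the variance at the cost of a noticeable (negative) bias.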
Approximation & Estimation Error
Suppose $\mathbb{H}$ is a hypothesis class, $h^*$ is the best hypothesis in $\mathbb{H}$ and $g$ is the best hypothesis globally. Given that $\hat{h}$ is the hypothesis learned from the training data, we have the following error decomposition:
\[\begin{aligned} \epsilon(h) &: \text{Risk / Generalization Error} \\ &= E_{(x, y) \sim D} [ \mathbb{I}\{h(x) \neq y\}] \\ \\ \hat{\epsilon}(h) &: \text{Empirical Risk} \\ &= \frac{1}{m} \sum_{i=1}^{m} \mathbb{I}\{h(x^{(i)}) \neq y^{(i)}\} \\ \\ \epsilon(g) &: \text{Bayes Error / Irreducible Error} \\ \\ \epsilon(h^*) - \epsilon(g) &: \text{Approximation Error} \\ \\ \epsilon(\hat{h}) - \epsilon(h^*) &: \text{Estimation Error} \\ \\ \epsilon(\hat{h}) &= \text{Estimation Error } + \text{Approximation Error } + \text{Irreducible Error} \end{aligned}\]We can further decompose the estimation error into estimation bias and estimation variance:
\[\begin{aligned} \epsilon(\hat{h}) &= \text{Estimation Bias} + \text{Estimation Variance} + \text{Approximation Error} + \text{Irreducible Error} \\ &= \text{Bias} + \text{Variance} + \text{Irreducible Error} \\ \end{aligned}\]where:
\[\begin{aligned} \text{Variance} &= \text{Estimation Variance} \\ \text{Bias} &= \text{Estimation Bias} + \text{Approximation Error} \\ \end{aligned}\]
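As a quick numeric illustration of the first decomposition (the numbers are made up): suppose $\epsilon(g) = 0.02$, $\epsilon(h^*) = 0.07$, and $\epsilon(\hat{h}) = 0.12$. Then:
\[\begin{aligned} \text{Approximation Error} &= \epsilon(h^*) - \epsilon(g) = 0.05 \\ \text{Estimation Error} &= \epsilon(\hat{h}) - \epsilon(h^*) = 0.05 \\ \epsilon(\hat{h}) &= 0.05 + 0.05 + 0.02 = 0.12 \end{aligned}\]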
Empirical Risk Minimization (ERM)
\[\hat{h}_{ERM} = \arg \min_{h \in \mathbb{H}} \frac{1}{m} \sum_{i=1}^{m} \mathbb{I}\{h(x^{(i)}) \neq y^{(i)}\}\]
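A minimal ERM sketch under a toy setup of my own choosing: the hypothesis class is a finite grid of threshold classifiers $h_t(x) = \mathbb{I}\{x \geq t\}$, and $\hat{h}_{ERM}$ is the threshold with the lowest empirical 0-1 risk on the sample.

```python
# A toy ERM example over a finite hypothesis class of threshold classifiers
# h_t(x) = 1{x >= t}: pick the threshold with the lowest empirical 0-1 risk.
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical training data: the true rule is y = 1{x >= 0.3}, plus 10% label noise.
m = 200
x = rng.uniform(0.0, 1.0, size=m)
y = (x >= 0.3).astype(int)
flip = rng.random(m) < 0.1
y[flip] = 1 - y[flip]

thresholds = np.linspace(0.0, 1.0, 21)  # finite class H with k = 21 hypotheses

def empirical_risk(t):
    """Fraction of training points misclassified by h_t(x) = 1{x >= t}."""
    return np.mean((x >= t).astype(int) != y)

risks = np.array([empirical_risk(t) for t in thresholds])
t_hat = thresholds[risks.argmin()]  # the ERM hypothesis
print(f"ERM threshold: {t_hat:.2f}  empirical risk: {risks.min():.3f}")
```

Because $\mathbb{H}$ here is finite ($k = 21$), the uniform convergence bound below applies to it directly.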
Uniform Convergence
Uniform convergence relates two pairs of quantities:
- $\hat{\epsilon}(h)$ vs. $\epsilon(h)$
- $\epsilon(\hat{h})$ vs. $\epsilon(h^*)$
For a fixed hypothesis $h_i \in \mathbb{H}$, Hoeffding's inequality gives:
\[P(\vert \hat{\epsilon}(h_i) - \epsilon(h_i) \vert > \gamma) \leq 2 \exp(-2m\gamma^2)\]Now consider a finite hypothesis class $\mathbb{H}$ with $\vert \mathbb{H} \vert = k$. Applying the union bound over all $k$ hypotheses, we have:
\[\begin{aligned} &P(\exists h \in \mathbb{H}, \vert \hat{\epsilon}_{S}(h) - \epsilon(h) \vert > \gamma) \leq 2k \exp(-2m\gamma^2) \\ \implies &P(\forall h \in \mathbb{H}, \vert \hat{\epsilon}_{S}(h) - \epsilon(h) \vert \leq \gamma) \geq 1 - 2k \exp(-2m\gamma^2) \\ \end{aligned}\]
- $\delta = 2k \exp(-2m\gamma^2)$ is the failure probability.
- $\gamma$ is the margin of error.
- $m$ is the number of samples.
- $\hat{\epsilon}_{S}(h)$ is the empirical risk of $h$ on the sample $S$.
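Setting $\delta = 2k \exp(-2m\gamma^2)$ and solving for $m$ gives the sample complexity $m \geq \frac{1}{2\gamma^2} \log\frac{2k}{\delta}$. A tiny helper (the function name and numbers are mine, for illustration only) evaluates this:

```python
# Invert delta = 2k * exp(-2 m gamma^2) to get the number of samples m that
# guarantees uniform convergence with margin gamma and failure probability delta.
import math

def sample_size(k, gamma, delta):
    """Smallest m satisfying 2k * exp(-2 m gamma^2) <= delta."""
    return math.ceil(math.log(2 * k / delta) / (2 * gamma ** 2))

# Hypothetical example: k = 1000 hypotheses, gamma = 0.05, delta = 0.01
print(sample_size(1000, 0.05, 0.01))  # -> 2442
```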
Tools
Union Bound
\[P(A \cup B) \leq P(A) + P(B)\]More generally, for events $A_1, \ldots, A_k$: $P(A_1 \cup \cdots \cup A_k) \leq \sum_{i=1}^{k} P(A_i)$, which is what the finite-class bound above uses.
Hoeffding’s Inequality
Let $Z_1, Z_2, \ldots, Z_m \sim \text{Bern}(\phi)$ be i.i.d. and $\hat{\phi} = \frac{1}{m} \sum_{i=1}^{m} Z_i$. Then, for any $\gamma > 0$, we have:
\[P(\vert \hat{\phi} - \phi \vert > \gamma) \leq 2 \exp(-2m\gamma^2)\]
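A quick Monte Carlo sanity check of this bound, with assumed values $\phi = 0.3$, $m = 50$, $\gamma = 0.1$ chosen only for illustration:

```python
# Monte Carlo check of Hoeffding's inequality for Bernoulli(phi) samples.
import numpy as np

rng = np.random.default_rng(2)
phi, m, gamma, trials = 0.3, 50, 0.1, 100_000

# Draw `trials` independent datasets of size m and record how often the
# empirical mean deviates from phi by more than gamma.
z = rng.random((trials, m)) < phi
phi_hat = z.mean(axis=1)
freq = np.mean(np.abs(phi_hat - phi) > gamma)
bound = 2 * np.exp(-2 * m * gamma ** 2)

print(f"empirical P(|phi_hat - phi| > gamma) = {freq:.4f}")
print(f"Hoeffding bound                      = {bound:.4f}")
```

The observed frequency typically lands well below the bound, which is expected since Hoeffding's inequality is distribution-free and therefore loose.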
VC Dimension
To handle infinite hypothesis classes, we use the VC dimension: the VC dimension $VC(\mathbb{H})$ of a hypothesis class $\mathbb{H}$ is the size of the largest set of points that $\mathbb{H}$ can shatter (label in every possible way). For example, linear classifiers in $\mathbb{R}^2$ have VC dimension 3. With probability at least $1 - \delta$:
\[\epsilon(\hat{h}) \leq \epsilon(h^*) + O\left(\sqrt{\frac{VC(\mathbb{H})}{m} \log\frac{m}{VC(\mathbb{H})} + \frac{1}{m} \log\frac{1}{\delta}}\right)\]
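To get a feel for how this gap scales, here is a rough sketch (my own; the constant hidden in $O(\cdot)$ is set to 1, so the numbers only show the trend, not an actual guarantee):

```python
# Evaluate the VC generalization-gap term with the hidden O(.) constant set to 1,
# just to see how it shrinks as the sample size m grows.
import math

def vc_gap(m, vc_dim, delta):
    """sqrt( (d/m) * log(m/d) + (1/m) * log(1/delta) ), constant factor ignored."""
    d = vc_dim
    return math.sqrt((d / m) * math.log(m / d) + (1.0 / m) * math.log(1.0 / delta))

for m in (100, 1_000, 10_000, 100_000):
    print(f"m={m:6d}  gap term ~ {vc_gap(m, vc_dim=10, delta=0.05):.3f}")
```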