Concept#
This topic is quite theoretical and heavy, and I do not claim the ability to explain it in an intuitive way. Therefore, most of the content here is adapted from the following sources:
As such, this section serves as a document/reference, and all credit goes to the authors of the above sources.
Some Abuse of Notation#
Remark 27 (Notations)
We will abbreviate:
when the context is clear.
The author used
Problem Statement#
When we have built a classifier, one question people always ask is how good the classifier is. They want to evaluate the classifier. They want to see whether the classifier is able to predict what it is supposed to predict. Oftentimes, the "gold standard" is to report the classification accuracy: give me a testing dataset, and I will tell you how many times the classifier predicted correctly. This is one way of evaluating the classifier. However, does this evaluation method really tell us how good the classifier is? Not clear. All it says is that for this classifier trained on a particular training dataset and tested on a particular testing dataset, the classifier performs such and such. Will it perform well if we train it using another training set, maybe containing more data points? Will it perform well if we test it on a different testing dataset? It seems that we lack a way to quantify the generalization ability of our classifier.
There is another difficulty. When we train the classifier, we can only access the training data but not the testing data. This is like taking an exam: we can never see the exam questions when we study, for otherwise we would defeat the purpose of the exam! Since we only have the training set when we design our classifier, how do we tell whether we have trained a good classifier? Should we choose a more complex model? How many samples do we need? Remember, we cannot use any testing data, so all the evaluation has to be done internally using the training data. How do we do that? Again, we are missing a way to quantify the performance of the classifier.
The objective of this chapter is to answer a few theoretical (and also practical) questions in learning:
Is learning feasible?
How much can a classifier generalize?
What is the relationship between number of training samples and the complexity of the classifier?
How do we tell whether we have obtained a good classifier during the training?
Notations#
Let’s refresh ourselves with some notations.
We have a training dataset
where
Now there is an unknown target function
In any supervised learning scenario, there is always a training set
When we say that we use a machine learning algorithm to learn a classifier, we mean that we have an algorithmic procedure
The drawings in Fig. 15 illustrate a few key concepts we just mentioned. On the left hand side there is an input space

Fig. 15 [Left] Treat the cloud as the entire input space
Fig. 16 illustrates what we called a probabilistic learning model. It is called a probabilistic learning model because there is an unknown distribution

Fig. 16 All essential components of a machine learning model.#
Is Learning Feasible?#
The first question we ask is: Suppose we have a training set
Interestingly, the answer to this question depends on how we define the training samples
Learning Outside the Training Set (Deterministic Case)#
Let us look at the deterministic case. Consider a 3-dimensional input space
How about the number of possible target functions
Here is the learning problem. Can we learn
Since we have seen 6 out of 8 input vectors in
In the table above we write down the final hypothesis function
Since we assert that
The above analysis shows that learning is infeasible if we have a deterministic generator generating the training samples. The argument holds regardless of which learning algorithm $\mathcal{A}$ we use.
Identically and Independently Distributed Random Variables#
In the following section, we will discuss the case from a probabilistic perspective.
However, we will need to assume that the random variables are identically and independently distributed (i.i.d.). This means that the random variables are drawn from the same distribution and are independent of each other.
For formal definition, see Independent and Identically Distributed (IID).
This assumption is ubiquitous in machine learning. Not only does it simplify the analysis, it also underpins many of the theoretical results governing the machine learning framework. It is a strong assumption and not always true in practice, but it is a reasonable one.
See the extracted paragraph from Machine Learning Theory below:
This assumption is essential for us. We need it to start using the tools from probability theory to investigate our generalization probability, and it's a very reasonable assumption because:
It's more likely for a dataset used for inferring about an underlying probability distribution to have all its samples drawn from that same distribution. If this is not the case, then the statistics we get from the dataset will be noisy and won't correctly reflect the target underlying distribution.
It’s more likely that each sample in the dataset is chosen without considering any other sample that has been chosen before or will be chosen after. If that’s not the case and the samples are dependent, then the dataset will suffer from a bias towards a specific direction in the distribution, and hence will fail to reflect the underlying distribution correctly.
Learning Outside the Training Set (Probabilistic Case)#
The deterministic analysis gives us a pessimistic result. Now, let us look at a probabilistic analysis. On top of the training set
Suppose that we pick a hypothesis function
Definition 107 (Zero-One Loss)
The zero-one loss is defined as

$$\ell\big(h(\mathbf{x}), y\big) = \mathbb{1}\big\{h(\mathbf{x}) \ne y\big\},$$

where $\mathbb{1}\{\cdot\}$ denotes the indicator function.
With this,
Definition 108 (In-Sample Error (Empirical Risk Minimization))
Consider a training set $\mathcal{S} = \big\{(\mathbf{x}^{(n)}, y^{(n)})\big\}_{n=1}^{N}$. The in-sample error (empirical risk) of a hypothesis $h$ is

$$E_{\text{in}}(h) \;=\; \hat{\mathcal{R}}_{\mathcal{S}}(h) \;=\; \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}\Big\{h\big(\mathbf{x}^{(n)}\big) \ne y^{(n)}\Big\}.$$
Training error is the amount of error we have during the training process. A good learning algorithm
How about the out-samples? Since we assume that
Definition 109 (Out-Sample Error (Generalization Error))
Consider an input space $\mathcal{X}$ and a distribution $\mathcal{D}$ over the samples. The out-sample error (generalization error) of a hypothesis $h$ is

$$E_{\text{out}}(h) \;=\; \mathcal{R}(h) \;=\; \mathbb{P}_{\mathbf{x} \sim \mathcal{D}}\big[h(\mathbf{x}) \ne f(\mathbf{x})\big] \;=\; \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\Big[\mathbb{1}\big\{h(\mathbf{x}) \ne f(\mathbf{x})\big\}\Big],$$

where $f$ is the unknown target function.
How did we derive from a probability of classifying one sample wrongly to the expectation over the loss function?
Since the indicator of a misclassification is a Bernoulli random variable, its expectation equals the probability of the misclassification event.
Therefore, the relationship between the in-sample error and the out-sample error is that of an empirical average to its expectation: the former is computed on the finite training set, while the latter averages over the entire distribution.
Fig. 17 shows how an in-sample error is computed.

Fig. 17
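To make Definitions 107 to 109 concrete, here is a minimal sketch in Python (everything here is made up for illustration: a 1-D Gaussian input distribution, a threshold target, and a slightly mis-placed threshold hypothesis). It computes the in-sample error on a small training set and approximates the out-sample error with a large fresh sample:

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    # The unknown target function f: label 1 if x > 0, else 0.
    return (x > 0.0).astype(int)

def hypothesis(x):
    # A fixed hypothesis h: thresholds at 0.1 instead of the true 0.
    return (x > 0.1).astype(int)

def zero_one_loss(y_pred, y_true):
    # Zero-one loss: 1 for a mistake, 0 for a correct prediction.
    return (y_pred != y_true).astype(float)

# Training set S: N i.i.d. samples from D = N(0, 1).
N = 100
x_train = rng.normal(size=N)
E_in = zero_one_loss(hypothesis(x_train), target(x_train)).mean()

# Out-sample error E_out(h) = E_D[loss], approximated by a huge fresh sample.
x_fresh = rng.normal(size=1_000_000)
E_out = zero_one_loss(hypothesis(x_fresh), target(x_fresh)).mean()

print(f"E_in(h)  ~ {E_in:.4f}")
print(f"E_out(h) ~ {E_out:.4f}")  # exact value is Phi(0.1) - Phi(0) ~ 0.0398
```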
To this end, we recap what we have:
We have a hypothesis space
We want to know the probability that our learning algorithm will find a function in
In other words, the learning problem at hand is to find a hypothesis
How can we do that? Let’s start with the Law of Large Numbers.
Remark 28 (Notation)
It should be clear from context that
On the other hand,
The Generalization Gap#
Definition 110 (Generalization Gap)
Given a sample set
The Law of Large Numbers#
Recall that in the section Empirical Risk Minimization approximates True Risk Minimization of the chapter on ERM, we mentioned that the Empirical Risk
approximates the True Risk as the number of samples
We restate the Weak Law of Large Numbers from Theorem 55 below, with notation more aligned to the machine learning notations chapter.
Theorem 60 (Weak Law of Large Numbers)
Let $X^{(1)}, \ldots, X^{(N)}$ be i.i.d. random variables with common mean $\mu$ and finite variance.
Let $\bar{X} := \frac{1}{N} \sum_{n=1}^{N} X^{(n)}$ be the sample mean.
Then, for any $\epsilon > 0$,

$$\lim_{N \to \infty} \mathbb{P}\Big[\big|\bar{X} - \mu\big| > \epsilon\Big] = 0.$$

This means that $\bar{X}$ converges to $\mu$ in probability.
In other words, as the sample size $N$ grows, the probability that the sample mean deviates from the true mean by more than $\epsilon$ vanishes.
Then recall that the True Risk Function is defined as:

$$\mathcal{R}(h) := \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}}\Big[\ell\big(h(\mathbf{x}), y\big)\Big],$$

and the Empirical Risk Function is defined as:

$$\hat{\mathcal{R}}_{\mathcal{S}}(h) := \frac{1}{N} \sum_{n=1}^{N} \ell\big(h(\mathbf{x}^{(n)}), y^{(n)}\big).$$
Furthermore, since the samples are drawn i.i.d. from $\mathcal{D}$, the True Risk Function is the expected loss/error over all possible samples. This is because we treat the loss evaluated on a random sample as a random variable in its own right.
Remark 29 (Random Variable is a Function)
The notation might seem overloaded and abused, but since a random variable is in itself a function,
and
Now, define $X^{(n)} := \ell\big(h(\mathbf{x}^{(n)}), y^{(n)}\big)$ for $n = 1, \ldots, N$, so that $\hat{\mathcal{R}}_{\mathcal{S}}(h) = \frac{1}{N}\sum_{n=1}^{N} X^{(n)}$,
which means that the Empirical Risk Function is the average loss/error over all training samples.
Notice that they both have exactly the same form as the Weak Law of Large Numbers, so we can apply the Weak Law of Large Numbers to the True Risk Function and the Empirical Risk Function. This natural application of the Weak Law of Large Numbers allows us to answer the following question (also in (57)):
Remark 30 (Summary)
Given a dataset $\mathcal{S}$ of $N$ i.i.d. samples and a fixed hypothesis $h$, we have

$$\lim_{N \to \infty} \mathbb{P}\Big[\big|\hat{\mathcal{R}}_{\mathcal{S}}(h) - \mathcal{R}(h)\big| > \epsilon\Big] = 0,$$

where the notation of the risks is as defined above.
Well, this is something. At least we know that if we can bump up the number of samples $N$ in our training set, the empirical risk converges (in probability) to the true risk. Can we do better by finding an upper bound on the right hand side of (62)? This bound should be a function of the number of samples $N$, so that we can see how fast the convergence is.
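As a quick sanity check of this convergence (a sketch, modeling the zero-one losses as Bernoulli variables with a made-up true risk of 0.5), we can simulate how the deviation probability shrinks as the sample size grows:

```python
import numpy as np

rng = np.random.default_rng(42)
mu, epsilon, trials = 0.5, 0.05, 10_000  # true risk, tolerance, repetitions

for n in [10, 100, 1_000, 10_000]:
    # Draw `trials` independent "training sets" of n Bernoulli(mu) losses
    # and compute the empirical risk (sample mean) of each.
    empirical_risks = rng.binomial(n, mu, size=trials) / n
    # Fraction of training sets whose empirical risk misses mu by > epsilon.
    p_deviate = np.mean(np.abs(empirical_risks - mu) > epsilon)
    print(f"n={n:>6}: P(|R_hat - R| > {epsilon}) ~ {p_deviate:.4f}")
```

The printed probabilities fall toward zero, but the Law of Large Numbers alone gives no rate; that is exactly what Hoeffding's inequality adds next.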
Hoeffding’s Inequality#
We saw a discussion of this inequality in an earlier chapter; it will help us answer the question above.
We restate Hoeffding's Inequality here, but in the machine learning context:
Theorem 61 (Hoeffding’s Inequality)
Consider a dataset $\mathcal{S} = \big\{(\mathbf{x}^{(n)}, y^{(n)})\big\}_{n=1}^{N}$ drawn i.i.d. from $\mathcal{D}$.
Furthermore, let $h$ be a single, fixed hypothesis chosen before $\mathcal{S}$ is generated.
We can then define a sequence of random variables $X^{(n)} := \ell\big(h(\mathbf{x}^{(n)}), y^{(n)}\big)$, each bounded in $[0, 1]$.
Then for any $\epsilon > 0$,

$$\mathbb{P}\Big[\big|\hat{\mathcal{R}}_{\mathcal{S}}(h) - \mathcal{R}(h)\big| > \epsilon\Big] \le 2 e^{-2 \epsilon^2 N},$$

where $\hat{\mathcal{R}}_{\mathcal{S}}(h)$ is the in-sample (empirical) risk and $\mathcal{R}(h)$ is the out-sample (true) risk.
Remark 31 (Things to Note (Important))
1. Note that both the in-sample error and the out-sample error refer to the same fixed hypothesis $h$, and the samples are drawn from the same distribution $\mathcal{D}$.
2. Important: note carefully that this $h$ is picked even before the dataset $\mathcal{S}$ is generated from $\mathcal{D}$. This is not a trivial concept and requires quite some justification. See Why must $h$ be fixed and defined before generating the dataset $\mathcal{S}$?.
3. The random variable must be bounded between $0$ and $1$, so your loss function must be bounded between $0$ and $1$. This is usually simple to satisfy, since you typically apply a sigmoid/softmax to your output to get a probability between $0$ and $1$ to feed into your loss function.
As the number of training samples $N$ grows, the right hand side decays exponentially, so the probability that the in-sample error deviates far from the out-sample error becomes vanishingly small.
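Here is a small simulation of this statement (a sketch: the fixed hypothesis is represented only through its losses, modeled as Bernoulli variables with a made-up true error of 0.3). The observed deviation probability sits comfortably below the Hoeffding bound:

```python
import numpy as np

rng = np.random.default_rng(7)
epsilon, N, trials = 0.1, 200, 20_000
p = 0.3  # made-up true (out-sample) error of the fixed hypothesis h

# Each row is one dataset of N i.i.d. zero-one losses for the same fixed h.
losses = rng.binomial(1, p, size=(trials, N))
E_in = losses.mean(axis=1)

deviation_prob = np.mean(np.abs(E_in - p) > epsilon)
bound = 2 * np.exp(-2 * epsilon**2 * N)
print(f"P(|E_in - E_out| > {epsilon}) ~ {deviation_prob:.4f}")
print(f"Hoeffding bound: 2 exp(-2 eps^2 N) = {bound:.4f}")
```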
Comparing Hoeffding's Inequality with Chebyshev's Inequality#
Let us take a quick comparison between the Hoeffding inequality and the Chebyshev inequality. The Chebyshev inequality states that
where
If we let
For simplicity let us assume that

Fig. 18 Comparing Hoeffding inequality and Chebyshev inequality to predict the actual probability bound.#
We see that Chebyshev is a more conservative (weaker) bound than Hoeffding for large $N$. This is remarkable, since Hoeffding does not need to know anything about the variance of the random variable; boundedness alone suffices.
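The difference in decay rates is easy to tabulate (a sketch, assuming a loss bounded in $[0, 1]$, whose variance is therefore at most $1/4$):

```python
import numpy as np

epsilon = 0.1
sigma2 = 0.25  # worst-case variance of a [0, 1]-bounded loss (Bernoulli(0.5))

for N in [50, 100, 500, 1_000, 5_000]:
    chebyshev = sigma2 / (N * epsilon**2)        # needs the variance
    hoeffding = 2 * np.exp(-2 * epsilon**2 * N)  # needs only boundedness
    print(f"N={N:>5}: Chebyshev <= {chebyshev:.2e}, Hoeffding <= {hoeffding:.2e}")
```

For small $N$ Chebyshev can be the smaller of the two, but as $N$ grows, Hoeffding's exponential decay wins decisively over Chebyshev's $1/N$ decay.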
Example: Hoeffding’s Inequality in Classification#
Notation may be slightly different from the rest of the section.
The following example is adapted from STAT425.
A powerful feature of Hoeffding's inequality is that it holds regardless of the classifier. Namely, even if we are considering many different types of classifiers, some decision trees, some kNN, some logistic regression, they all satisfy the equation above.
What this means is that the inequality holds for any single classifier $h$, as long as $h$ is fixed in advance.
So even if your loss function is not
PAC Framework#
The probabilistic analysis is called a probably approximately correct (PAC) framework. The name contains three parts:

- Probably: We use the probability $\mathbb{P}\big[\,|E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\,\big]$ as a measure to quantify the error.
- Approximately: The in-sample error is an approximation of the out-sample error. The approximation error is controlled by $\epsilon$.
- Correct: The error is bounded by the right hand side of the Hoeffding inequality, $2 e^{-2\epsilon^2 N}$. The accuracy is controlled by $N$ for a fixed $\epsilon$.
Hoeffding's Inequality is Invalid for a Hypothesis Learnt from $\mathcal{S}$#
Now, there is one last problem we need to resolve. The above Hoeffding inequality holds only for a single, fixed hypothesis function $h$; it says nothing about the hypothesis our learning algorithm selects after seeing the data.
Now this is a non-trivial problem, and to fully understand this requires close scrutiny.
Recall that in Remark 31's point 2, we said that $h$ must be picked before the dataset $\mathcal{S}$ is generated.
Here are some links that explain in detail where exactly the problem lies (three related discussion threads). All in all, one just needs to know that the fixed-hypothesis assumption no longer holds once the hypothesis is selected using the data:

1. Why is Hoeffding's inequality invalid if we use the learnt hypothesis $h_S$ instead of a fixed $h$?
2. Why is Hoeffding's inequality invalid if we use the learnt hypothesis $h_S$ instead of a fixed $h$?
3. Why is Hoeffding's inequality invalid if we use the learnt hypothesis $h_S$ instead of a fixed $h$?
This has serious implications! The logic that follows is that before learning, the Hoeffding guarantee applies to each fixed $h \in \mathcal{H}$ individually, but not yet to whichever hypothesis learning will select.
Let's see how we can make use of this property to bound the error for the entire hypothesis set $\mathcal{H}$.
Union Bound#
Suppose that $\mathcal{H}$ is a finite hypothesis set, say $\mathcal{H} = \{h_1, h_2, \ldots, h_M\}$.
Remark 32 (Bounding the Entire Hypothesis Set)
This part helps visualize why you need to use union bound to bound all hypotheses in
This is best read together with the example in Wikipedia.
First, we must establish that for the
Second, let us define the bad event
which is the event that the error is greater than
Then it follows that
We define the good event
which is the event that the error is less than or equal to
Third, we want to show that
In other words, denote the event
which is the event that all
For example, if
However, to make use of Union Bound (Boole’s Inequality), we need to express the
Fourth, let
Then, finding
where we invoked that
Fifth, so now we can instead find the equivalent of the Hoeffding Inequality for the event
where we invoked the Union Bound (Boole’s Inequality).
Last, in order for the entire hypothesis space to have a generalization gap bigger than
Where
where
Therefore, if we bound each
then the overall bound on
To see why
The LHS is event
Thus, we found a bound for the whole hypothesis set $\mathcal{H}$.
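A small simulation can illustrate this union-bound argument (a sketch with made-up numbers: $M = 20$ fixed hypotheses whose losses are Bernoulli with arbitrary true errors):

```python
import numpy as np

rng = np.random.default_rng(3)
M, N, epsilon, trials = 20, 500, 0.075, 4_000

# M fixed hypotheses, each with a made-up true error p_m; the losses of
# hypothesis m on a dataset are N i.i.d. Bernoulli(p_m) variables.
p = rng.uniform(0.2, 0.8, size=M)

hits = 0
for _ in range(trials):
    losses = rng.binomial(1, p[:, None], size=(M, N))  # one dataset, all M hypotheses
    E_in = losses.mean(axis=1)
    hits += np.any(np.abs(E_in - p) > epsilon)         # does the worst one deviate?

print(f"P(sup_m |E_in(h_m) - E_out(h_m)| > {epsilon}) ~ {hits / trials:.4f}")
print(f"Union bound: 2 M exp(-2 eps^2 N) = {2 * M * np.exp(-2 * epsilon**2 * N):.4f}")
```

The empirical worst-case deviation probability typically lands well below the union bound, a first hint of the over-counting that the VC analysis later fixes.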
The content below is what I initially wrote, but the argument is hand-wavy, so I tried to lay out the argument more formally above. I am still not fully sure the reasoning is correct, but it should be along those lines.
If one has
If we can find this probability, then this just means that whichever
We are stating the problem by asking the event “all
Now, if we go the route of finding complement, then this objective is stated as:
If one has
These are two equivalent statements, and we can use either one.
Now, we can readily use the union bound, since our expression is now in the form of "at least one hypothesis has a large generalization gap".
Remark 33 (Major Confusion Alert)
DEFUNCT AS OF 1ST MARCH, 2023, READ WITH CAUTION AS IT IS NOT CORRECT. I AM LEAVING IT HERE FOR HISTORICAL PURPOSES.
What this means is that out of
This is why the question boils down to calculating the following probability:
That is the probability that the least upper bound (that is the supremum
In more layman's terms, every
Framing Learning Theory with Hoeffding’s Inequality#
Theorem 62 (Learning Theory)
Consider a dataset $\mathcal{S}$ of $N$ i.i.d. samples and a finite hypothesis set $\mathcal{H}$. Then, for any $\epsilon > 0$,

$$\mathbb{P}\Big[\sup_{h \in \mathcal{H}} \big|E_{\text{in}}(h) - E_{\text{out}}(h)\big| > \epsilon\Big] \le 2\,|\mathcal{H}|\, e^{-2\epsilon^2 N},$$

where $|\mathcal{H}|$ is the number of hypotheses in $\mathcal{H}$.
Feasibility from the Two Viewpoints#
The deterministic analysis shows that learning is infeasible, whereas the probabilistic analysis shows that learning is feasible. Are they contradictory? If we look at them closely, we realize that there is in fact no contradiction. Here are the reasons.
- Guarantee and possibility. If we want a deterministic answer, then the question we ask is "Can $\mathcal{S}$ tell us something certain about $f$ outside $\mathcal{S}$?" In this case the answer is no, because if we have not seen the example, there is always uncertainty about the true $f$. If we want a probabilistic answer, then the question we ask is "Can $\mathcal{S}$ tell us something possibly about $f$ outside $\mathcal{S}$?" In this case the answer is yes.
- Role of the distribution. There is one common distribution $\mathcal{D}$ which generates both the in-samples and the out-samples. Thus, whatever $\mathcal{D}$ we use to generate $\mathcal{S}$, we must use it to generate the testing samples. The testing samples are not inside $\mathcal{S}$, but they come from the same distribution. Also, all samples are generated independently, so that we have i.i.d. when using the Hoeffding inequality.
- Learning goal. The ultimate goal of learning is to make the out-sample error small. However, in order to establish this result, we need two levels of approximation:
  1. The first approximation is made by the Hoeffding inequality, which ensures that for sufficiently large $N$, we can approximate the out-sample error by the in-sample error computed from the examples in $\mathcal{S}$.
  2. The second approximation is to make the in-sample error, i.e., the training error, small. This requires a good hypothesis set and a good learning algorithm.
Complex Hypothesis Set and Complex Target Function#
The results earlier tell us something about the complexity of the hypothesis set $\mathcal{H}$ and of the target function $f$:

- More complex $\mathcal{H}$? If $\mathcal{H}$ is complex with a large $|\mathcal{H}|$, then the approximation by the Hoeffding inequality becomes loose. Remember, the (union-bounded) Hoeffding inequality states that

  $$\mathbb{P}\Big[\big|E_{\text{in}}(h) - E_{\text{out}}(h)\big| > \epsilon\Big] \le 2\,|\mathcal{H}|\, e^{-2\epsilon^2 N}.$$

  As $|\mathcal{H}|$ grows, the upper bound on the right hand side becomes loose, and so we run the risk that $E_{\text{in}}$ deviates from $E_{\text{out}}$. In other words, if $|\mathcal{H}|$ is large, then the right hand side will be very big and the bound becomes meaningless; it is like saying your deviation is less than some astronomically large number (an exaggeration), which is of course true. However, if $|\mathcal{H}|$ is large, we have more candidate hypotheses to choose from, and so the second approximation about the training error improves: the training error goes down. This gives the following relationship. Where is the optimal trade-off? This requires more investigation.
- More complex $f$? If the target function $f$ is complex, we will suffer from not being able to push the training error down. This makes the second approximation difficult. However, since the complexity of $f$ has no influence on the Hoeffding inequality, the first approximation is unaffected. This gives us a tension: trying to improve the training error by increasing the complexity of $\mathcal{H}$ needs to pay a price. If $\mathcal{H}$ becomes complex, then the first approximation (generalization) will be hurt.
To this end, this definitely looks very similar to the bias-variance trade-off, which is often discussed in many machine learning courses. We will get to that later!
VC-Analysis#
The objective of this section is to go further into the analysis of the Hoeffding inequality and derive something called the generalization bound. There are two parts to our discussion.

The first part is easy: rewrite the Hoeffding inequality into the form of a "confidence interval" or "error bar". This will allow us to interpret the result better.

The second part is to replace the constant $|\mathcal{H}|$ in the Hoeffding inequality by something smaller. This will allow us to derive something more meaningful. Why do we want to do that? What could go wrong with $|\mathcal{H}|$? Remember that $|\mathcal{H}|$ is the number of hypotheses in $\mathcal{H}$. If $\mathcal{H}$ is a finite set, then everything is fine, because the exponentially decaying function in the Hoeffding inequality will eventually override the constant $|\mathcal{H}|$. However, for almost any practical model, $\mathcal{H}$ is infinite. Think of the perceptron: if we slightly perturb the decision boundary by an infinitesimal translation, we get a different hypothesis, so there are infinitely many hypotheses, although they could be very similar to each other. If $|\mathcal{H}|$ is infinite, then the probability bound offered by the Hoeffding inequality can be bigger than 1, which is valid but meaningless. To address this issue we need to learn a concept called the VC dimension.
Generalization Bound#
We first see that Hoeffding's inequality and the union bound, as combined in (63), give us the following result:

$$\mathbb{P}\Big[\big|E_{\text{in}}(h) - E_{\text{out}}(h)\big| > \epsilon\Big] \le 2\,|\mathcal{H}|\, e^{-2\epsilon^2 N}.$$

Now, we can rewrite the above inequality. We first define

$$\delta := 2\,|\mathcal{H}|\, e^{-2\epsilon^2 N},$$

the probability of the bad event. This means that, with probability at least $1 - \delta$, the good event holds:

$$\big|E_{\text{in}}(h) - E_{\text{out}}(h)\big| \le \epsilon,$$

where we can say, with a confidence of $1 - \delta$, that

$$E_{\text{in}}(h) - \epsilon \;\le\; E_{\text{out}}(h) \;\le\; E_{\text{in}}(h) + \epsilon,$$

where this is a result of the triangle inequality.
If we can express $\epsilon$ in terms of $\delta$, $N$, and $|\mathcal{H}|$, we arrive at the following theorem.
Theorem 63 (Generalization Bound)
Consider a learning problem where we have a dataset $\mathcal{S}$ of $N$ i.i.d. samples and a finite hypothesis set $\mathcal{H}$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, every $h \in \mathcal{H}$ satisfies

$$E_{\text{in}}(h) - \sqrt{\frac{1}{2N} \ln \frac{2|\mathcal{H}|}{\delta}} \;\le\; E_{\text{out}}(h) \;\le\; E_{\text{in}}(h) + \sqrt{\frac{1}{2N} \ln \frac{2|\mathcal{H}|}{\delta}}.$$
The inequality given by (65) is called the generalization bound, which we can consider as an "error bar". There are two sides to the generalization bound:

- (Upper Bound.) The upper bound gives us a safeguard of how much worse $E_{\text{out}}(h)$ can be compared to $E_{\text{in}}(h)$. It says that the unknown quantity $E_{\text{out}}(h)$ will not be significantly higher than $E_{\text{in}}(h)$; the amount is specified by the error bar.
- (Lower Bound.) The lower bound tells us what to expect. It says that the unknown quantity $E_{\text{out}}(h)$ cannot be much better than $E_{\text{in}}(h)$ minus the error bar.

To make sense of the generalization bound, we need to ensure that the error bar vanishes as $N$ grows. To conclude: this is our first generalization bound. It states that the generalization error is upper bounded by the training error plus a function of the hypothesis space size and the dataset size. We can also see that the bigger the hypothesis space gets, the bigger the generalization error becomes.
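A tiny helper makes the error bar tangible (a sketch; $M = |\mathcal{H}|$ is assumed finite):

```python
import numpy as np

def error_bar(N, M, delta):
    """Error bar of the finite-hypothesis generalization bound:
    with probability >= 1 - delta, E_out <= E_in + sqrt(ln(2M/delta) / (2N))."""
    return np.sqrt(np.log(2 * M / delta) / (2 * N))

delta = 0.05
for M in [1, 100, 10_000]:
    for N in [100, 1_000, 10_000]:
        print(f"|H|={M:>6}, N={N:>6}: E_out <= E_in + {error_bar(N, M, delta):.3f}")
```

Notice how the bar grows only logarithmically in $|\mathcal{H}|$ but shrinks as $1/\sqrt{N}$.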
Remark 34 (What if the Hypothesis Space is Infinite?)
For a linear hypothesis of the form
There is something missing from the picture. Let’s take a deeper look at the generalization bound.
It is good to stop here and read the sections Examining the Independence Assumption and The Symmetrization Lemma in the post written by Mostafa before proceeding.
The Growth Function#
To resolve the issue of having an infinite
The union bound is tight (the "=" case) only when the bad events involved are disjoint. In our setting, many hypotheses in $\mathcal{H}$ are very similar to each other, so their bad events overlap heavily and the union bound grossly over-counts.
We need some tools to handle the overlapping situation. To do so we introduce two concepts. The first concept is called the dichotomy, and the second concept is called the growth function. Dichotomies will define a growth function, and the growth function will allow us replace
Consider a dataset containing
More generally, the definition of
Definition 111 (Dichotomy)
Let
The above definition suggests that
Suppose there are
Typo: I think the bottom left
What if we move

Fig. 19 For a fixed configuration of
Now we want to define a quantity that measures the number of dichotomies. This quantity should be universal for any configuration of the $N$ data points.
Definition 112 (Growth Function)
The growth function for a hypothesis set
where
Example 22 (Example of Growth Function)
For example,
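We can get a feel for the growth function of the 2D perceptron by brute force (a sketch: random search over linear classifiers only lower-bounds the true count, but for points in general position it recovers the familiar numbers $2, 4, 8, 14$):

```python
import numpy as np

rng = np.random.default_rng(1)

def count_dichotomies(points, n_draws=100_000):
    # Randomly sample linear classifiers h(x) = sign(w . x + b) and record
    # which labelings (dichotomies) of `points` they realize. Random search
    # can only miss dichotomies, so this is a lower bound on the true count.
    W = rng.normal(size=(n_draws, 2))
    b = rng.normal(size=(n_draws, 1))
    labels = (W @ points.T + b > 0).astype(np.uint8)  # (n_draws, N)
    return len({tuple(row) for row in labels})

for N in [1, 2, 3, 4]:
    pts = rng.normal(size=(N, 2))  # N random points, in general position w.p. 1
    print(f"N={N}: {count_dichotomies(pts)} dichotomies found (2^N = {2**N})")
```

The jump from $8 = 2^3$ at $N = 3$ to only $14 < 16$ at $N = 4$ is precisely the shattering failure discussed below.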
So what is the difference between
If a hypothesis set
Definition 113 (Restricted Hypothesis Space)
By only choosing the distinct effective hypotheses on the dataset
Ah, what is the consequence of all this? Can we do better with the generalization bound (65) defined in Theorem 63?
The most straightforward step is to replace the constant $|\mathcal{H}|$ with the number of effective hypotheses, as counted by the growth function.
Since we know that the growth function satisfies $m_{\mathcal{H}}(N) \le 2^N$, a natural attempt is to upper bound $m_{\mathcal{H}}(N)$ by $2^N$. However, this will not help us, because $2^N$ grows exponentially: for large $N$ (and small $\epsilon$), the factor $2^N$ overwhelms the exponential decay $e^{-2\epsilon^2 N}$ in the Hoeffding bound. Therefore, as $N$ grows, the resulting bound remains meaningless. Furthermore, even though the restricted hypothesis space is now finite, simply substituting the growth function into the union-bound argument requires additional technical justification (see the symmetrization lemma mentioned earlier).
The VC Dimension#
Can we find a better upper bound on the growth function $m_{\mathcal{H}}(N)$? This is where the VC dimension enters.
Definition 114 (Shattering)
If a hypothesis set
In other words, a set of data points
Definition 115 (VC Dimension of $\mathcal{H}$)
The Vapnik-Chervonenkis dimension of a hypothesis set
Example 23 (VC Dimension of a 2D Perceptron)
For example, consider the
We can start with
Suppose
Suppose
The following animation shows how many ways a linear classifier in 2D can label 3 points (on the left) and 4 points (on the right).

Fig. 20 Image Credit: Machine Learning Theory by Mostafa.#
Actually, no linear classifier in 2D can shatter any set of 4 points, not just that particular set, because there will always be two labellings that cannot be produced by a linear classifier. But why? See the image below.

Fig. 21 Image Credit: Machine Learning Theory by Mostafa.#
If we arranged the points in a rectangular way like the one in the image, then let the
blue dot be class
Moreover, no matter how you arrange these 4 points, whether in a line, a rectangle, an S-shape or any other arrangement, there will always be at least 2 dichotomies that cannot be generated by a linear classifier. Maybe you could draw some arrangements for yourself and see! Come to think of it, even if you arrange your 4 points in a circle or Z-shape, the arrangement will always have 4 "corners" which resemble the 4 points in the image above.
We will soon see how this fact, the impossibility of shattering 4 points, determines the VC dimension of the 2D perceptron, and how the argument scales to higher dimensions.
First, let’s find a general formula for the VC dimension of a perceptron algorithm.
Theorem 64 (VC Dimension of a Perceptron)
In general, if we have a high-dimensional perceptron algorithm, we can show this:
Consider the input space $\mathcal{X} = \mathbb{R}^{d}$. Then the VC dimension of the perceptron hypothesis set is $d_{\mathrm{VC}} = d + 1$.
Proof. See page 16 of ECE595: Learning Theory
Sauer’s Lemma and Bounding the Growth Function#
Now that we have the VC dimension, we can bound the growth function. The following lemma shows that whenever the VC dimension is finite, the growth function is bounded by a polynomial in $N$.
Lemma 2 (Sauer’s Lemma)
Let
Proof. Proof can be found in Theorem
The bound on the growth function provided by Sauer's Lemma is indeed much better than the exponential bound $2^N$ we already have, because it is polynomial in $N$.
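Computing Sauer's bound next to $2^N$ makes the difference obvious (a sketch, using $d_{\mathrm{VC}} = 3$ as for the 2D perceptron):

```python
from math import comb

def sauer_bound(N, d_vc):
    # Sauer's Lemma: m_H(N) <= sum_{i=0}^{d_vc} C(N, i), a polynomial in N.
    return sum(comb(N, i) for i in range(d_vc + 1))

d_vc = 3  # e.g., the 2D perceptron
for N in [3, 4, 10, 50, 100]:
    print(f"N={N:>3}: Sauer bound = {sauer_bound(N, d_vc):>10,}  vs  2^N = {2**N:,}")
```

At $N = 4$ the bound gives $15$, consistent with the $14$ dichotomies counted earlier.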
Using this result, we can show that

$$m_{\mathcal{H}}(N) \le N^{d_{\mathrm{VC}}} + 1,$$

where $d_{\mathrm{VC}}$ is the VC dimension of $\mathcal{H}$.
If we substitute this polynomial bound into the generalization analysis, the exponential decay in $N$ eventually dominates the polynomial, so the bound becomes meaningful whenever $d_{\mathrm{VC}}$ is finite.
How do we interpret the VC dimension? The VC dimension can be informally viewed as the effective number of parameters of a model. A higher VC dimension means a more complex model, and hence a more diverse hypothesis set $\mathcal{H}$. As a result, the growth function $m_{\mathcal{H}}(N)$ behaves in one of two ways:

- If $d_{\mathrm{VC}}$ is finite, the generalization gap goes to zero as $N$ grows, because the polynomial growth function is eventually dominated by the exponential decay. If this is the case, then the final hypothesis will generalize. Such a generalization result holds independent of the learning algorithm $\mathcal{A}$, independent of the input distribution $\mathcal{D}$, and independent of the target function $f$. It only depends on the hypothesis set $\mathcal{H}$ and the training examples.
- If $d_{\mathrm{VC}}$ is infinite, this means that the hypothesis set $\mathcal{H}$ is as diverse as it can be, and it is not possible to generalize in this sense. The generalization gap is not guaranteed to go to zero.
Are we all set about the generalization bound? It turns out that we need some additional technical modifications to ensure the validity of the generalization bound. We shall not go into the details but just state the result.
Theorem 65 (The VC Generalization Bound)
For any tolerance $\delta > 0$,

$$E_{\text{out}}(g) \;\le\; E_{\text{in}}(g) + \sqrt{\frac{8}{N} \ln \frac{4\, m_{\mathcal{H}}(2N)}{\delta}}$$

with probability at least $1 - \delta$.
To this end, we have more or less answered the question “Is Learning Feasible?”.
Interpreting the Generalization Bound#
The VC generalization bound in (68) is universal in the sense that it applies to all hypothesis sets, learning algorithms, input distributions, and target functions. As the price of this universality, the bound is loose, for several reasons:

1. The Hoeffding inequality has a slack. The inequality holds for all values of $E_{\text{out}}$. However, the behavior of the loss variable could be very different at different values, e.g., at 0 or at 0.5. Using one bound to capture both cases results in some slack.
2. The growth function $m_{\mathcal{H}}(N)$ gives the worst-case scenario of how many dichotomies there are. If we draw the data points at random, it is unlikely that we will land on the worst case, so the typical number of dichotomies could be far fewer.
3. Bounding the growth function by a polynomial via Sauer's Lemma introduces further slack.
Therefore, the VC generalization bound should only be used as a rough guideline for understanding how well the learning algorithm generalizes.
Sample Complexity#
Sample complexity concerns the number of training samples $N$ needed to achieve a target generalization performance.
Fix a tolerance $\delta$ and a target error bar $\epsilon$; we ask how large $N$ must be to achieve them.
Rearranging the terms yields
Example 24 (Sample Complexity)
Suppose
If we plug in
If we repeat the calculation by plugging in
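A sketch of this kind of calculation, assuming the commonly used recursion $N \ge \frac{8}{\epsilon^2} \ln \frac{4\left((2N)^{d_{\mathrm{VC}}} + 1\right)}{\delta}$ with illustrative values $\epsilon = 0.1$ and $\delta = 0.1$ (the exact numbers in Example 24 may differ). Since $N$ appears on both sides, we solve by fixed-point iteration:

```python
import numpy as np

def sample_complexity(epsilon, delta, d_vc, n0=1_000.0):
    # Solve N >= (8 / eps^2) * ln(4 * ((2N)^d_vc + 1) / delta)
    # by fixed-point iteration; the recursion settles quickly.
    N = n0
    for _ in range(100):
        N = (8 / epsilon**2) * np.log(4 * ((2 * N) ** d_vc + 1) / delta)
    return int(np.ceil(N))

for d_vc in [3, 4, 5]:
    print(f"d_vc={d_vc}: N ~ {sample_complexity(0.1, 0.1, d_vc):,}")
```

For $d_{\mathrm{VC}} = 3$ this lands near 30,000 samples, illustrating the rule of thumb that the required $N$ grows roughly in proportion to $d_{\mathrm{VC}}$.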
Model Complexity#
The other piece of information that can be obtained from the generalization bound is how complex the model could be. If we look at the generalization bound, we realize that the error bar depends not only on $N$ but also on the complexity of the hypothesis set, through $d_{\mathrm{VC}}$.
If we replace
The three factors at play are:

- $d_{\mathrm{VC}}$: The VC dimension controls the complexity of the model. As $d_{\mathrm{VC}}$ grows, the in-sample error drops, because a larger $d_{\mathrm{VC}}$ means a more complex model that can fit the training data better. However, the model-complexity term grows as $d_{\mathrm{VC}}$ grows: with a very complex model it is more difficult to generalize to the out-samples. The trade-off between model complexity and generalization is shown in Fig. 22. The blue curve represents the in-sample error, which drops as $d_{\mathrm{VC}}$ increases. The red curve represents the model complexity, which increases with $d_{\mathrm{VC}}$. The black curve is the out-sample error. There exists an optimal model complexity at which the out-sample error is minimized.
- $N$: A large number of training samples always helps the generalization bound, as reflected by the fact that the error bar goes to zero as $N \to \infty$.
- $\delta$: The confidence level tells us how harsh we want the generalization guarantee to be. If we want a very high confidence, e.g., 99%, then we need a very small $\delta$, namely 0.01. This in turn affects the number of training samples required to achieve the confidence level and the desired error bar.

Fig. 22 The VC generalization bound suggests a trade-off between model complexity and generalization. If we use a more complex model, the in-sample error drops but the out-sample error increases. The optimal model complexity is determined when the out-sample error is minimized.#
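To see the trade-off numerically, we can evaluate the VC penalty term for increasing $d_{\mathrm{VC}}$ at a fixed $N$ (a sketch; the in-sample error curve here is entirely made up, purely to reproduce the U-shape of Fig. 22):

```python
import numpy as np

def vc_penalty(N, delta, d_vc):
    # Model-complexity term of the VC generalization bound:
    # Omega = sqrt((8 / N) * ln(4 * ((2N)^d_vc + 1) / delta))
    return np.sqrt((8 / N) * np.log(4 * ((2 * N) ** d_vc + 1) / delta))

N, delta = 10_000, 0.05
for d_vc in [1, 2, 5, 10, 20]:
    # Made-up, monotonically decreasing in-sample error, for illustration only.
    E_in = 0.5 / (1 + d_vc)
    omega = vc_penalty(N, delta, d_vc)
    print(f"d_vc={d_vc:>2}: E_in ~ {E_in:.3f}, penalty ~ {omega:.3f}, "
          f"bound ~ {E_in + omega:.3f}")
```

The bound first decreases and then increases with $d_{\mathrm{VC}}$: the in-sample error and the penalty pull in opposite directions, exactly as the figure suggests.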
Testing Data#
The VC analysis provides us a good guideline to train a model. However, the estimate provided by the VC generalization bound is very loose, since it goes through the worst-case growth function. A complementary route is to certify the final hypothesis on held-out testing data.

Since in the testing phase the final hypothesis $g$ is already fixed (it does not depend on the testing data), the plain Hoeffding inequality applies with a single hypothesis, and the generalization bound becomes

$$E_{\text{out}}(g) \;\le\; E_{\text{test}}(g) + \sqrt{\frac{1}{2 N_{\text{test}}} \ln \frac{2}{\delta}}$$

with probability at least $1 - \delta$.
Therefore, as the number of testing samples increases, we can certify the out-sample error by evaluating the test error $E_{\text{test}}(g)$.
There are a few reminders about using the testing data:
1. The common notion of testing accuracy is the test error $E_{\text{test}}$, calculated based on the testing samples. A good-looking $E_{\text{test}}$ alone does not imply that we will generalize well: if we change to another testing dataset, $E_{\text{test}}$ will change, because it is a numerical value based on an empirical sum. What is guaranteed by the generalization bound is that as long as the number of testing samples is sufficiently large, $E_{\text{test}}$ will stay close to $E_{\text{out}}$ no matter which particular testing dataset we use. There is a variance associated with $E_{\text{test}}$, and this variance is reflected by the error bar.
2. The testing data has to be used after the hypothesis is determined. If we ever use the testing data as feedback to re-select the hypothesis, then it is cheating. For example, we cannot train an SVM, submit it to a competition website, and mark the misclassified samples to re-design the SVM.
In principle the generalization bound improves when we have more testing samples. However, most practical datasets come as a single pool of training data points with no separate testing data points. We can partition the training set into training and validation subsets. The proportion of training and validation needs to be carefully chosen: if we allocate too many samples for validation purposes, then we lose our ability to train a good classifier.
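Since the final hypothesis is fixed during testing, the error bar on the test error needs no union bound (a sketch of the resulting test-set error bar):

```python
import numpy as np

def test_error_bar(N_test, delta):
    # For the single, already-fixed final hypothesis g, plain Hoeffding
    # applies: |E_test(g) - E_out(g)| <= sqrt(ln(2/delta) / (2 N_test)).
    return np.sqrt(np.log(2 / delta) / (2 * N_test))

for N_test in [100, 1_000, 10_000]:
    print(f"N_test={N_test:>6}: E_out within +/- {test_error_bar(N_test, 0.05):.3f} "
          f"of E_test (with prob. >= 95%)")
```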
Why must $h$ be fixed and defined before generating the dataset $\mathcal{S}$?#
Question and Answer from MathStackExchange.
Some intuition on the difference between a-priori and a-posteriori:#
Let
A-priori:#
Let’s a-priori pick an index
where the equality
A-posteriori:#
Now let’s a-posteriori pick the index
and so this probability is much larger than 1/16, even though
On the other hand, we know by the union bound that this a-posteriori probability can exceed the fixed-index probability by at most a factor of the number of indices,
and indeed the probability above respects this union bound.
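The coin-flipping intuition is easy to simulate (a sketch; the $1/16$ in the text suggests four flips of a fair coin per index, and we assume, purely for illustration, four indices):

```python
import numpy as np

rng = np.random.default_rng(0)
n_coins, n_flips, trials = 4, 4, 200_000

flips = rng.integers(0, 2, size=(trials, n_coins, n_flips))  # 1 = heads
all_heads = flips.sum(axis=2) == n_flips                     # (trials, n_coins)

# A-priori: fix coin 0 before flipping; P(all heads) should be ~ 1/16.
print(f"a-priori (fixed coin):    {all_heads[:, 0].mean():.4f}  (1/16 = {1/16:.4f})")

# A-posteriori: after flipping, ask whether *some* coin happened to do best.
print(f"a-posteriori (best coin): {all_heads.any(axis=1).mean():.4f}  "
      f"(1 - (15/16)^4 = {1 - (15/16)**4:.4f})")
```

The a-posteriori probability is several times larger than the a-priori one, yet still below the union bound of $4 \times 1/16$, mirroring the argument above.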
Your specific setup#
As in my above comment, I believe we need the following extra assumptions: We have a finite set
So for a given function
For each
For any fixed
and it is nice that the bound on the right-hand-side does not depend on the index
Your specific questions#
1) Your first question asks why Hoeffding requires a fixed function $h$ rather than a randomly chosen (data-dependent) function. It is because Hoeffding applies to i.i.d. random variables. If we fix an index, then the corresponding indicators are i.i.d. across the samples. If we have a random index, then the indicators are not necessarily independent, because they share a common random index.
2-4) Your remaining questions boil down to the difference between a-priori probability and a-posteriori probability. The Hoeffding bounds in (Eq. 1) are a-priori bounds that hold for all
If we a-priori pick a random index
where equality (a) holds because the random index
However, if we observe the results of the data and then a-posteriori pick a random index
Exercise: Code up the Hoeffding inequality with a plot.
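A minimal sketch of this exercise (assuming NumPy and Matplotlib, with losses modeled as fair Bernoulli variables so that $E_{\text{out}} = 0.5$, and $\epsilon = 0.05$):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
epsilon, p, trials = 0.05, 0.5, 20_000
Ns = np.arange(50, 1001, 50)

empirical, bound = [], []
for N in Ns:
    # For a fixed h, draw `trials` datasets of N Bernoulli(p) losses each.
    E_in = rng.binomial(N, p, size=trials) / N
    empirical.append(np.mean(np.abs(E_in - p) > epsilon))
    bound.append(2 * np.exp(-2 * epsilon**2 * N))

plt.semilogy(Ns, empirical, marker="o", label="empirical P(|E_in - E_out| > eps)")
plt.semilogy(Ns, bound, linestyle="--", label="Hoeffding bound 2 exp(-2 eps^2 N)")
plt.xlabel("number of samples N")
plt.ylabel("probability")
plt.legend()
plt.show()
```

Both curves decay exponentially in $N$, with the bound sitting above the empirical probability, looser for small $N$ and tighter in rate as $N$ grows.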
Further Readings#
This is a difficult topic to learn. I recommend the following resources:
Mohri, Mehryar, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT press, 2012.
Shalev-Shwartz, Shai, and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
Abu-Mostafa, Yaser S., Malik Magdon-Ismail, and Hsuan-Tien Lin. Learning from Data: A Short Course. AMLBook, 2012.
Mitchell, Tom M. Machine Learning. McGraw-Hill, New York, 1997.
Prof. Chan's video lectures, ECE595: Learning Theory.
https://d2l.ai/chapter_linear-classification/generalization-classification.html
Related discussion threads:

1. Why is Hoeffding's inequality invalid if we use the learnt hypothesis $h_S$ instead of a fixed $h$? (four related threads)
2. Why is the Hoeffding's Inequality correct in Machine Learning?
[1] Sometimes $f$ is denoted as $c$, the concept function.