Concept#

This topic is quite theoretical and heavy, and I do not have the ability to explain it in an intuitive way. Therefore, most of the content here is adapted from the following sources:

This section therefore serves as a reference document, and all credit goes to the authors of the above sources.

Some Abuse of Notations#

Remark 27 (Notations)

We will abbreviate:

$$\mathcal{R}_{\mathcal{D}}\{\mathcal{L}((\mathbf{x}, y), h)\} := \mathcal{R}_{\mathcal{D}}\{h\}, \qquad \mathbb{P}_{\mathcal{D}} := \mathcal{D}$$

when the context is clear.

The author uses g to denote the final hypothesis learnt by the algorithm; we will, however, use hS. Therefore, some images from the author will show g instead of hS.

Problem Statement#

When we have built a classifier, one question people always ask is how good the classifier is. They want to evaluate the classifier. They want to see whether the classifier is able to predict what it is supposed to predict. Oftentimes, the "gold standard" is to report the classification accuracy: give me a testing dataset, and I will tell you how many predictions the classifier got correct. This is one way of evaluating the classifier. However, does this evaluation method really tell us how good the classifier is? Not clear. All it says is that for this classifier trained on a particular training dataset and tested on a particular testing dataset, the classifier performs such and such. Will it perform well if we train the classifier using another training set, maybe containing more data points? Will it perform well if we test it on a different testing dataset? It seems that we lack a way to quantify the generalization ability of our classifier.

There is another difficulty. When we train the classifier, we can only access the training data but not the testing data. This is like taking an exam. We can never see the exam questions when we study, for otherwise we will defeat the purpose of the exam! Since we only have the training set when we design our classifier, how do we tell whether we have trained a good classifier? Should we choose a more complex model? How many samples do we need? Remember, we cannot use any testing data and so all the evaluation has to be done internally using the training data. How to do that? Again, we are missing a way to quantify the performance of the classifier.

The objective of this chapter is to answer a few theoretical (and also practical) questions in learning:

  1. Is learning feasible?

  2. How much can a classifier generalize?

  3. What is the relationship between number of training samples and the complexity of the classifier?

  4. How do we tell whether we have obtained a good classifier during the training?

Notations#

Let’s refresh ourselves with some notations.

We have a training dataset S sampled i.i.d. from the underlying and unknown distribution D.

$$S = \left\{(\mathbf{x}^{(1)}, y^{(1)}), (\mathbf{x}^{(2)}, y^{(2)}), \ldots, (\mathbf{x}^{(N)}, y^{(N)})\right\} \overset{\text{i.i.d.}}{\sim} \mathcal{D}$$

where $\mathbf{x}^{(n)} \in \mathbb{R}^D$ is the input vector and $y^{(n)} \in \mathcal{Y}$ is the corresponding label. We call x(n) the n-th input vector and y(n) the n-th label. We call S the training dataset. We call D (PD) the underlying distribution.

Now there is an unknown target function $f: \mathcal{X} \to \mathcal{Y}$ which maps x to a label y = f(x). The set X contains all the input vectors, and we call X the input space. The set Y contains the corresponding labels, and we call Y the output space.

In any supervised learning scenario, there is always a training set S sampled from PD. The training set contains N input-output pairs (x(1),y(1)),,(x(N),y(N)), where x(n) and y(n) are related via y(n)= f(x(n)), for n=1,,N. These input-output pairs are called the data points or samples. Since S is a finite collection of data points, there are many xX that do not live in the training set S. A data point x(n) that is inside S is called an in-sample, and a data point x that is outside S is called an out-sample.

When we say that we use a machine learning algorithm to learn a classifier, we mean that we have an algorithmic procedure A (e.g. Logistic Regression, KNN, etc.) which uses the training set S to select a hypothesis function hS:X→Y. The hypothesis function is again a mapping from X to Y, because it tells us how a sample x is classified. However, a hypothesis function hS learned by the algorithm is not the same as the target function f. We never know f because f is simply unknown. No matter how much we learn, the hypothesis function hS is at best an approximation of f. The approximation error can be zero in some hand-crafted toy examples, but in general hS≠f. All hypothesis functions are contained in the hypothesis set H. If the hypothesis set is finite, then H={h1,…,hM}, and hS will be one of these hm's. A hypothesis set can also be infinite; for example, we can perturb a perceptron decision boundary by infinitesimal steps to obtain an infinite hypothesis set. An infinite hypothesis set is denoted by H={hσ}, where σ denotes a continuous parameter.

The drawings in Fig. 15 illustrate a few key concepts we just mentioned. On the left hand side there is an input space X, which contains a small subset S. The subset S is the training set, which includes a finite number of training samples or in-samples. There is an unknown target function f. The target function f maps an x(n) to produce an output y(n)=f(x(n)), hence giving the colored dots in the middle of the figure. The objective of learning is to learn a classifier which can classify the red from the blue. The space containing all the possible hypotheses is the hypothesis set H, which contains h1,…,hM. The final hypothesis function returned by the learning algorithm is hS.

../../../_images/ece595_fig4.1.jpeg

Fig. 15 [Left] Treat the cloud as the entire input space X and correspondingly the output space Y. The dots are the in-samples x(1),,x(N). The target function is a mapping f which takes x(n) and send it to y(n). The red and blue colors indicate the class label. [Right] A learning algorithm picks a hypothesis function hS from the hypothesis set H={h1,,hM}. Note that some hypotheses are good, and some are bad. A good learning algorithm will pick a good hypothesis, and a bad learning algorithm can pick a bad hypothesis.#

Fig. 16 illustrates what we call a probabilistic learning model. It is called a probabilistic learning model because there is an unknown distribution PD(x). The training samples {x(1),…,x(N)} are generated according to PD(x). The same PD(x) also generates the testing samples x. It is possible to lift the probabilistic assumption so that the training samples are drawn deterministically. In this case, the samples are simply a fixed set of data points {x(1),…,x(N)}. The deterministic assumption will make learning infeasible, as we will see shortly. Therefore, we shall mainly focus on the probabilistic assumption.

../../../_images/ece595_fig4.2.jpeg

Fig. 16 All essential components of a machine learning model.#

Is Learning Feasible?#

The first question we ask is: Suppose we have a training set S, can we learn the target function f? If the answer is YES, then we are all in business, because it means that we will be able to predict the data we have not seen. If the answer is NO, then machine learning is a liar and we should all go home, because it means that we can only memorize what we have seen but we will not be able to predict what we have not seen.

Interestingly, the answer to this question depends on how we define the training samples x(n) ‘s. If x(n) ‘s are deterministically defined, then the answer is NO because S can contain no information about the out-samples. Thus, there is no way to learn outside S. If x(n) ‘s are drawn from a probabilistic distribution, then the answer is YES because the distribution will tell us something about the out-samples. Let us look at these two situations one by one.

Learning Outside the Training Set S (Deterministic Case)#

Let us look at the deterministic case. Consider a 3-dimensional input space X={0,1}³. Each vector x ∈ X is a binary vector containing three elements, e.g., x=[0,0,1]ᵀ or x=[1,0,1]ᵀ. Since there are 3 elements and each element takes a binary state, there are in total 2³ = 8 input vectors in X.

How many possible target functions f can we have? Remember, a target function f is a mapping which converts an input vector x to a label y. For simplicity let us assume that f maps x to a binary output y ∈ {+1,−1}. Since there are 8 input vectors, we can think of f as an 8-bit vector, e.g., f=[−1,+1,−1,−1,−1,+1,+1,+1], where each entry represents the output. If we do the calculation, we can show that there are in total 2⁸ = 256 possible target functions.

Here is the learning problem. Can we learn f from S ? To ensure that f is unknown, we will not disclose what f is. Instead, we assume that there is a training set S containing 6 training samples {x(1),,x(6)}. Corresponding to each x(n) is the label y(n). The relationship between x(n) and y(n) is shown in the table below. So our task is to pick a target function from the 256 possible choices.

Table 7 Truth Table for 6 samples#

| $\mathbf{x}^{(n)}$ | $y^{(n)}$ |
| --- | --- |
| $[0,0,0]$ | $y^{(1)}$ |
| $[0,0,1]$ | $y^{(2)}$ |
| $[0,1,0]$ | $y^{(3)}$ |
| $[0,1,1]$ | $y^{(4)}$ |
| $[1,0,0]$ | $y^{(5)}$ |
| $[1,0,1]$ | $y^{(6)}$ |

Table 8 Function Table#

| $\mathbf{x}^{(n)}$ | $y^{(n)}$ | $h_S$ | $f_1$ | $f_2$ | $f_3$ | $f_4$ |
| --- | --- | --- | --- | --- | --- | --- |
| $[0,0,0]$ | $y^{(1)}$ | $y^{(1)}$ | $y^{(1)}$ | $y^{(1)}$ | $y^{(1)}$ | $y^{(1)}$ |
| $[0,0,1]$ | $y^{(2)}$ | $y^{(2)}$ | $y^{(2)}$ | $y^{(2)}$ | $y^{(2)}$ | $y^{(2)}$ |
| $[0,1,0]$ | $y^{(3)}$ | $y^{(3)}$ | $y^{(3)}$ | $y^{(3)}$ | $y^{(3)}$ | $y^{(3)}$ |
| $[0,1,1]$ | $y^{(4)}$ | $y^{(4)}$ | $y^{(4)}$ | $y^{(4)}$ | $y^{(4)}$ | $y^{(4)}$ |
| $[1,0,0]$ | $y^{(5)}$ | $y^{(5)}$ | $y^{(5)}$ | $y^{(5)}$ | $y^{(5)}$ | $y^{(5)}$ |
| $[1,0,1]$ | $y^{(6)}$ | $y^{(6)}$ | $y^{(6)}$ | $y^{(6)}$ | $y^{(6)}$ | $y^{(6)}$ |
| $[1,1,0]$ | / | $\pm 1$ | $\pm 1$ | $\pm 1$ | $\pm 1$ | $\pm 1$ |
| $[1,1,1]$ | / | $\pm 1$ | $\pm 1$ | $\pm 1$ | $\pm 1$ | $\pm 1$ |

On the six observed inputs, hS and every fi must agree with the observed label y(n); on the two unseen inputs (marked "/"), the four candidate targets f1,…,f4 enumerate the four possible ±1 assignments, and the entries of hS are whatever the learning algorithm decides.

Since we have seen 6 out of 8 input vectors in S, there remain two input vectors we have not seen and need to predict. Thus, we can quickly reduce the number of possible target functions to 2² = 4. Let us call these target functions f1, f2, f3 and f4. The Boolean structure of these target functions is shown on the right-hand side of the table above. Note that the first 6 entries of each fi are fixed because they are already observed in S.

In the table above we also write down the final hypothesis function hS. The last two entries of hS are to be determined by the learning algorithm. If the learning algorithm decides that both unseen inputs should be labeled +1, then hS ends with (+1, +1); if it decides −1 followed by +1, then hS ends with (−1, +1), and so on. So the final hypothesis function hS can be one of 4 possible choices, the same number of choices as for the target functions.

Since we assert that f is unknown, by only observing the first 6 entries we will have 4 equally good hypothesis functions. They are equally good because, no matter which hypothesis function we choose, the last 2 entries will agree or disagree with the target depending on which one is the true target function. For example, on the left hand side of the table below, the true target function is f1 and so our hS is correct. But if the true target function is f3, e.g., the right hand side of the table, then our hS is wrong. We can repeat the experiment by choosing another hS, and we can show that no matter which hS we choose, we only have a 25% chance of picking the correct one. This is the same as drawing a lottery from 4 numbers. The information we learned from the training set S does not allow us to infer anything outside S.

Table 9 Function Table with f1 as True Function#

The function table is repeated with f1 taken as the true target function: on the two unseen inputs [1,1,0] and [1,1,1], the entries of hS agree with those of f1, so hS happens to be correct outside S.

Table 10 Function Table with f3 as True Function#

The same function table with f3 taken as the true target function: on the unseen inputs [1,1,0] and [1,1,1], the entries of hS disagree with those of f3, so hS is wrong outside S.

The above analysis shows that learning is infeasible if we have a deterministic generator generating the training samples. The argument holds regardless of which learning algorithm A we use and which hypothesis set H we choose. Whether or not H contains the correct hypothesis function, and whether or not A can pick the correct hypothesis, makes no difference in terms of predicting outside S. We can also extend the analysis from binary functions to general learning problems. As long as f remains unknown, it is impossible to predict outside S.

Identically and Independently Distributed Random Variables#

For the following section, we will discuss the case from a probabilistic perspective.

However, we will need to assume that the random variables are identically and independently distributed (i.i.d.). This means that the random variables are drawn from the same distribution and are independent of each other.

For formal definition, see Independent and Identically Distributed (IID).

This assumption is ubiquitous in machine learning. Not only does it simplify the analysis, it also underpins many of the theoretical results in the framework underlying machine learning.

This assumption is a strong one and is not always true in practice. However, it is a reasonable one.

See extracted paragraph from Machine Learning Theory below:

This assumption is essential for us. We need it to start using the tools from probability theory to investigate our generalization probability, and it's a very reasonable assumption because:

  1. It’s more likely for a dataset used for inferring about an underlying probability distribution to be all sampled for that same distribution. If this is not the case, then the statistics we get from the dataset will be noisy and won’t correctly reflect the target underlying distribution.

  2. It’s more likely that each sample in the dataset is chosen without considering any other sample that has been chosen before or will be chosen after. If that’s not the case and the samples are dependent, then the dataset will suffer from a bias towards a specific direction in the distribution, and hence will fail to reflect the underlying distribution correctly.

Learning Outside the Training Set S (Probabilistic Case)#

The deterministic analysis gives us a pessimistic result. Now, let us look at a probabilistic analysis. On top of the training set S, we pose an assumption. We assume that every x ∈ X is drawn from a distribution PD(x). This includes all the in-samples x(n) ∈ S and the out-samples x ∈ X. At first glance, putting a distributional assumption PD(x) does not seem any different from the deterministic case: we still have a training set S={x(1),…,x(N)}, and f is still unknown. How can we learn the unknown f using just the training samples?

Suppose that we pick a hypothesis function h from the hypothesis set H. For every in-sample x(n), we check whether the output returned by h is the same as the output returned by f, i.e., whether h(x(n))=f(x(n)), for n=1,…,N. If h(x(n))=f(x(n)), then we say that the in-sample x(n) is correctly classified in the training. If h(x(n))≠f(x(n)), then we say that x(n) is incorrectly classified. Averaging over all the N samples, we obtain a quantity called the in-sample error, or the training error. In our machine learning notation, the in-sample error is the empirical risk R̂S(h), the objective minimized in Empirical Risk Minimization (ERM).

Definition 107 (Zero-One Loss)

The zero-one loss is defined as

$$\mathcal{L}((\mathbf{x}, y), h) = \mathbb{1}\{h(\mathbf{x}) \neq y\} \tag{58}$$

where $\mathbb{1}\{h(\mathbf{x}) \neq y\}$ is the indicator function that returns 1 if the condition is true, and 0 otherwise.

With this,

Definition 108 (In-Sample Error (Empirical Risk Minimization))

Consider a training set $S = \{(\mathbf{x}^{(1)}, y^{(1)}), (\mathbf{x}^{(2)}, y^{(2)}), \ldots, (\mathbf{x}^{(N)}, y^{(N)})\} \overset{\text{i.i.d.}}{\sim} \mathbb{P}_{\mathcal{D}}$ and a target function f. The in-sample error (or the training error) of a hypothesis function $h \in \mathcal{H}$ is the empirical average of the zero-one loss $\mathcal{L}((\mathbf{x}, y), h) := \mathbb{1}\{h(\mathbf{x}) \neq f(\mathbf{x})\}$:

$$\hat{\mathcal{R}}_S\{\mathcal{L}((\mathbf{x}, y), h)\} := \frac{1}{N}\sum_{n=1}^{N} \mathcal{L}((\mathbf{x}^{(n)}, y^{(n)}), h) = \frac{1}{N}\sum_{n=1}^{N} \mathbb{1}\{h(\mathbf{x}^{(n)}) \neq f(\mathbf{x}^{(n)})\} = \mathbb{E}_S[\mathcal{L}((\mathbf{x}, y), h)] \tag{59}$$

Training error is the amount of error we have during the training process. A good learning algorithm A should pick a hypothesis h that gives low training error. Training error is sometimes called the cost (empirical risk) function (or the loss function) when we pose the learning problem as an optimization. Thus, picking a good hypothesis is equivalent to minimizing the training error.
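To make the definition concrete, here is a minimal Python sketch (my own illustration, not from the sources above) of the in-sample error under the zero-one loss; the threshold classifier `h` and the target f(x) = sign(x) are hypothetical toy choices.

```python
import numpy as np

def zero_one_loss(y_pred, y_true):
    """L((x, y), h) = 1{h(x) != y}, evaluated element-wise."""
    return (y_pred != y_true).astype(float)

def empirical_risk(h, X, y):
    """R_hat_S(h) = (1/N) * sum_n L((x^(n), y^(n)), h)."""
    return zero_one_loss(h(X), y).mean()

# Toy setup: target f(x) = sign(x), and a slightly-off threshold hypothesis.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=1_000)
y = np.sign(X)                      # labels produced by the (here known) target f
h = lambda x: np.sign(x - 0.1)      # an imperfect hypothesis h
print(empirical_risk(h, X, y))      # about 0.05, since h and f disagree on 0 < x < 0.1
```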

How about the out-samples? Since we assume that x is drawn from a distribution PD(x), we can define the out-sample error as the probability that h(x)≠f(x), where x ∼ PD(x).

Definition 109 (Out-Sample Error (Generalization Error))

Consider an input space X containing elements x drawn from a distribution PD, and a target function f. The out-sample error (or the true risk/testing error) of a hypothesis function hH is

$$\mathcal{R}_{\mathcal{D}}\{\mathcal{L}((\mathbf{x}, y), h)\} := \mathbb{P}[h(\mathbf{x}) \neq f(\mathbf{x})] = \mathbb{E}_{\mathcal{D}}[\mathcal{L}((\mathbf{x}, y), h)] = \mathbb{E}_{\mathcal{D}}[\mathbb{1}\{h(\mathbf{x}) \neq f(\mathbf{x})\}]$$

where $\mathbb{P}[\cdot]$ measures the probability of the statement based on the distribution PD(x).

How did we get from the probability of classifying one sample wrongly to the expectation of the loss function?

Since the indicator is a binary function, the out-sample error is the expected value of the indicator of misclassification over the entire distribution:

$$\begin{aligned}
\mathcal{R}_{\mathcal{D}}\{\mathcal{L}((\mathbf{x}, y), h)\} &= \mathbb{P}[h(\mathbf{x}) \neq f(\mathbf{x})] \\
&= 1 \cdot \mathbb{P}[h(\mathbf{x}) \neq f(\mathbf{x})] + 0 \cdot \big(1 - \mathbb{P}[h(\mathbf{x}) \neq f(\mathbf{x})]\big) \\
&= \mathbb{E}_{\mathcal{D}}\big[\mathbb{1}\{h(\mathbf{x}) \neq f(\mathbf{x})\}\big].
\end{aligned}$$

Therefore, the relationship between the in-sample error R^S{L((x,y),h)} and out-sample error RD{L((x,y),h)} is equivalent to the relationship between the empirical average and the population mean of a random variable, where the random variable is the loss function L.

Fig. 17 shows how an in-sample error is computed.

../../../_images/ece595_fig4.3.jpeg

Fig. 17 R^S is evaluated using the training data, whereas RD is evaluated using the testing sample. Here the author uses Ein and Eout to denote the in-sample and out-sample error, respectively.#

To this end, let us recap what we have:

  • A hypothesis space H of functions that we can use to approximate the underlying distribution.

  • A loss function L that we can use to measure the quality of our approximation.

  • A learning algorithm A that can be used to find the best function in H that approximates the underlying distribution D.

  • A test dataset Stest sampled i.i.d. from the underlying and unknown distribution D.

We want to know the probability that our learning algorithm will find a function in H that approximates the underlying distribution D well enough to generalize well to the test dataset Stest.

In other words, the learning problem at hand is to find a hypothesis h ∈ H that minimizes the empirical risk R̂ over the training samples S, which are generated by the underlying unknown true distribution D, such that the generalization risk/error satisfies R ≤ R̂ + ϵ with high probability.

How can we do that? Let’s start with the Law of Large Numbers.

Remark 28 (Notation)

It should be clear from context that R is the generalization error of the hypothesis h over the entire distribution D. In other words, this is the expected error based on the entire distribution D.

On the other hand, R^ is the empirical risk of the hypothesis h over the training samples S.

The Generalization Gap#

Definition 110 (Generalization Gap)

Given a sample set S={(x(n),y(n))}n=1N drawn i.i.d. from the distribution D, a hypothesis hS ∈ H learnt by the algorithm A on S, and a specific definition of the loss function L (e.g. the zero-one loss), the generalization gap is defined as:

$$\epsilon_{\text{gen}}(h_S) = \left|\mathcal{R}_{\mathcal{D}}\{\mathcal{L}((\mathbf{x}, y), h_S)\} - \hat{\mathcal{R}}_S\{\mathcal{L}((\mathbf{x}, y), h_S)\}\right|.$$

The Law of Large Numbers#

Recall that in the section Empirical Risk Minimization approximates True Risk Minimization of the chapter on ERM, we mentioned that the Empirical Risk approximates the True Risk as the number of samples N grows. This is the Law of Large Numbers that was mentioned in an earlier chapter.

We restate the Weak Law of Large Numbers from Theorem 55 below, but with notation more aligned to the machine learning notation chapter. In particular, we use L to suggest that the random variable is the loss value of L(·). In other words, L(n) is treated as a realization of the random variable L on the n-th sample.

Theorem 60 (Weak Law of Large Numbers)

Let L(1), …, L(N) be i.i.d. random variables with common mean μ and variance σ². Each L(n) follows the same probability distribution P (induced by D).

Let L̄ be the sample average defined in (43) and assume E[L²] < ∞.

Then, for any ϵ>0, we have

$$\lim_{N \to \infty} \mathbb{P}\left[\left|\mathbb{E}_{L \sim \mathbb{P}}[L] - \frac{1}{N}\sum_{n=1}^{N} L^{(n)}\right| > \epsilon\right] := \lim_{N \to \infty} \mathbb{P}\left[\left|\mu - \bar{L}^{(N)}\right| > \epsilon\right] = 0 \tag{60}$$

This means that

$$\bar{L} \overset{p}{\to} \mathbb{E}_{L \sim \mathbb{P}}[L] \quad \text{as } N \to \infty \tag{61}$$

In other words, as the sample size N grows, the probability that the sample average L̄ differs from the population mean μ by more than ϵ approaches zero. Note that this does not say the difference itself becomes zero; the statement is about the probability of the deviation exceeding ϵ. So in layman's terms, as N grows, it becomes overwhelmingly likely (with probability approaching 1) that the sample average is within ϵ of the population mean. This seems strong since ϵ can be arbitrarily small, but it is still only a probability statement.
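As a quick illustration (my own, with an assumed population mean of 0.3), the running sample average of i.i.d. zero-one losses settles around E[L] as N grows:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 0.3                                    # assumed population mean E[L]
losses = rng.binomial(1, mu, size=100_000)  # realizations L^(1), ..., L^(N) of the 0-1 loss
running_avg = np.cumsum(losses) / np.arange(1, losses.size + 1)

for N in [10, 100, 1_000, 10_000, 100_000]:
    print(f"N = {N:6d}   sample average = {running_avg[N - 1]:.4f}   "
          f"|diff from mu| = {abs(running_avg[N - 1] - mu):.4f}")
```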

Then recall that the True Risk Function is defined as:

$$\mathcal{R}_{\mathcal{D}}\{\mathcal{L}((\mathbf{x}, y), h)\} := \mathbb{E}_{\mathcal{D}}[\mathcal{L}((\mathbf{x}, y), h)]$$

and the Empirical Risk Function is defined as:

$$\hat{\mathcal{R}}_S\{\mathcal{L}((\mathbf{x}, y), h)\} := \frac{1}{N}\sum_{n=1}^{N} \mathcal{L}((\mathbf{x}^{(n)}, y^{(n)}), h)$$

Furthermore, since L is defined to be a random variable representing the loss/error of a single sample, then we can rewrite the True Risk Function as:

$$\mathcal{R}_{\mathcal{D}}\{\mathcal{L}((\mathbf{x}, y), h)\} = \mathbb{E}_{\mathcal{D}}[L]$$

which means that the True Risk Function is the expected loss/error of all possible samples. This is because we treat L as a random variable and we take the expected value of it.

Remark 29 (Random Variable is a Function)

The notation might seem overloaded and abused, but since a random variable is itself a function, and L(·) is also a function mapping outcomes (x, y) to a real number, we can treat L as a random variable representing the loss/error of a single sample.

Now, define L(n) to be the loss/error of the n-th sample in the train dataset S. Then we can rewrite the Empirical Risk Function as:

$$\hat{\mathcal{R}}_S\{\mathcal{L}((\mathbf{x}, y), h)\} := \frac{1}{N}\sum_{n=1}^{N} L^{(n)}$$

which means that the Empirical Risk Function is the average loss/error of all training samples.

Notice that they both have exactly the same form as the Weak Law of Large Numbers, so we can apply the Weak Law of Large Numbers to the True Risk Function and the Empirical Risk Function. This natural application of the Weak Law of Large Numbers allows us to answer the following question (also in (57)):

Remark 30 (Summary)

Given a dataset S={(x(n),y(n))}n=1N and a fixed, single hypothesis h, the Weak Law of Large Numbers tells us that as the number of samples in your training set increases from N to ∞, the Empirical Risk Function converges to the True Risk Function.

$$\lim_{N \to \infty} \mathbb{P}\left[\left|\mathcal{R}_{\mathcal{D}}(h) - \hat{\mathcal{R}}_S(h)\right| > \epsilon\right] = 0 \tag{62}$$

where the notation of R and R^ are simplified to only contain h for readability.

Well, this is something. At least we know that if we can bump up the number of samples in our training set towards ∞, then the Empirical Risk Function is guaranteed to be close to the True Risk Function. But this is not really very useful, because we can't get infinitely many samples in practice.

Can we do better by finding an upper bound on the probability in (62) that holds for finite N? This bound should be a function of the number of samples N, so that we at least know how many samples we need to get a good approximation of the True Risk Function, or, if we cannot get more samples, what the theoretical maximum deviation of the Empirical Risk Function is.

Hoeffding’s Inequality#

Earlier, we also saw the discussion of this inequality in a previous chapter. This inequality will help us answer the question above.

We restate the Hoeffding’s Inequality here, but in machine learning context:

Theorem 61 (Hoeffding’s Inequality)

Consider a dataset S={(x(n),y(n))}n=1N drawn i.i.d. from an unknown distribution D. Let the hypothesis set H={h1,,hK} be a finite set of hypotheses. Then, suppose we fix a hypothesis h in H (for any hH) before we look at the dataset S, which means we have not learnt hS yet.

Furthermore, let L((x,y),h) be the loss/error of a single sample (x,y) with respect to the hypothesis h such that 0 ≤ L((x,y),h) ≤ 1, and let E[L((x,y),h)] be the expected loss/error over the entire distribution D.

We can then define a sequence of random variables {L(n)}n=1N such that L(n) is the loss/error of the n-th sample in the dataset S.

Then for any ϵ>0, we have the following bound:

$$\mathbb{P}\left[\left|\mathbb{E}[\mathcal{L}((\mathbf{x}, y), h)] - \frac{1}{N}\sum_{n=1}^{N} \mathcal{L}((\mathbf{x}^{(n)}, y^{(n)}), h)\right| > \epsilon\right] = \mathbb{P}\left[\left|\mathcal{R}_{\mathcal{D}}(h) - \hat{\mathcal{R}}_S(h)\right| > \epsilon\right] \leq 2e^{-2\epsilon^2 N}$$

where N is the number of samples in the dataset S, and R and R^ are the True Risk Function and the Empirical Risk Function respectively.

Remark 31 (Things to Note (Important))

  1. Note that both L((x,y),h) and the samples (x(n),y(n)) are drawn from the same distribution D.

  2. Important: Note carefully that this h is picked even before the dataset S is generated from D. This is not a trivial concept and requires quite some justification. See Why must h be fixed and defined before generating the dataset S?.

  3. The random variable must be bounded between 0 and 1, so your loss function must be bounded between 0 and 1. The zero-one loss satisfies this automatically; for other losses you may need to clip or rescale them into [0,1] (more general forms of Hoeffding's inequality only require the loss to lie in some bounded interval).

As the number of training samples N grows, the in-sample error R̂S(h) (the training error) converges to the out-sample error RD(h) (the testing error), with the probability of a deviation larger than ϵ decaying exponentially in N. The in-sample error R̂S(h) is something we can compute numerically using the training set. The out-sample error is an unknown quantity because we do not know the target function f. Hoeffding's inequality says that even though we do not know RD(h), for large enough N the in-sample error R̂S(h) will be sufficiently close to RD(h). Therefore, we will be able to tell how good the hypothesis function is without accessing the unknown target function.
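The exponential behaviour can be checked numerically. The sketch below (illustrative only; the true error RD(h) = 0.3 is an assumption) repeatedly draws training sets for a single fixed hypothesis and compares the observed frequency of the bad event with the Hoeffding bound 2e^(−2ϵ²N):

```python
import numpy as np

rng = np.random.default_rng(1)
R_D = 0.3          # assumed out-sample error of the fixed hypothesis h
eps = 0.05
trials = 20_000    # number of independently drawn training sets S

for N in [50, 200, 500, 1_000]:
    # With the zero-one loss, N * R_hat_S(h) ~ Binomial(N, R_D).
    R_hat = rng.binomial(N, R_D, size=trials) / N
    simulated = np.mean(np.abs(R_hat - R_D) > eps)
    bound = 2 * np.exp(-2 * eps**2 * N)
    print(f"N = {N:5d}   simulated P[|R_hat - R_D| > eps] = {simulated:.4f}   "
          f"Hoeffding bound = {bound:.4f}")
```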

Comparing Hoeffding’s Inequality with the Chebyshev’s Inequality#

Let us quickly compare the Hoeffding inequality with the Chebyshev inequality. The Chebyshev inequality states that

$$\mathbb{P}\left[\left|\mathbb{E}[\mathcal{L}((\mathbf{x}, y), h)] - \frac{1}{N}\sum_{n=1}^{N} L^{(n)}\right| > \epsilon\right] \leq \frac{\sigma^2}{\epsilon^2 N}.$$

where σ2 is the variance of the loss/error L((x,y),h).

If we require 2e^{−2ϵ²N} ≤ δ in the Hoeffding inequality, and σ²/(ϵ²N) ≤ δ in the Chebyshev inequality, we can easily see that the two inequalities imply

$$N \geq \frac{1}{2\epsilon^2}\log\frac{2}{\delta}, \qquad \text{and} \qquad N \geq \frac{\sigma^2}{\epsilon^2 \delta},$$

respectively.

For simplicity let us assume that σ = 1, ϵ = 0.1 and δ = 0.01. Then the above calculation gives N ≥ 265 for Hoeffding, whereas N ≥ 10000 for Chebyshev. That means the Hoeffding inequality requires far fewer samples to guarantee a deviation of at most ϵ = 0.1 with confidence δ = 0.01.

../../../_images/ece595_fig4.4.jpeg

Fig. 18 Comparing Hoeffding inequality and Chebyshev inequality to predict the actual probability bound.#

We see that Chebyshev gives a more conservative (looser) bound than Hoeffding. Hoeffding is also stronger in the sense that its right-hand side does not require knowing anything about the random variable L((x,y),h), such as its variance σ², only that it is bounded.
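The back-of-envelope numbers above are easy to reproduce; this small sketch simply evaluates the two sample-size requirements for the stated σ, ϵ and δ:

```python
import numpy as np

sigma, eps, delta = 1.0, 0.1, 0.01
N_hoeffding = np.log(2 / delta) / (2 * eps**2)   # from 2 * exp(-2 * eps^2 * N) <= delta
N_chebyshev = sigma**2 / (eps**2 * delta)        # from sigma^2 / (eps^2 * N) <= delta
print(round(N_hoeffding), round(N_chebyshev))    # roughly 265 vs 10000
```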

Example: Hoeffding’s Inequality in Classification#

Notation may be slightly different from the rest of the section.

The following example is adapted from STAT425.

A powerful feature of the Hoeffding’s inequality is that it holds regardless of the classifier. Namely, even if we are considering many different types of classifiers, some are decision trees, some are kNN, some are logistic regression, they all satisfy equation above.

What this means is that this inequality holds for any classifier h and any loss function L that is bounded in [0, 1].

So even if your loss function is not the 0-1 loss, you can still use Hoeffding's inequality, provided the loss is bounded, say a clipped or rescaled cross-entropy loss: the probability that the expected loss of the classifier deviates from its empirical loss by more than ϵ is then bounded by the right-hand side of the inequality.
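One way (a hypothetical sketch, not a recipe from the sources) to put an unbounded loss such as cross-entropy into the [0, 1] range required by this form of the inequality is to clip it at some cap and rescale; the bound then applies to the clipped loss rather than to the raw cross-entropy. The cap of 5.0 below is an arbitrary choice.

```python
import numpy as np

def clipped_cross_entropy(p_pred, y_true, cap=5.0):
    """Per-sample binary cross-entropy, clipped at `cap` and rescaled into [0, 1]."""
    p = np.clip(p_pred, 1e-12, 1 - 1e-12)
    ce = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return np.minimum(ce, cap) / cap   # now a bounded random variable in [0, 1]

y_true = np.array([1, 0, 1, 1])
p_pred = np.array([0.9, 0.2, 0.6, 0.01])
print(clipped_cross_entropy(p_pred, y_true))   # every value lies in [0, 1]
```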

PAC Framework#

The probabilistic analysis is called a probably approximately correct (PAC) framework. The word PAC comes from three principles of the Hoeffding inequality:

  • Probably: We use the probability $\mathbb{P}\left[\left|\mathcal{R}_{\mathcal{D}}\{\mathcal{L}((\mathbf{x},y),h)\} - \hat{\mathcal{R}}_S\{\mathcal{L}((\mathbf{x},y),h)\}\right| > \epsilon\right] \leq 2e^{-2\epsilon^2 N}$ as a measure to quantify the error.

  • Approximately: The in-sample error is an approximation of the out-sample error, as given by the same inequality. The approximation error is controlled by ϵ.

  • Correct: The error is bounded by the right-hand side of the Hoeffding inequality, $2e^{-2\epsilon^2 N}$. The accuracy is controlled by N for a fixed ϵ.

Hoeffding Inequality is Invalid for hS learnt from S#

Now, there is one last problem we need to resolve. The above Hoeffding inequality holds for a fixed hypothesis function h. This means that h is already chosen before we generate the dataset. If we allow h to change after we have generated the dataset, then the Hoeffding inequality is no longer valid. What do we mean by after generating the dataset? In any learning scenario, we are given a training dataset S. Based on this dataset, we have to choose a hypothesis function hS from the hypothesis set H. The hypothesis hS we choose depends on what samples are inside S and which learning algorithm A we use. So hS changes after the dataset is generated.

Now this is a non-trivial problem, and to fully understand this requires close scrutiny. Recall in Remark 31’s point 2, we said that h is fixed prior to generating the dataset S and using A to learn hS from S.

Here are some links that explain in detail where exactly the problem lies. All in all, one just needs to know that the derivation of Hoeffding's Inequality, which relies on the losses being i.i.d. for a fixed hypothesis, no longer applies if you allow h to change to hS after learning from S.


This has serious implications! The logic that follows is that before learning, for any fixed h, we can bound the error by Hoeffding's Inequality. But now, there is no guarantee that the probability of |RD(hS) − R̂S(hS)| > ϵ (the "bad event") is less than δ (the "bad event probability"). It could be less, it could be more, no one knows; but we do know that hS ∈ H.

Let’s see how we can make use of this property to bound the error for the entire hypothesis set H instead of just a single hypothesis h.

Union Bound#

Suppose that H contains M hypothesis functions h1, …, hM. The final hypothesis hS that your learning algorithm A picks is one of these potential hypotheses. For |RD(hS) − R̂S(hS)| > ϵ to happen, at least one of the M potential hypotheses must have a generalization gap larger than ϵ.

Remark 32 (Bounding the Entire Hypothesis Set)

This part helps visualize why you need to use union bound to bound all hypotheses in H.

This is best read together with the example in Wikipedia.

First, we must establish that for the hS learnt by A on the dataset S, the generalization gap |RD(hS)R^S(hS)| is no longer bounded by the Hoeffding Inequality. We turn our attention to bounding the entire hypothesis set H instead of just a single hypothesis h.

Second, let us define the bad event B to be:

$$B = \left\{\left|\mathcal{R}_{\mathcal{D}}(h_S) - \hat{\mathcal{R}}_S(h_S)\right| > \epsilon\right\}$$

which is the event that the error is greater than ϵ.

Then it follows that Bm is the bad event for the mth hypothesis hm in H:

$$B_m = \left\{\left|\mathcal{R}_{\mathcal{D}}(h_m) - \hat{\mathcal{R}}_S(h_m)\right| > \epsilon\right\}$$

We define the good event B^c, the complement of the bad event, to be:

$$B^c = \left\{\left|\mathcal{R}_{\mathcal{D}}(h_S) - \hat{\mathcal{R}}_S(h_S)\right| \leq \epsilon\right\}$$

which is the event that the error is less than or equal to ϵ.

Third, we want to show that the probability that all of h1, h2, …, hM are good is bounded below by some value, say ϕ.

In other words, denote by C the event where all h1, h2, …, hM are "good" (none of them are "bad"). The event C can be defined as:

$$C = B_1^c \cap B_2^c \cap \cdots \cap B_M^c$$

which is the event that all h1, h2, …, hM are "good". Now, we seek to show that the probability of C is at least ϕ:

$$\mathbb{P}(C) \geq \phi$$

For example, if ϕ = 0.95, this means that we can be 95% confident that all h1, h2, …, hM are "good" (i.e. all of h1, h2, …, hM give a generalization gap less than or equal to ϵ).

However, to make use of the Union Bound (Boole's Inequality), we need to express the events involving h1, h2, …, hM as a sequence of logical "or" events. Let's try to rephrase the event C accordingly.

Fourth, let Cc be the complement of the event C:

$$C^c = \left\{\exists\, h_m \in \mathcal{H} \text{ s.t. } h_m \text{ is bad}\right\} = B_1 \cup B_2 \cup \cdots \cup B_M = \{h_1 \text{ gives a bad error}\} \cup \{h_2 \text{ gives a bad error}\} \cup \cdots \cup \{h_M \text{ gives a bad error}\}$$

Then, establishing P(C) ≥ ϕ is equivalent to establishing the following:

$$\mathbb{P}(C) \geq \phi \iff 1 - \mathbb{P}(C^c) \geq \phi \iff \mathbb{P}(C^c) \leq 1 - \phi$$

where we invoked that P(C) + P(C^c) = 1. Now, we have turned the problem of lower-bounding P(C) by ϕ into upper-bounding the probability of C^c by 1 − ϕ (from a lower bound to an upper bound).

Fifth, so now we can instead find the equivalent of the Hoeffding Inequality for the event Cc:

$$\mathbb{P}(C^c) = \mathbb{P}(B_1 \cup B_2 \cup \cdots \cup B_M) \leq \mathbb{P}(B_1) + \mathbb{P}(B_2) + \cdots + \mathbb{P}(B_M)$$

where we invoked the Union Bound (Boole’s Inequality).

Last, in order for the entire hypothesis space to have a generalization gap bigger than ϵ, at least one of its hypotheses h1, h2, …, hM must have a gap bigger than ϵ. This can be expressed formally by stating that:

$$\mathbb{P}\left[\sup_{h \in \mathcal{H}} \left|\mathcal{R}(h) - \hat{\mathcal{R}}(h)\right| > \epsilon\right] = \mathbb{P}\left[\bigcup_{h \in \mathcal{H}} \left\{\left|\mathcal{R}(h) - \hat{\mathcal{R}}(h)\right| > \epsilon\right\}\right]$$

where ⋃ denotes the union of the events, which also corresponds to the logical OR operator. Using the union bound inequality, we get:

$$\begin{aligned}
\mathbb{P}\left[\left|\mathcal{R}_{\mathcal{D}}(h_S) - \hat{\mathcal{R}}_S(h_S)\right| > \epsilon\right] &\overset{(a)}{\leq} \mathbb{P}\left[\sup_{h \in \mathcal{H}} \left|\mathcal{R}(h) - \hat{\mathcal{R}}(h)\right| > \epsilon\right] = \mathbb{P}\left[\bigcup_{h \in \mathcal{H}} \left\{\left|\mathcal{R}(h) - \hat{\mathcal{R}}(h)\right| > \epsilon\right\}\right] \\
&\overset{(b)}{\leq} \sum_{m=1}^{M} \mathbb{P}\left[\left|\mathcal{R}_{\mathcal{D}}(h_m) - \hat{\mathcal{R}}_S(h_m)\right| > \epsilon\right] \leq \sum_{m=1}^{M} 2e^{-2\epsilon^2 N} = 2|\mathcal{H}|\, e^{-2\epsilon^2 N}
\end{aligned} \tag{63}$$

where (a) holds because P[A] ≤ P[B] if A ⊆ B, and (b) is the union bound, which says P[A or B] ≤ P[A] + P[B].

Therefore, if we bound each hm using the Hoeffding inequality

$$\mathbb{P}\left[\left|\hat{\mathcal{R}}_S(h_m) - \mathcal{R}_{\mathcal{D}}(h_m)\right| > \epsilon\right] \leq 2e^{-2\epsilon^2 N}$$

then the overall bound on hS is the sum of the M terms.

To see why (a) holds, consider the following:

$$\left\{\left|\mathcal{R}_{\mathcal{D}}(h_S) - \hat{\mathcal{R}}_S(h_S)\right| > \epsilon\right\} \subseteq \left\{\left|\mathcal{R}_{\mathcal{D}}(h_1) - \hat{\mathcal{R}}_S(h_1)\right| > \epsilon \;\text{ or }\; \left|\mathcal{R}_{\mathcal{D}}(h_2) - \hat{\mathcal{R}}_S(h_2)\right| > \epsilon \;\text{ or } \cdots \text{ or }\; \left|\mathcal{R}_{\mathcal{D}}(h_M) - \hat{\mathcal{R}}_S(h_M)\right| > \epsilon\right\}$$

If we call the LHS event A and the RHS event B, then since A ⊆ B, we have P[A] ≤ P[B].

Thus, we have found a bound for the whole hypothesis set H; our 1 − ϕ is actually $2|\mathcal{H}|e^{-2\epsilon^2 N}$.
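A rough simulation (my own illustration; the hypotheses' true errors are drawn arbitrarily, and their empirical risks are sampled independently for simplicity, whereas in a real learning problem they would be correlated through the shared dataset, which is exactly why the union bound tends to be loose) shows the worst gap over a finite hypothesis set staying below the union bound:

```python
import numpy as np

rng = np.random.default_rng(2)
M, N, eps, trials = 50, 2_000, 0.05, 5_000
true_errors = rng.uniform(0.1, 0.5, size=M)    # assumed out-sample errors R_D(h_m)

# With zero-one losses, R_hat_S(h_m) ~ Binomial(N, R_D(h_m)) / N for each hypothesis.
R_hat = rng.binomial(N, true_errors, size=(trials, M)) / N
worst_gap = np.abs(R_hat - true_errors).max(axis=1)   # sup over the M hypotheses

print("simulated P[sup_m |R_D(h_m) - R_hat_S(h_m)| > eps] ≈", (worst_gap > eps).mean())
print("union bound 2 * M * exp(-2 * eps^2 * N)            =", 2 * M * np.exp(-2 * eps**2 * N))
```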


The below content was what I initially wrote, but the argument is hand-wavy, so I tried to lay out the argument more formally above. Still quite unsure if the reasoning is correct, but at least should be along those lines.

If one has M hypotheses h1, …, hM in H, then what is the probability that all M hypotheses satisfy the inequality |RD(hm) − R̂S(hm)| ≤ ϵ? If the probability we find is 95%, this means that we can be 95% confident that all M hypotheses satisfy the inequality. This is setting δ = 0.05.

$$\mathbb{P}\left(\left|\mathcal{R}_{\mathcal{D}}(h_S) - \hat{\mathcal{R}}_S(h_S)\right| \leq \epsilon\right) \geq 1 - \delta$$

If we can find this probability, then it means that whichever hS is learnt by A on S will have a generalization gap of at most ϵ with probability 95% (similar to a confidence interval).

We stated the problem via the event "all M hypotheses are good"; its complement is "at least one hm is bad", which makes the union bound easier to apply.

Now, if we go the route of finding complement, then this objective is stated as:

If one has M hypotheses h1, …, hM in H, then what is the probability that there exists at least one hm such that |RD(hm) − R̂S(hm)| > ϵ?

$$\mathbb{P}\left(\left|\mathcal{R}_{\mathcal{D}}(h_S) - \hat{\mathcal{R}}_S(h_S)\right| > \epsilon\right) \leq \delta$$

These are two equivalent statements, and we can use either one.

Now, we can readily use the union bound since our expression is now in the form of "at least one hm is bad", which translates to a union of events: h1 is bad, h2 is bad, …, hM is bad.


Remark 33 (Major Confusion Alert)

DEFUNCT AS OF 1ST MARCH, 2023, READ WITH CAUTION AS IT IS NOT CORRECT. I AM LEAVING IT HERE FOR HISTORICAL PURPOSES.

What this means is that out of M hypotheses in H, at least one of them does satisfy the Hoeffding's Inequality, but you do not know which h it is. This is because our definition of Hoeffding's Inequality requires us to fix an h prior to generating the dataset. And therefore this h we fixed is not the same as the hS we picked after generating the dataset. So the Hoeffding's Inequality is no longer valid. It does sound confusing, because one would think that this inequality seems to hold for any h, but if we follow the definition, it is a fixed h, so when we get our new hS, it is no longer the "same" as the h we fixed prior.

This is why the question boils down to calculating the following probability:

$$\mathbb{P}\left[\sup_{h \in \mathcal{H}} \left|\mathcal{R}(h) - \hat{\mathcal{R}}(h)\right| > \epsilon\right]$$

That is the probability that the least upper bound (that is the supremum suphH ) of the absolute difference between R(h) and R^(h) is larger than a very small value ϵ.

In more layman's terms, every hm ∈ H induces a difference, say in_out_error_diff_m = R(hm) − R̂(hm), and this in_out_error_diff_m is a scalar value, say ranging from 0 to 1; sometimes a hypothesis hi can induce a difference of, say, 0.2, and sometimes another hypothesis hj can induce a difference of, say, 0.8. The supremum of these differences corresponds to the nasty and bad hypothesis h_bad that induces the maximum difference amongst all the hm ∈ H. BUT IF WE CAN BOUND THE BADDEST AND WORST CASE by a very small value ϵ, then we are good. This is exactly what the Hoeffding's Inequality does. It says that the probability of the largest difference between R(h) and R̂(h) exceeding ϵ is less than or equal to 2e^{−2ϵ²N}. Note this is not saying that exceeding ϵ is impossible; it is saying that the probability of this bad event happening and exceeding ϵ is bounded by 2e^{−2ϵ²N}. This is a very important point to note.

Framing Learning Theory with Hoeffding’s Inequality#

Theorem 62 (Learning Theory)

Consider a dataset S={(x(n),y(n))}n=1N drawn i.i.d. from an unknown distribution D. Let the hypothesis set H={h1,…,hM} be a finite set of hypotheses, and let hS ∈ H be the hypothesis found by the learning algorithm A on S. Then for any ϵ > 0, we have the following bound:

$$\mathbb{P}\left[\left|\mathcal{R}_{\mathcal{D}}(h_S) - \hat{\mathcal{R}}_S(h_S)\right| > \epsilon\right] \leq 2|\mathcal{H}|e^{-2\epsilon^2 N} = 2Me^{-2\epsilon^2 N} \tag{64}$$

where M = |H| is the number of hypotheses in the hypothesis set H, and R and R̂ are the True Risk Function and the Empirical Risk Function respectively.

Feasibility from the Two View Points#

The deterministic analysis shows that learning is infeasible, whereas the probabilistic analysis shows that learning is feasible. Are they contradictory? If we look at them closely, we realize that there is in fact no contradiction. Here are the reasons.

  1. Guarantee and Possibility. If we want a deterministic answer, then the question we ask is “Can S tell us something certain about f outside S ?” In this case the answer is no because if we have not seen the example, there is always uncertainty about the true f. If we want a probabilistic answer, then the question we ask is “Can S tell us something possibly about f outside S ?” In this case the answer is yes.

  2. Role of the distribution. There is one common distribution PD(x) which generates both the in-samples and the out-samples. Thus, whatever PD we use to generate S, we must use it to generate the testing samples. The testing samples are not inside S, but they come from the same distribution. Also, all samples are generated independently, so that we have i.i.d. when using the Hoeffding inequality.

  3. Learning goal. The ultimate goal of learning is to make RD(hS)≈0. However, in order to establish this result, we need two levels of approximation:

    $$\mathcal{R}(h_S) \underbrace{\approx}_{\text{Hoeffding Inequality}} \hat{\mathcal{R}}(h_S) \underbrace{\approx}_{\text{Training Error}} 0$$

    The first approximation is made by the Hoeffding inequality, which ensures that for sufficiently large N, we can approximate the out-sample error by the examples in S. The second approximation is to make the in-sample error, i.e., the training error, small. This requires a good hypothesis and a good learning algorithm.


Complex Hypothesis Set and Complex Target Function#

The results above tell us something about the complexity of the hypothesis set H and the target function f.

  • More complex H? If H is complex with a large M, then the approximation given by the Hoeffding inequality becomes loose. Remember, the Hoeffding inequality states that

    $$\mathbb{P}\left[\left|\hat{\mathcal{R}}_S(h_S) - \mathcal{R}_{\mathcal{D}}(h_S)\right| > \epsilon\right] \leq 2Me^{-2\epsilon^2 N}$$

    As M grows, the upper bound on the right-hand side becomes loose, and so we run the risk that R̂S(hS) deviates from RD(hS). In other words, if M is large, then the right-hand side will be very big and therefore the bound will be meaningless; it is like saying your deviation is less than +∞ (an exaggeration), which is of course true.

    However, if M is large, we have more candidate hypotheses to choose from and so the second approximation about the training error will go down. This gives the following relationship.

    $$\mathcal{R}(h_S) \underbrace{\approx}_{\text{worse if } \mathcal{H} \text{ complex}} \hat{\mathcal{R}}(h_S) \underbrace{\approx}_{\text{good if } \mathcal{H} \text{ complex}} 0$$

    Where is the optimal trade-off? This requires more investigation.

  • More complex f? If the target function f is complex, we will suffer from not being able to push the training error down. This makes R̂S(hS)≈0 difficult. However, since the complexity of f has no influence on the Hoeffding inequality, the first approximation R̂S(hS)≈RD(hS) is unaffected. This gives us

    $$\mathcal{R}(h_S) \underbrace{\approx}_{\text{no effect by } f} \hat{\mathcal{R}}(h_S) \underbrace{\approx}_{\text{worse if } f \text{ complex}} 0$$

    Trying to improve the approximation R̂S(hS)≈0 by increasing the complexity of H comes at a price: if H becomes complex, then the approximation R̂S(hS)≈RD(hS) will be hurt.

To this end, this definitely looks very similar to the bias-variance trade-off, which is often discussed in many machine learning courses. We will get to that later!

VC-Analysis#

The objective of this section is to go further into the analysis of the Hoeffding inequality to derive something called the generalization bound. There are two parts to our discussion.

  1. The first part is easy, which is to rewrite the Hoeffding inequality in the form of a "confidence interval" or "error bar". This will allow us to interpret the result better.

  2. The second part is to replace the constant M in the Hoeffding inequality by something smaller. This will allow us to derive something more meaningful. Why do we want to do that? What could go wrong with M? Remember that M is the number of hypotheses in H. If H is a finite set, then everything is fine because the exponentially decaying function in the Hoeffding inequality will override the constant M. However, for any practical H, M is infinite. Think of a perceptron: if we slightly perturb the decision boundary by an infinitesimal translation, we will get an infinite number of hypotheses, although these hypotheses could be very similar to each other. If M is infinite, then the probability bound offered by the Hoeffding inequality can potentially be bigger than 1, which is valid but meaningless. To address this issue we need to learn a concept called the VC dimension.

Generalization Bound#

We first see that the Hoeffding’s and Union bound earlier in (63) give us the following result:

$$\mathbb{P}\Big[\underbrace{\left|\mathcal{R}_{\mathcal{D}}(h_S) - \hat{\mathcal{R}}_S(h_S)\right| > \epsilon}_{\text{Bad Event } B}\Big] \leq \sum_{m=1}^{M} 2e^{-2\epsilon^2 N} = 2|\mathcal{H}|\, e^{-2\epsilon^2 N}$$

Now, we can rewrite the above inequality. We first say that $B = \{|\mathcal{R}_{\mathcal{D}}(h_S) - \hat{\mathcal{R}}_S(h_S)| > \epsilon\}$ is a bad event because the generalization gap is more than ϵ. We then require that the probability P[B] ≤ δ, which is equivalent to saying that with probability at least 1 − δ, the event B does not happen.

This means that:

$$\mathbb{P}[B] \leq \delta \iff \mathbb{P}[B^c] \geq 1 - \delta$$

where Bc is the complement of B, which is the event that B does not happen:

$$B^c = \left\{\left|\mathcal{R}_{\mathcal{D}}(h_S) - \hat{\mathcal{R}}_S(h_S)\right| \leq \epsilon\right\}$$

We can say that with a confidence 1δ :

$$\left|\mathcal{R}(h_S) - \hat{\mathcal{R}}(h_S)\right| \leq \epsilon \iff \hat{\mathcal{R}}(h_S) - \epsilon \leq \mathcal{R}(h_S) \leq \hat{\mathcal{R}}(h_S) + \epsilon$$

which follows simply from unpacking the absolute value.

If we can express ϵ in terms of δ, then we will arrive at our goal of rewriting the Hoeffding inequality. How about we set $\delta = 2Me^{-2\epsilon^2 N}$, which is the upper bound on the right-hand side? By rearranging the terms, we can show that $\epsilon = \sqrt{\frac{1}{2N}\log\frac{2M}{\delta}}$. Therefore, we arrive at the following inequality.

Theorem 63 (Generalization Bound)

Consider a learning problem where we have a dataset S={x(1),,x(N)}, and a hypothesis set H={h1,,hM}. Suppose hS is the final hypothesis picked by the learning algorithm. Then, with probability at least 1δ,

$$\hat{\mathcal{R}}_S(h_S) - \sqrt{\frac{1}{2N}\log\frac{2M}{\delta}} \;\leq\; \mathcal{R}_{\mathcal{D}}(h_S) \;\leq\; \hat{\mathcal{R}}_S(h_S) + \sqrt{\frac{1}{2N}\log\frac{2M}{\delta}}. \tag{65}$$

The inequality given by (65) is called the generalization bound, which we can consider as an "error bar". There are two sides to the generalization bound:

  • RD(hS) ≤ R̂S(hS) + ϵ (Upper Bound). The upper bound gives us a safeguard on how much worse RD(hS) can be compared to R̂S(hS). It says that the unknown quantity RD(hS) will not be significantly higher than R̂S(hS). The amount is specified by ϵ.

  • RD(hS) ≥ R̂S(hS) − ϵ (Lower Bound). The lower bound tells us what to expect. It says that the unknown quantity RD(hS) cannot be better (smaller) than R̂S(hS) − ϵ.

To make sense of the generalization bound, we need to ensure that ϵ → 0 as N → ∞. In doing so, we need to assume that M does not grow exponentially fast, for otherwise the term log 2M will cancel out the effect of 1/N. However, if H is an infinite set, then M is unavoidably infinite.

To conclude, this is our first generalization bound: it states that the generalization error is upper bounded by the training error plus a function of the hypothesis space size and the dataset size. We can also see that the bigger the hypothesis space gets, the bigger the bound on the generalization error becomes.
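As a small numerical sketch (the helper `error_bar` is hypothetical, simply evaluating the ϵ above), we can see how the error bar shrinks with N and grows, slowly, with M:

```python
import numpy as np

def error_bar(N, M, delta):
    """Half-width of the bound: eps = sqrt(log(2M / delta) / (2N))."""
    return np.sqrt(np.log(2 * M / delta) / (2 * N))

for M in [10, 1_000, 100_000]:
    eps_values = [round(error_bar(N, M, delta=0.05), 3) for N in [100, 1_000, 10_000]]
    print(f"M = {M:6d}   eps for N = 100, 1000, 10000: {eps_values}")
```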

Remark 34 (What if the Hypothesis Space is Infinite?)

For a linear hypothesis of the form $h(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b$, we also have |H| = ∞, as there are infinitely many lines that can be drawn. So the generalization bound for the linear hypothesis space should be vacuous! If that's true, how do perceptrons, logistic regression, support vector machines, and essentially any ML model that uses a linear hypothesis work with the learning theory bound we just proposed?

There is something missing from the picture. Let’s take a deeper look at the generalization bound.

It is good to stop here and read section Examining the Independence Assumption and The Symmetrization Lemma in the post written by Mostafa before proceeding.

The Growth Function#

To resolve the issue of having an infinite M, we realize that there is serious slack caused by the union bound when extending the Hoeffding inequality to the whole hypothesis set. If we look at the union bound, we notice that for every hypothesis h ∈ H there is an event $B = \{|\hat{\mathcal{R}}_S\{\mathcal{L}((\mathbf{x},y),h)\} - \mathcal{R}_{\mathcal{D}}\{\mathcal{L}((\mathbf{x},y),h)\}| > \epsilon\}$. If we have M of these hypotheses, the union bound tells us that

$$\mathbb{P}[B_1 \text{ or } \cdots \text{ or } B_M] \leq \mathbb{P}[B_1] + \cdots + \mathbb{P}[B_M]$$

The union bound is tight (the "≤" is replaced by "=") when the events B1, …, BM are disjoint (mutually exclusive). But if the events B1, …, BM overlap, then the union bound is loose, in fact, very loose. Having a loose bound does not mean that the bound is wrong. The bound is still correct, but the right-hand side of the inequality will be a severe overestimate of the left-hand side. Will this happen in practice? Unfortunately many hypotheses are indeed very similar to each other and so the events B1, …, BM overlap heavily. For example, if we move the decision boundary returned by a perceptron algorithm by an infinitesimal step, we will have infinitely many hypotheses, each highly dependent on the others.

We need some tools to handle the overlapping situation. To do so we introduce two concepts. The first concept is called the dichotomy, and the second concept is called the growth function. Dichotomies will define a growth function, and the growth function will allow us to replace M by a much smaller quantity that takes care of the overlapping issue.

Consider a dataset containing N data points x(1), …, x(N). Pick a hypothesis h from the hypothesis set H, and for simplicity assume that the hypothesis is binary: {+1, −1}. If we apply h to (x(1), …, x(N)), we will get an N-tuple (h(x(1)), …, h(x(N))) of ±1's. Each N-tuple is called a dichotomy. The collection of all possible N-tuples (obtained by picking all h ∈ H) is denoted H(x(1), …, x(N)). For example, if H contains two hypotheses hα and hβ such that hα turns all training samples x(n) to +1 and hβ turns all training samples x(n) to −1, then we have two dichotomies and H(x(1), …, x(N)) is defined as

$$\mathcal{H}(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(N)}) = \left\{\left(h_\alpha(\mathbf{x}^{(1)}), \ldots, h_\alpha(\mathbf{x}^{(N)})\right), \left(h_\beta(\mathbf{x}^{(1)}), \ldots, h_\beta(\mathbf{x}^{(N)})\right)\right\} = \{(+1, \ldots, +1), (-1, \ldots, -1)\}$$

More generally, the definition of H(x(1),,x(N)) is as follows.

Definition 111 (Dichotomy)

Let x(1),,x(N)X. The dichotomies generated by H on these points are

$$\mathcal{H}(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(N)}) = \left\{\left(h(\mathbf{x}^{(1)}), \ldots, h(\mathbf{x}^{(N)})\right) \;\middle|\; h \in \mathcal{H}\right\} \tag{66}$$

The above definition suggests that H(x(1), …, x(N)) is a function depending on the training samples x(1), …, x(N). Therefore, a different set (different S) of {x(1), …, x(N)} will give a different H(x(1), …, x(N)). However, since each dichotomy is a binary N-tuple, different hypotheses can produce identical sequences of ±1's, and these are counted only once in H(x(1), …, x(N)). Let us look at one example.

Suppose there are N=3 data points in X so that we have x(1), x(2), x(3) (red color is +1 and blue color is −1). Use any method to build a linear classifier (it could be a linear regression or a perceptron algorithm). Since there are infinitely many lines we can draw in the 2D plane, the hypothesis set H contains infinitely many hypotheses. Now, let us assume that the training data x(1), x(2), x(3) are located at positions A, B, C respectively, as illustrated in Fig. 19. These locations are fixed, and the 3 data points x(1), x(2), x(3) must stay at these three locations. For this particular configuration of the locations, we can make as many as 2³ = 8 dichotomies. Notice that one dichotomy can still correspond to infinitely many hypotheses. For example, in the top left case of Fig. 19, we can move the yellow decision boundary up and down slightly, and we will still get the same dichotomy of [−1,−1,−1]. However, as we move the decision boundary away by changing the slope and intercept, we will eventually land on a different dichotomy, e.g., [−1,+1,−1] as shown in the bottom left of Fig. 19. As we move around the decision boundary, we can construct at most 8 dichotomies for x(1), x(2), x(3) located at A, B and C.

Typo: I think the bottom left [−1,+1,−1] has the yellow line drawn wrongly; it should cut through such that the −1's are on one side and the +1 is on the other.

What if we move x(1), x(2), x(3) somewhere else, for example to the locations specified by the red part of Fig. 19? In this case some dichotomies are not allowed, e.g., [+1,−1,+1] and [−1,+1,−1] are not allowed, because our hypothesis set contains only linear models, and a linear model cannot separate 3 collinear data points of alternating classes with a straight line. We can still get the remaining six configurations, but the total will be less than 8. The total number of dichotomies here is 6.

../../../_images/ece595_fig4.5.jpeg

Fig. 19 For a fixed configuration of x(1),x(2),x(3), we can obtain different numbers of dichotomies. Suppose the hypothesis set contains linear models. [Left] There are 8 dichotomies for three data points located not on a line. [Right] When the three data points are located on a line, the number of dichotomies becomes 6.#

Now we want to define a quantity that measures the number of dichotomies. This quantity should be universal for any configuration of x(1),,x(N), and should only be a function of H and N. If we can obtain such quantity, then we will have a way to make a better estimate than M. To eliminate the dependency on x(1),,x(N), we realize that among all the possible configurations of x(1),,x(N), there exists one that can maximize the size of H(x(1),,x(N)). Define this maximum as the growth function.

Definition 112 (Growth Function)

The growth function for a hypothesis set H is

$$m_{\mathcal{H}}(N) = \max_{\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(N)} \in \mathcal{X}} \left|\mathcal{H}(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(N)})\right| \tag{67}$$

where |·| denotes the cardinality of a set.

Example 22 (Example of Growth Function)

For example, mH(3) of a linear model is 8 , because if we configure x(1),x(2),x(3) like the ones in the green part of Fig. 19, we will get 8 dichotomies. Of course, if we land on the red case we will get 6 dichotomies only, but the definition of mH(3) asks for the maximum which is 8. How about mH(N) when N=4 ? It turns out that there are at most 14 dichotomies no matter where we put the four data points x(1),x(2),x(3),x(4).

So what is the difference between mH(N) and M? Both are measures of the number of hypotheses. However, mH(N) is measured on the N training samples in X, whereas M is the number of hypotheses we have in H. The latter could be infinite; the former is upper bounded by 2^N. Why 2^N? Suppose we have N data points and the hypothesis is binary. Then the set of all dichotomies H(x(1), …, x(N)) must be a subset of {+1,−1}^N, and hence there are at most 2^N dichotomies:

$$m_{\mathcal{H}}(N) \leq 2^N$$

If a hypothesis set H is able to generate all 2^N dichotomies, then we say that H shatters x(1), …, x(N). For example, a 2D perceptron is able to shatter 3 data points because mH(3) = 2³ = 8. However, the same 2D perceptron is not able to shatter 4 data points because mH(4) = 14 < 2⁴ = 16.
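The dichotomy counts above can be verified by brute force. The sketch below (my own illustration, assuming 2D linear classifiers with a bias term; the triangle and square point sets are arbitrary examples) checks each of the 2^N labelings with a small feasibility linear program: a labeling is an achievable dichotomy iff there exist (w, b) with y_n (w·x_n + b) ≥ 1 for all n.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def is_linearly_separable(X, y):
    """Feasibility LP over z = [w1, w2, b] with constraints -y_n * (w . x_n + b) <= -1."""
    A_ub = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])
    b_ub = -np.ones(len(X))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.status == 0            # feasible => this labeling is an achievable dichotomy

def count_dichotomies(X):
    labelings = itertools.product([-1, 1], repeat=len(X))
    return sum(is_linearly_separable(X, np.array(lab)) for lab in labelings)

triangle = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])              # 3 points, not collinear
square = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])    # 4 points
print(count_dichotomies(triangle))   # 8  -> three points can be shattered
print(count_dichotomies(square))     # 14 -> four points cannot (14 < 2^4 = 16)
```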

Definition 113 (Restricted Hypothesis Space)

By only choosing the distinct effective hypotheses on the dataset S, we restrict the hypothesis space H to a smaller subspace that depends on the dataset. We call this new hypothesis space a restricted one:

$$\mathcal{H}_{|S}$$

Ah, what is the consequence of all this? Can we do better with the generalization bound (65) defined in Theorem 63?

The most straightforward step is to replace M by mH(N):

$$\hat{\mathcal{R}}_S(h_S) - \sqrt{\frac{1}{2N}\log\frac{2 m_{\mathcal{H}}(N)}{\delta}} \;\leq\; \mathcal{R}_{\mathcal{D}}(h_S) \;\leq\; \hat{\mathcal{R}}_S(h_S) + \sqrt{\frac{1}{2N}\log\frac{2 m_{\mathcal{H}}(N)}{\delta}}$$

Since we know that

$$m_{\mathcal{H}}(N) \leq 2^N$$

a natural attempt is to upper bound mH(N) by 2N.

However, this will not help us because

$$\sqrt{\frac{1}{2N}\log\frac{2 m_{\mathcal{H}}(N)}{\delta}} \;\leq\; \sqrt{\frac{1}{2N}\log\frac{2\,(2^N)}{\delta}} \;=\; \sqrt{\frac{1}{2N}\log\frac{2^{N+1}}{\delta}}$$

For large N we can approximate 2^{N+1} ≈ 2^N, and so

$$\sqrt{\frac{1}{2N}\log\frac{2^N}{\delta}} = \sqrt{\frac{N\log 2 - \log\delta}{2N}} = \sqrt{\frac{\log 2}{2} - \frac{\log\delta}{2N}} \;\longrightarrow\; \sqrt{\frac{\log 2}{2}}.$$

Therefore, as N → ∞, the error bar never approaches zero but instead tends to a constant. This makes the bound fail to certify generalization.

Furthermore, even though the restricted hypothesis space H|S is now "finite", 2^N is exponential in N and grows too fast for large datasets, which makes the odds in our inequality go bad too fast! Is that the best bound we can get on the growth function? Can we do better?

The VC Dimension#

Can we find a better upper bound on mH(N) so that we can send the error bar to zero as N grows? Here we introduce some definitions that allow us to characterize the growth function.

Definition 114 (Shattering)

If a hypothesis set H is able to generate all 2^N dichotomies, then we say that H shatters x(1), …, x(N).

In other words, a set of data points S is shattered by a hypothesis class H if there are hypotheses in H that split S in all of the 2^{|S|} possible ways; i.e., all possible ways of classifying points in S are achievable using concepts in H.

Definition 115 (VC Dimension of H)

The Vapnik-Chervonenkis dimension of a hypothesis set H, denoted by VCdim, is the largest value of N for which mH(N) = 2^N.

Example 23 (VC Dimension of a 2D Perceptron)

For example, consider the 2D perceptron algorithm, which has hypothesis h of the following form:

$$h(\mathbf{x}) = \operatorname{sign}\left(\mathbf{x}^T \mathbf{w}\right)$$

We can start with N=3, and gradually increase N until we hit a critical point.

Suppose N=3. Recall that mH(3) is the maximum number of dichotomies that can be generated by a hypothesis set on N=3 data points. As we have shown earlier, as long as the 3 data points are not on a straight line, it is possible to draw 8 different dichotomies. If the 3 data points are on a straight line, we can only generate 6 dichotomies. However, since mH(3) picks the maximum, we have that mH(3) = 2³ = 8. Therefore, a 2D perceptron can shatter 3 data points.

Suppose N=4. As we have discussed earlier, if we have N=4 data points, there are always 2 dichotomies that cannot be generated by the perceptron algorithm. This implies that the growth function is mH(4) = 14 < 2⁴ = 16. Since the perceptron algorithm can shatter N=3 data points but not N=4 data points, the VC dimension is VCdim = 3.

The following animation shows how many ways a linear classifier in 2D can label 3 points (on the left) and 4 points (on the right).

../../../_images/shatter.gif

Fig. 20 Image Credit: Machine Learning Theory by Mostafa.#

Actually, no linear classifier in 2D can shatter any set of 4 points, not just that set; because there will always be two labellings that cannot be produced by a linear classifier. But why? See the image below.

../../../_images/impossible-dichotomy.png

Fig. 21 Image Credit: Machine Learning Theory by Mostafa.#

If we arrange the points in a rectangular way like the one in the image, and let the blue dots be class −1 and the red dots be class +1, then there is no way for a single line (a hyperplane in higher dimensions) to separate the points into the two classes.

Moreover, no matter how you arrange these 4 points, whether it be in a line, a rectangle, an S-shape or any other arrangement, there will always be at least 2 dichotomies that cannot be generated by a linear classifier. Maybe you could draw some arrangements for yourself and see! Come to think of it, even if you arrange your 4 points in a circle or Z-shape, there will always be 4 "corners" which resemble the 4 points in the image above.

We will soon see how this impossibility of shattering 4 points relates to the VC dimension of a hypothesis set, how the idea scales to N data points, and how this can lead us to a better bound!

First, let’s find a general formula for the VC dimension of a perceptron algorithm.

Theorem 64 (VC Dimension of a Perceptron)

In general, if we have a high-dimensional perceptron algorithm, we can show this:

Consider the input space $\mathcal{X} = \mathbb{R}^D \times \{1\}$ (so that $\mathbf{x} = [x_1, \ldots, x_D, 1]^T$). The VC dimension of the perceptron algorithm is

VCdim=D+1

Proof. See page 16 of ECE595: Learning Theory

Sauer’s Lemma and Bounding the Growth Function#

Now that we have the VC dimension, we can bound the growth function. The following lemma shows that mH(N) is indeed upper bounded by a polynomial of order no greater than VCdim.

Lemma 2 (Sauer’s Lemma)

Let VCdim be the VC dimension of a hypothesis set H, then

$$m_{\mathcal{H}}(N) \leq \sum_{i=0}^{\mathrm{VCdim}} \binom{N}{i}$$

Proof. Proof can be found in Theorem 2.4 of AML’s Learning from Data textbook.

The bound on the growth function provided by Sauer’s Lemma is indeed much better than the exponential one we already have because it is actually a polynomial function.

Using this result, we can show that

$$m_{\mathcal{H}}(N) \leq \sum_{i=0}^{\mathrm{VCdim}} \binom{N}{i} \leq \left(\frac{Ne}{\mathrm{VCdim}}\right)^{\mathrm{VCdim}} \sim O\!\left(N^{\mathrm{VCdim}} + 1\right)$$

where e is the base of the natural logarithm and O is the Big-O notation describing a function's asymptotic behavior.
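A quick numerical comparison (illustrative only; the choice VCdim = 3 is arbitrary) makes the difference between the two growth-function bounds obvious: the Sauer bound is polynomial in N while 2^N explodes.

```python
from math import comb

def sauer_bound(N, vc_dim):
    """Sum_{i=0}^{VCdim} C(N, i), the polynomial bound from Sauer's Lemma."""
    return sum(comb(N, i) for i in range(vc_dim + 1))

vc_dim = 3   # e.g. a perceptron in 2D
for N in [5, 10, 50, 100]:
    print(f"N = {N:3d}   2^N = {2**N}   Sauer bound = {sauer_bound(N, vc_dim)}")
```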

If we substitute $m_{\mathcal{H}}(N)$ by this upper bound $N^{\mathrm{VC}_{\mathrm{dim}}}+1$, then the generalization bound becomes

$$\epsilon=\sqrt{\frac{1}{2 N} \log \frac{2 m_{\mathcal{H}}(N)}{\delta}} \leq \sqrt{\frac{1}{2 N} \log \frac{2\left(N^{\mathrm{VC}_{\mathrm{dim}}}+1\right)}{\delta}}.$$

How do we interpret the VC dimension? The VC dimension can be informally viewed as the effective number of parameters of a model. A higher VC dimension means a more complex model, and hence a more diverse hypothesis set $\mathcal{H}$. We have just shown that for a perceptron or linear classifier, the VC dimension is $D+1$, where $D$ is the dimension of the input space. We have also seen that even in 2-dimensional space, with $N=4$ points, the linear/perceptron classifier cannot shatter the points: being linear, it cannot model the non-linear decision boundary needed to separate certain labellings of the 4 points. However, a complex model like a deep neural network can easily shatter 4 points in 2D space because its decision boundary can be very complex. See the moons dataset in scikit-learn and have a look; a small sketch follows below.
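As a quick check of this claim (my own sketch, assuming scikit-learn is installed; the classifier choices are illustrative, not from the source):

```python
# On the two-moons data, a linear model cannot fit all the labels,
# while a more complex model (an RBF-kernel SVM here) fits them easily.
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.05, random_state=42)

linear_clf = LogisticRegression().fit(X, y)
rbf_clf = SVC(kernel="rbf", gamma=2.0).fit(X, y)

print("linear in-sample accuracy:", linear_clf.score(X, y))  # noticeably below 1.0
print("RBF-SVM in-sample accuracy:", rbf_clf.score(X, y))    # essentially 1.0
```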

As a result, the growth function mH(N) will be big. (Think about the number of dichotomies that can be generated by a complex model versus a simple model, and hence the overlap we encounter in the union bound.) There are two scenarios of the VC dimension.

  1. $\mathrm{VC}_{\mathrm{dim}}<\infty$. This implies that the generalization error will go to zero as $N$ grows:

    $$\epsilon=\sqrt{\frac{1}{2 N} \log \frac{2\left(N^{\mathrm{VC}_{\mathrm{dim}}}+1\right)}{\delta}} \rightarrow 0$$

    as $N \rightarrow \infty$, because $(\log N)/N \rightarrow 0$. If this is the case, then the final hypothesis $h_S \in \mathcal{H}$ will generalize. Such a generalization result holds independently of the learning algorithm $\mathcal{A}$, independently of the input distribution $P_{\mathcal{D}}$, and independently of the target function $f$. It only depends on the hypothesis set $\mathcal{H}$ and the training examples $\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(N)}$.

  2. $\mathrm{VC}_{\mathrm{dim}}=\infty$. This means that the hypothesis set $\mathcal{H}$ is as diverse as it can be, and it is not possible to generalize. The generalization error will never go to zero.

Are we all set about the generalization bound? It turns out that we need some additional technical modifications to ensure the validity of the generalization bound. We shall not go into the details but just state the result.

Theorem 65 (The VC Generalization Bound)

For any tolerance $\delta>0$,

$$R_{\mathcal{D}}\left(h_S\right) \leq \hat{R}_S\left(h_S\right)+\sqrt{\frac{8}{N} \log \frac{4 m_{\mathcal{H}}(2 N)}{\delta}} \tag{68}$$

with probability at least $1-\delta$.

To this end, we have more or less answered the question “Is Learning Feasible?”.

Interpreting the Generalization Bound#

The VC generalization bound in (68) is universal in the sense that it applies to all hypothesis sets $\mathcal{H}$, learning algorithms $\mathcal{A}$, input spaces $\mathcal{X}$, distributions $P_{\mathcal{D}}$, and binary target functions $f$. So can we use the VC generalization bound to predict the exact generalization error for any learning scenario? Unfortunately, the answer is no. The VC generalization bound is a valid upper bound but also a very loose one. The looseness of the generalization bound comes from the following reasons (among others):

  • The Hoeffding inequality has slack. The inequality works for all values of $R_{\mathcal{D}}$. However, the behavior of $R_{\mathcal{D}}$ could be very different at different values, e.g., at 0 or at 0.5. Using one bound to capture both cases results in some slack.

  • The growth function $m_{\mathcal{H}}(N)$ gives the worst-case number of dichotomies. If we draw the $N$ data points at random, it is unlikely that we will land on the worst case, and hence the typical number of dichotomies could be far smaller than $m_{\mathcal{H}}(N)$, even when $m_{\mathcal{H}}(N)=2^N$.

  • Bounding mH(N) by a polynomial introduces further slack.

Therefore, the VC generalization bound should only be used as a rough guideline for understanding how well the learning algorithm generalizes.

Sample Complexity#

Sample complexity concerns the number of training samples $N$ we need to achieve a target generalization performance. Recall the generalization bound:

$$R_{\mathcal{D}}\left(h_S\right) \leq \hat{R}_S\left(h_S\right)+\sqrt{\frac{8}{N} \log \frac{4 m_{\mathcal{H}}(2 N)}{\delta}}.$$

Fix a $\delta>0$. If we want the generalization error to be at most $\epsilon$, we can enforce

$$\sqrt{\frac{8}{N} \log \frac{4 m_{\mathcal{H}}(2 N)}{\delta}} \leq \epsilon.$$

Rearranging the terms yields $N \geq \frac{8}{\epsilon^2} \log \left(\frac{4 m_{\mathcal{H}}(2 N)}{\delta}\right)$. If we replace $m_{\mathcal{H}}(2 N)$ by its polynomial bound in terms of the VC dimension, we obtain a similar bound

$$N \geq \frac{8}{\epsilon^2} \log \left(\frac{4\left((2 N)^{\mathrm{VC}_{\mathrm{dim}}}+1\right)}{\delta}\right).$$

Example 24 (Sample Complexity)

Suppose $\mathrm{VC}_{\mathrm{dim}}=3$, $\epsilon=0.1$ and $\delta=0.1$ (90% confidence). The number of samples we need satisfies

$$N \geq \frac{8}{0.1^2} \log \left(\frac{4\left((2 N)^3+1\right)}{0.1}\right).$$

If we plug $N=1000$ into the right-hand side, we obtain

$$N \geq \frac{8}{0.1^2} \log \left(\frac{4\left((2 \times 1000)^3+1\right)}{0.1}\right) \approx 21{,}193.$$

If we repeat the calculation by plugging in $N=21{,}193$, obtaining a new $N$, and iterating, we will eventually converge to $N \approx 30{,}000$. If $\mathrm{VC}_{\mathrm{dim}}=4$, we obtain $N \approx 40{,}000$ samples. This means that every unit of $\mathrm{VC}_{\mathrm{dim}}$ corresponds to roughly $10{,}000$ samples. In practice, we may require significantly fewer samples; a typical rule of thumb is approximately $10 \times \mathrm{VC}_{\mathrm{dim}}$ samples. The fixed-point iteration is sketched below.
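Here is a small sketch (my own code, not from the source; the function name `sample_complexity` is mine) of that fixed-point iteration:

```python
# Fixed-point iteration for the sample complexity in Example 24: keep plugging
# the current guess of N into the right-hand side until it stops changing.
import math


def sample_complexity(vc_dim: int, eps: float = 0.1, delta: float = 0.1,
                      n_init: float = 1000.0, iters: int = 50) -> float:
    N = n_init
    for _ in range(iters):
        N = (8 / eps**2) * math.log(4 * ((2 * N) ** vc_dim + 1) / delta)
    return N


print(sample_complexity(vc_dim=3))  # roughly 3 * 10^4, as in the example
print(sample_complexity(vc_dim=4))  # roughly 4 * 10^4
```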

Model Complexity#

The other piece of information that can be obtained from the generalization bound is how complex the model could be. If we look at the generalization bound, we realize that the error $\epsilon$ is a function of $N$, $\mathcal{H}$ and $\delta$:

$$R_{\mathcal{D}}\left(h_S\right) \leq \hat{R}_S\left(h_S\right)+\underbrace{\sqrt{\frac{8}{N} \log \frac{4 m_{\mathcal{H}}(2 N)}{\delta}}}_{=\,\epsilon(N, \mathcal{H}, \delta)}$$

If we replace $m_{\mathcal{H}}(2 N)$ by $(2 N)^{\mathrm{VC}_{\mathrm{dim}}}+1$, then we can write $\epsilon(N, \mathcal{H}, \delta)$ as

$$\epsilon\left(N, \mathrm{VC}_{\mathrm{dim}}, \delta\right)=\sqrt{\frac{8}{N} \log \left(\frac{4\left((2 N)^{\mathrm{VC}_{\mathrm{dim}}}+1\right)}{\delta}\right)}$$

The three factors $N$, $\mathrm{VC}_{\mathrm{dim}}$ and $\delta$ have different influences on the error $\epsilon$:

  • $\mathrm{VC}_{\mathrm{dim}}$: The VC dimension controls the complexity of the model. As $\mathrm{VC}_{\mathrm{dim}}$ grows, the in-sample error $E_{\text{in}}$ drops because a larger $\mathrm{VC}_{\mathrm{dim}}$ implies a more complex model that can fit the training data. However, $\epsilon$ grows as $\mathrm{VC}_{\mathrm{dim}}$ grows: with a very complex model, it is more difficult to generalize to the out-samples. The trade-off between model complexity and generalization is shown in Fig. 22. The blue curve represents the in-sample error $E_{\text{in}}$, which drops as $\mathrm{VC}_{\mathrm{dim}}$ increases. The red curve represents the model complexity, which increases with $\mathrm{VC}_{\mathrm{dim}}$. The black curve is the out-sample error $R_{\mathcal{D}}$. There exists an optimal model complexity at which $R_{\mathcal{D}}$ is minimized (see the small numerical sketch after Fig. 22).

  • $N$: A large number of training samples always helps the generalization bound, as reflected by the fact that $\epsilon(N, \mathcal{H}, \delta) \rightarrow 0$ as $N \rightarrow \infty$.

  • $\delta$: The confidence level tells us how stringent we want the generalization guarantee to be. If we want a very high confidence, e.g., 99.99%, then we need a very small $\delta=0.0001$. This in turn affects the number of training samples $N$ required to achieve the confidence level and the desired error bound.

../../../_images/ece595_fig4.6.jpeg

Fig. 22 The VC generalization bound suggests a trade-off between model complexity and generalization. If we use a more complex model, the in-sample error drops but the out-sample error increases. The optimal model complexity is determined when the out-sample error is minimized.#
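To see the trade-off numerically, here is a small sketch (my own illustration, following the formula above) of how the penalty term $\epsilon(N, \mathrm{VC}_{\mathrm{dim}}, \delta)$ grows with the VC dimension for a fixed $N$ and $\delta$:

```python
# Evaluate the model-complexity penalty epsilon(N, VCdim, delta) for several
# VC dimensions at a fixed sample size and confidence level.
import math


def penalty(N: int, vc_dim: int, delta: float = 0.1) -> float:
    return math.sqrt((8 / N) * math.log(4 * ((2 * N) ** vc_dim + 1) / delta))


N = 10_000
for d in [1, 2, 5, 10, 20, 50]:
    print(f"VCdim={d:2d}  epsilon={penalty(N, d):.3f}")  # grows with VCdim
```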

Testing Data#

The VC analysis provides us with a good guideline for training a model. However, the estimate provided by the VC analysis is often too loose to give an accurate prediction of $R_{\mathcal{D}}$. In practice, no one really uses VC analysis to inform a training process. What is more often used is a testing dataset. The testing dataset

$$\mathcal{S}_{\text{test}}=\left\{\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(Q)}\right\}$$

contains $Q$ samples drawn from the distribution $P_{\mathcal{D}}(\mathbf{x})$. No testing sample $\mathbf{x}^{(q)}$ can appear in the training dataset $\mathcal{S}_{\text{train}}$.

Since in the testing phase the final hypothesis $h_S$ is already determined, we do not run into the trouble we had in the training phase, where we needed the union bound to account for the $M$ candidate hypotheses in $\mathcal{H}$. As a result, the Hoeffding inequality simplifies to

$$\mathbb{P}\left\{\left|\hat{R}_S\left(h_S\right)-R_{\mathcal{D}}\left(h_S\right)\right|>\epsilon\right\} \leq 2 e^{-2 \epsilon^2 Q}$$

and the generalization bound becomes

$$R_{\mathcal{D}}\left(h_S\right) \leq \hat{R}_S\left(h_S\right)+\sqrt{\frac{1}{2 Q} \log \frac{2}{\delta}}.$$

Therefore, as the number of testing samples increases, we can certify the out-sample error by evaluating $\hat{R}_S\left(h_S\right)$ on the testing samples. A small sketch of this bound is given below.
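For a feel of the numbers, here is a tiny sketch (my own, with illustrative values) of the test-set margin $\sqrt{\frac{1}{2Q}\log\frac{2}{\delta}}$ for a few choices of $Q$:

```python
# Width of the test-set confidence interval: with Q test samples and confidence
# 1 - delta, the out-sample error is within sqrt(log(2/delta) / (2Q)) of the
# measured test error.
import math

delta = 0.05
for Q in [100, 1_000, 10_000, 100_000]:
    margin = math.sqrt(math.log(2 / delta) / (2 * Q))
    print(f"Q={Q:>7,}  test-error margin = +/- {margin:.4f}")
```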

There are a few reminders about using the testing data:

  • The common notion of testing accuracy is based on $\hat{R}_S\left(h_S\right)$ calculated from the $Q$ testing samples. A small $\hat{R}\left(h_S\right)$ on one particular testing dataset does not by itself imply that we will generalize well: if we switch to another testing dataset, $\hat{R}\left(h_S\right)$ will change because it is a numerical value based on an empirical average. What is guaranteed by the generalization bound is that as long as $Q$ is sufficiently large, $R_{\mathcal{D}}\left(h_S\right)$ will stay close to $\hat{R}_S\left(h_S\right)$ no matter which particular testing dataset we use. There is a variance associated with $\hat{R}\left(h_S\right)$, and this variance is reflected by $\sqrt{\frac{1}{2 Q} \log \frac{2}{\delta}}$.

  • The testing data has to be used after the hypothesis is determined. If we ever use the testing data as feedback to re-select the hypothesis, it is cheating. For example, we cannot train an SVM, submit it to a competition website, and then use the misclassified samples to re-design the SVM.

  • In principle, the generalization bound improves when we have more testing samples. However, most practical datasets only have training data points and no dedicated testing data points, so we partition the training set into a training part and a validation part. The proportion of training to validation needs to be carefully chosen: if we allocate too many samples for validation, we will lose our ability to train a good classifier.

Why must h be fixed and defined before generating the dataset S?#

Question and Answer from MathStackExchange.

Some intuition on the difference between a-priori and a-posteriori:#

Let $\{Y_1, \ldots, Y_4\}$ be i.i.d. uniform random variables over $[0,1]$. So for each $m \in\{1, \ldots, 4\}$ we have

$$\mathbb{P}\left[Y_m>15/16\right]=1/16=0.0625$$

A-priori:#

Let's a-priori pick an index $K \in\{1,2,3,4\}$, independent of the $Y_i$ variables and before observing these variables. We can use any probability mass function $\mathbb{P}[K=m]$ for $m \in\{1,2,3,4\}$. What is $\mathbb{P}\left[Y_K>15/16\right]$? It is still $1/16$ (regardless of the mass function we use) because, by the law of total probability:

$$\begin{aligned}
\mathbb{P}\left[Y_K>15/16\right] &=\sum_{m=1}^4 \mathbb{P}\left[Y_K>15/16 \mid K=m\right] \mathbb{P}[K=m] \\
&=\sum_{m=1}^4 \underbrace{\mathbb{P}\left[Y_m>15/16 \mid K=m\right]}_{\mathbb{P}\left[Y_m>15/16\right]} \mathbb{P}[K=m] \\
&=\sum_{m=1}^4(1/16)\, \mathbb{P}[K=m]=1/16
\end{aligned}$$

where the equality $\mathbb{P}\left[Y_m>15/16 \mid K=m\right]=\mathbb{P}\left[Y_m>15/16\right]$ holds because $Y_m$ is independent of $K$.

A-posteriori:#

Now let's a-posteriori pick the index $K$ associated with the largest $Y_m$ value, so $Y_K=\max \left[Y_1, Y_2, Y_3, Y_4\right]$. This choice of index $K$ depends on the observed data. Then:

$$\mathbb{P}\left[Y_K>15/16\right]=1-\mathbb{P}\left[Y_1 \leq 15/16,\, Y_2 \leq 15/16,\, Y_3 \leq 15/16,\, Y_4 \leq 15/16\right]=1-(15/16)^4 \approx 0.2275$$

and so this probability is much larger than $1/16$, even though $Y_K$ is just one of the $Y_m$ values and we know $\mathbb{P}\left[Y_m>15/16\right]=1/16$ for all $m \in\{1, \ldots, 4\}$.

On the other hand, we know by the union bound that

$$\left\{Y_K>15/16\right\} \subseteq \bigcup_{m=1}^4\left\{Y_m>15/16\right\} \implies \mathbb{P}\left[Y_K>15/16\right] \leq \sum_{m=1}^4 \mathbb{P}\left[Y_m>15/16\right]=1/4$$

and indeed $0.2275 \leq 1/4$.
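The two probabilities are easy to check by simulation; the following sketch (my own code, assuming numpy) estimates both:

```python
# Monte-Carlo check of the a-priori versus a-posteriori probabilities above:
# picking an index before looking at the data versus picking the argmax.
import numpy as np

rng = np.random.default_rng(0)
trials = 200_000
Y = rng.uniform(size=(trials, 4))

# A-priori: pick K uniformly at random, independently of Y.
K = rng.integers(0, 4, size=trials)
print("a-priori     P[Y_K > 15/16] ~", np.mean(Y[np.arange(trials), K] > 15 / 16))  # ~ 0.0625

# A-posteriori: pick K = argmax_m Y_m after seeing the data.
print("a-posteriori P[Y_K > 15/16] ~", np.mean(Y.max(axis=1) > 15 / 16))            # ~ 0.2275
```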

Your specific setup#

As in my above comment, I believe we need the following extra assumptions: We have a finite set $\mathcal{X}$ and an unknown function $f: \mathcal{X} \rightarrow \mathbb{R}$. We are certain that $f$ is one of the $M$ functions in the known set $\left\{h_1, \ldots, h_M\right\}$. We have the ability to exactly evaluate the function $f$ one step at a time. However, the set $\mathcal{X}$ is too large, so we want to do a probabilistic estimate. At every time step $i$ we independently choose $X_i \in \mathcal{X}$, uniformly over all points in $\mathcal{X}$. We then observe the value of $f\left(X_i\right)$ (with no noise).

So for a given function $h_m$ we define $\phi_m$ by:

$$\phi_m=\mathbb{P}\left[f\left(X_i\right) \neq h_m\left(X_i\right)\right]=\frac{\text{number of points } x \in \mathcal{X} \text{ such that } f(x) \neq h_m(x)}{|\mathcal{X}|}$$

For each $m \in\{1, \ldots, M\}$ define the sequence of indicator functions $\left\{I_i(m)\right\}_{i=1}^{\infty}$ by

$$I_i(m)= \begin{cases}1 & \text{if } h_m\left(X_i\right) \neq f\left(X_i\right) \\ 0 & \text{otherwise}\end{cases}$$

For any fixed $m \in\{1, \ldots, M\}$, the indicators $\left\{I_i(m)\right\}_{i=1}^{\infty}$ are i.i.d., so we can apply Hoeffding. Note that for each individual $m$, Hoeffding is a bound on the following a-priori probability:

$$\mathbb{P}\left[\left|\frac{1}{N} \sum_{i=1}^N I_i(m)-\phi_m\right|>\epsilon\right] \leq 2 e^{-2 \epsilon^2 N} \tag{Eq. 1}$$

and it is nice that the bound on the right-hand side does not depend on the index $m$.

Your specific questions#

  1. Your first question asks why Hoeffding requires a fixed function $h$ rather than a random function $H$. It is because Hoeffding applies to i.i.d. random variables. If we fix an index $m \in\{1, \ldots, M\}$, then the indicators $\left\{I_1(m), I_2(m), I_3(m), \ldots\right\}$ are i.i.d. over the indices $i \in\{1,2,3, \ldots\}$. If we have a random index $K$, then the indicators $\left\{I_1(K), I_2(K), I_3(K), \ldots\right\}$ are not necessarily independent because they share a common random index $K$.

2-4) Your remaining questions boil down to the difference between a-priori probability and a-posteriori probability. The Hoeffding bounds in (Eq. 1) are a-priori bounds that hold for all $m \in\{1, \ldots, M\}$. They are bounds on the probability that the data behaves in a certain way. That probability (and its bound) is determined without observing the actual data outcome (in the same way that the probability of a fair coin flip being heads is $1/2$, and this probability is determined without looking at the outcome of the flip).

If we a-priori pick a random index $K \in\{1, \ldots, M\}$ (before observing the data and independent of the data), then we do not need the union bound:

$$\begin{aligned}
\mathbb{P}\left[\left|\frac{1}{N} \sum_{i=1}^N I_i(K)-\phi_K\right|>\epsilon\right] &=\sum_{m=1}^M \mathbb{P}\left[\left|\frac{1}{N} \sum_{i=1}^N I_i(K)-\phi_K\right|>\epsilon \;\middle|\; K=m\right] \mathbb{P}[K=m] \\
&=\sum_{m=1}^M \mathbb{P}\left[\left|\frac{1}{N} \sum_{i=1}^N I_i(m)-\phi_m\right|>\epsilon \;\middle|\; K=m\right] \mathbb{P}[K=m] \\
&\stackrel{(a)}{=} \sum_{m=1}^M \mathbb{P}\left[\left|\frac{1}{N} \sum_{i=1}^N I_i(m)-\phi_m\right|>\epsilon\right] \mathbb{P}[K=m] \\
&\leq \sum_{m=1}^M\left[2 e^{-2 \epsilon^2 N}\right] \mathbb{P}[K=m]=2 e^{-2 \epsilon^2 N}
\end{aligned}$$

where equality (a) holds because the random index $K$ is independent of the data $\left\{I_i(m)\right\}_{i=1}^{\infty}$. So, if we were to a-priori pick a guess function $g$, independent of the data, by just picking a random index, the bound would indeed be $2 e^{-2 \epsilon^2 N}$ rather than $2 M e^{-2 \epsilon^2 N}$.

However, if we observe the results of the data and then a-posteriori pick a random index $K \in\{1, \ldots, M\}$, the way we choose might lead us to pick an index that exhibits "atypical" a-posteriori sample paths. So the equality (a) in the above chain does not necessarily hold for such picks. We need to be more careful by using the union bound.


Exercise: Code up the Hoeffding inequality with a plot. One possible sketch is given below.
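One possible solution sketch (my own, assuming numpy and matplotlib; the bias, tolerance, and sample sizes are illustrative) that compares the empirical probability of a large deviation against the Hoeffding bound $2e^{-2\epsilon^2 N}$:

```python
# Empirically estimate P[|nu - mu| > eps] for a coin with bias mu, over many
# repeated experiments, and compare against the Hoeffding bound 2 exp(-2 eps^2 N).
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
mu, eps, runs = 0.5, 0.1, 20_000
Ns = np.arange(10, 210, 10)

empirical = []
for N in Ns:
    nu = rng.binomial(n=N, p=mu, size=runs) / N       # empirical means over many runs
    empirical.append(np.mean(np.abs(nu - mu) > eps))  # fraction of "bad" runs

plt.semilogy(Ns, empirical, "o-", label="empirical P[|nu - mu| > eps]")
plt.semilogy(Ns, 2 * np.exp(-2 * eps**2 * Ns), "--", label="Hoeffding bound")
plt.xlabel("N (number of samples)")
plt.ylabel("probability")
plt.legend()
plt.show()
```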

Further Readings#

This is a difficult topic to learn. I recommend the following resources:


1

Sometimes f is denoted as c, the concept function.