Concept#

Problem Formulation#

Learning curves play a critical role in understanding the performance of a machine learning model. They allow us to visualize the model’s behavior as we increase the size of the training data or the number of iterations during the training process. By doing this, we can identify whether the model suffers from high bias (underfitting) or high variance (overfitting), and assess the potential benefits of adding more training data or adjusting the model’s complexity.

A typical learning curve plots the value of the loss function for the training set and a validation set against the number of training examples or the number of iterations used during training. By examining the gap between the training and validation loss, we can gauge whether the model is overfitting or underfitting. For example, if both the training and validation loss converge to a high value, the model is likely underfitting, indicating that it may not be sufficiently complex to capture the underlying patterns in the data. Conversely, if the training loss is low but the validation loss remains high, the model is likely overfitting, suggesting that it has learned the noise in the training data instead of the underlying structure.
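The reading of the gap described above can be sketched as a small rule of thumb. This is only an illustration: the function name `diagnose` and the two thresholds `acceptable_loss` and `gap_tolerance` are hypothetical, and sensible values depend entirely on the task and the loss being used.

```python
def diagnose(train_loss: float, val_loss: float,
             acceptable_loss: float, gap_tolerance: float) -> str:
    """Classify a (training, validation) loss pair with two hypothetical thresholds."""
    gap = val_loss - train_loss
    if train_loss > acceptable_loss:
        # Even the data the model was trained on is fit poorly.
        return "underfitting (high bias)"
    if gap > gap_tolerance:
        # Training data is fit well, but the fit does not carry over to validation data.
        return "overfitting (high variance)"
    return "reasonable fit"


print(diagnose(0.90, 1.00, acceptable_loss=0.30, gap_tolerance=0.10))
print(diagnose(0.05, 0.80, acceptable_loss=0.30, gap_tolerance=0.10))
```

In practice one would look at the whole curve rather than a single pair of numbers, since the trend of the two losses matters as much as their final values.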

Learning curves can provide valuable insights into the model’s performance and help us decide whether to invest in more training data, adjust the model’s architecture or complexity, or consider alternative optimization techniques to improve convergence.

To draw a learning curve, we first need a scoring rule that measures how good the model’s predictions are on both the training and validation sets.

The Scoring Rule#

One view of machine learning is that it produces a function, \(h(\boldsymbol{x})\), which, given some information \(\boldsymbol{x}\), predicts some variable \(\boldsymbol{y}\), using training data \(\boldsymbol{X} \in \mathcal{X}\) and \(\boldsymbol{Y} \in \mathcal{Y}\). It is distinct from mathematical optimization because \(h\) should also predict well for inputs \(\boldsymbol{x}\) that are not part of the training data \(\boldsymbol{X}\).

We often constrain the possible functions to a parameterized family of functions, \(\left\{h(\boldsymbol{x}; \boldsymbol{\theta}): \boldsymbol{\theta} \in \boldsymbol{\Theta}\right\}\), so that our function is more generalizable or so that the function has certain properties such as those that make finding a good \(h\) easier, or because we have some a priori reason to think that these properties are true.

Since it is generally not possible to produce a function that fits our data perfectly, we need a loss function \(\mathcal{L}\left(h(\boldsymbol{x}; \boldsymbol{\theta}), \boldsymbol{y}\right)\) to measure how good our prediction is. We then define an optimization process that finds a \(\boldsymbol{\theta}\) minimizing \(\mathcal{L}\left(h(\boldsymbol{x}; \boldsymbol{\theta}), \boldsymbol{y}\right)\), and refer to it as \(\widehat{\boldsymbol{\theta}}\), i.e.

\[ \widehat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta} \in \boldsymbol{\Theta}}{\operatorname{argmin}} \mathcal{L}\left(h(\boldsymbol{x}; \boldsymbol{\theta}), \boldsymbol{y}\right) \]
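As a concrete instance of this argmin, consider a minimal sketch that assumes a one-parameter linear family \(h(x; \theta) = \theta x\) and a mean squared-error loss. Neither choice comes from the text above; they are picked because the minimizer then has a closed form. The names `h`, `mean_squared_loss`, and `fit_theta`, and the toy data, are illustrative only.

```python
def h(x: float, theta: float) -> float:
    """Hypothetical one-parameter hypothesis family: h(x; theta) = theta * x."""
    return theta * x


def mean_squared_loss(theta: float, xs, ys) -> float:
    """Average squared error of h(x; theta) over the data."""
    return sum((h(x, theta) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_theta(xs, ys) -> float:
    """Closed-form minimizer of the mean squared loss for h(x; theta) = theta * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
theta_hat = fit_theta(xs, ys)
print(theta_hat, mean_squared_loss(theta_hat, xs, ys))
```

For richer families there is usually no closed form, and \(\widehat{\boldsymbol{\theta}}\) is approximated with an iterative method such as gradient descent.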

The loss function is not the only way to measure the performance of a hypothesis \(h \in \mathcal{H}\). We can also use other performance metrics such as accuracy, precision, recall, F1 score, ROC curve, AUC, etc.
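To make a few of these metrics concrete, here is a small sketch for binary labels (1 for positive, 0 for negative). The helper name `classification_scores` and the example labels are made up for illustration. In practice a library such as scikit-learn provides equivalent functions.

```python
def classification_scores(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
scores = classification_scores(y_true, y_pred)
print(scores)
```

Returning 0.0 when a denominator is zero is one common convention, not the only one.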

With our scoring rule, we can now define a learning curve.

Training curve for amount of data#

This curve shows how the model’s performance changes as the amount of training data increases. It plots the loss function (a measure of the model’s prediction error) evaluated on the training data, together with the same loss function evaluated on a validation dataset, against the number of training examples. By analyzing this curve, we can determine if adding more training data is likely to improve the model’s performance, or if the model suffers from high bias or high variance.

Let our full dataset \(\mathcal{S}\) be defined as:

\[ \mathcal{S} = \left\{\left(\boldsymbol{x}^{(n)}, \boldsymbol{y}^{(n)}\right)\right\}_{n=1}^N \]

and we further assume that we split our data into two disjoint sets, \(\mathcal{S}_{\text{train}}\) and \(\mathcal{S}_{\text{val}}\), such that \(\mathcal{S}_{\text{train}} \cup \mathcal{S}_{\text{val}} = \mathcal{S}\) and \(\mathcal{S}_{\text{train}} \cap \mathcal{S}_{\text{val}} = \emptyset\).

Without loss of generality, we can do the following split.

Denote \(\mathcal{S}_{\text{train}}\) as the first \(M\) examples of \(\mathcal{S}\), which will be used for training.

\[ \mathcal{S}_{\text{train}} = \left\{\left(\boldsymbol{x}^{(1)}, \boldsymbol{y}^{(1)}\right), \left(\boldsymbol{x}^{(2)}, \boldsymbol{y}^{(2)}\right), \ldots, \left(\boldsymbol{x}^{(M)}, \boldsymbol{y}^{(M)}\right)\right\} \]

and \(\mathcal{S}_{\text{val}}\) as the remaining \(P=N-M\) examples of \(\mathcal{S}\), which will be used for validation.

\[\begin{split} \begin{aligned} \mathcal{S}_{\text{val}} &= \left\{\left(\boldsymbol{x}^{(M+1)}, \boldsymbol{y}^{(M+1)}\right), \left(\boldsymbol{x}^{(M+2)}, \boldsymbol{y}^{(M+2)}\right), \ldots, \left(\boldsymbol{x}^{(N)}, \boldsymbol{y}^{(N)}\right)\right\} \\ \end{aligned} \end{split}\]
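The split above amounts to slicing an ordered list of pairs. A minimal sketch follows, in which the helper name `split_first_m` and the toy data are illustrative. Data is commonly shuffled before such a split so that the order of the examples carries no information, and the sketch assumes that has already happened.

```python
def split_first_m(samples, m):
    """Split a list of (x, y) pairs into the first m (train) and the remaining N - m (validation)."""
    if not 0 < m < len(samples):
        raise ValueError("m must satisfy 0 < m < N so that both sets are non-empty")
    return samples[:m], samples[m:]


# Toy dataset S with N = 10 examples (values are arbitrary).
S = [(float(n), 2.0 * n) for n in range(1, 11)]
S_train, S_val = split_first_m(S, 7)  # M = 7, P = N - M = 3
print(len(S_train), len(S_val))
```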

Then the learning curve is the plot of the two curves representing the training loss and validation loss as a function of the number of training examples:

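  1. A sequence of costs on the training set:

    \[ \widehat{\mathcal{J}}^{\text{train}}_1, \widehat{\mathcal{J}}^{\text{train}}_2, \ldots, \widehat{\mathcal{J}}^{\text{train}}_M \]

    where each cost is computed as follows:

    \[ \widehat{\mathcal{J}}^{\text{train}}_m = \frac{1}{m} \sum_{i=1}^m \mathcal{L}\left(h_m\left(\boldsymbol{x}^{(i)} ; \widehat{\boldsymbol{\theta}}_m\right), \boldsymbol{y}^{(i)}\right) \]

    Here \(h_m(\cdot\,; \widehat{\boldsymbol{\theta}}_m)\) denotes the hypothesis fitted on the first \(m\) training examples only.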

    To be more explicit, this also means we have \(M\) training sets, as follows:

    \[ \mathcal{S}_1 = \left\{\left(\boldsymbol{x}^{(1)}, \boldsymbol{y}^{(1)}\right)\right\}, \mathcal{S}_2 = \left\{\left(\boldsymbol{x}^{(1)}, \boldsymbol{y}^{(1)}\right), \left(\boldsymbol{x}^{(2)}, \boldsymbol{y}^{(2)}\right)\right\}, \ldots, \mathcal{S}_M = \left\{\left(\boldsymbol{x}^{(1)}, \boldsymbol{y}^{(1)}\right), \left(\boldsymbol{x}^{(2)}, \boldsymbol{y}^{(2)}\right), \ldots, \left(\boldsymbol{x}^{(M)}, \boldsymbol{y}^{(M)}\right)\right\} \]

    In practice, we do not start with a single training example. Instead, we start with a small number of training examples and grow the training set with a step size of, say, \(100\) (i.e. \(100, 200, 300, \ldots\)).

    Consequently, we have \(M\) hypotheses, one fitted on each of these training sets:

    \[ \begin{aligned} h_1(\boldsymbol{x} ; \widehat{\boldsymbol{\theta}}_1), h_2(\boldsymbol{x} ; \widehat{\boldsymbol{\theta}}_2), \ldots, h_M(\boldsymbol{x} ; \widehat{\boldsymbol{\theta}}_M) \end{aligned} \]

  2. A sequence of costs on the validation set:

    \[ \widehat{\mathcal{J}}^{\text{val}}_{1}, \widehat{\mathcal{J}}^{\text{val}}_{2}, \ldots, \widehat{\mathcal{J}}^{\text{val}}_{M} \]

    where each cost is computed as follows:

    \[ \widehat{\mathcal{J}}^{\text{val}}_m = \frac{1}{P} \sum_{i=M+1}^N \mathcal{L}\left(h_m\left(\boldsymbol{x}^{(i)} ; \widehat{\boldsymbol{\theta}}_m\right), \boldsymbol{y}^{(i)}\right) \]

    Here \(h_m\) is the hypothesis fitted on the first \(m\) training examples. Note that the validation set size is fixed at \(P\) examples, so the validation loss is computed on the same validation set for all values of \(m\).

We can then compare the two curves to diagnose issues with the model.

We can make the estimate more robust by repeating the training and validation process multiple times for each training set size \(m\) using different seeds (for example, different shuffles of the training data) and averaging the resulting costs.
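The procedure above can be sketched end to end in a few lines. This is an illustration under stated assumptions, not a reference implementation: it assumes a one-parameter linear model \(h(x; \theta) = \theta x\) with squared-error loss, synthetic data generated as \(y = 2x\) plus Gaussian noise, and a step size of 20 examples. All function names are hypothetical.

```python
import random


def fit_theta(xs, ys):
    """Least-squares fit of the one-parameter model h(x; theta) = theta * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def mean_loss(theta, xs, ys):
    """Average squared-error loss of h(x; theta) over the given examples."""
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def learning_curve(train, val, sizes):
    """Return (m, training cost, validation cost) for each training-set size m."""
    val_x, val_y = zip(*val)
    curve = []
    for m in sizes:
        xs, ys = zip(*train[:m])     # S_m: the first m training examples
        theta_m = fit_theta(xs, ys)  # hypothesis h_m fitted on S_m only
        curve.append((m, mean_loss(theta_m, xs, ys), mean_loss(theta_m, val_x, val_y)))
    return curve


def averaged_curve(train, val, sizes, seeds):
    """Average the learning curve over several shuffles of the training data."""
    sizes = list(sizes)
    totals = [[0.0, 0.0] for _ in sizes]
    for seed in seeds:
        shuffled = train[:]
        random.Random(seed).shuffle(shuffled)
        for k, (_, train_cost, val_cost) in enumerate(learning_curve(shuffled, val, sizes)):
            totals[k][0] += train_cost
            totals[k][1] += val_cost
    return [(m, t / len(seeds), v / len(seeds)) for m, (t, v) in zip(sizes, totals)]


# Synthetic data (an assumption for this sketch): y = 2x + noise.
rng = random.Random(0)
data = []
for _ in range(300):
    x = rng.uniform(-3.0, 3.0)
    data.append((x, 2.0 * x + rng.gauss(0.0, 1.0)))

train, val = data[:200], data[200:]  # M = 200, P = 100
sizes = range(20, 201, 20)           # m = 20, 40, ..., 200
curve = learning_curve(train, val, sizes)
averaged = averaged_curve(train, val, sizes, seeds=[0, 1, 2, 3, 4])
for m, train_cost, val_cost in averaged:
    print(f"m={m:3d}  train={train_cost:.3f}  val={val_cost:.3f}")
```

The validation set stays fixed across shuffles, so only the choice of which \(m\) training examples are used varies between seeds. scikit-learn's `learning_curve` utility offers a cross-validated version of the same idea.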

Training curve for number of iterations#

This curve shows how the model’s performance changes as the number of training iterations increases. It is particularly relevant for iterative optimization algorithms like gradient descent. The curve plots the loss function evaluated on the training data, together with the same loss function evaluated on a validation dataset, against the number of iterations. This can help diagnose issues related to convergence and determine if the optimization process is working effectively.

Many optimization processes are iterative, repeating the same step until the process converges to an optimal value. Gradient descent is one such algorithm. Let \(\widehat{\boldsymbol{\theta}}_t\) denote the approximation of the optimal \(\boldsymbol{\theta}\) after \(t\) steps.

Define \(T\) to be the total number of steps (or epochs). Then the training curve for the number of iterations is the plot of the two curves representing the training loss and validation loss as a function of the number of iterations:

  1. A sequence of costs on the training set:

    \[ \widehat{\mathcal{J}}^{\text{train}}_1, \widehat{\mathcal{J}}^{\text{train}}_2, \ldots, \widehat{\mathcal{J}}^{\text{train}}_T \]

    where each cost is computed as follows:

    \[ \widehat{\mathcal{J}}^{\text{train}}_t = \frac{1}{M} \sum_{n=1}^M \mathcal{L}\left(h\left(\boldsymbol{x}^{(n)} ; \widehat{\boldsymbol{\theta}}_t\right), \boldsymbol{y}^{(n)}\right) \]

    where \(t\) is the iteration number.

  2. A sequence of costs on the validation set:

    \[ \widehat{\mathcal{J}}^{\text{val}}_{1}, \widehat{\mathcal{J}}^{\text{val}}_{2}, \ldots, \widehat{\mathcal{J}}^{\text{val}}_{T} \]

    where each cost is computed as follows:

    \[ \widehat{\mathcal{J}}^{\text{val}}_t = \frac{1}{P} \sum_{n=M+1}^N \mathcal{L}\left(h\left(\boldsymbol{x}^{(n)} ; \widehat{\boldsymbol{\theta}}_t\right), \boldsymbol{y}^{(n)}\right) \]

    where \(t\) is the iteration number.
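The two sequences above can be recorded during training with a few extra lines. The sketch below assumes full-batch gradient descent on a one-parameter linear model \(h(x; \theta) = \theta x\) with squared-error loss, synthetic data, and a learning rate of 0.05. These are all illustrative choices, and the function names are hypothetical.

```python
import random


def mean_loss(theta, data):
    """Average squared-error loss of h(x; theta) = theta * x over (x, y) pairs."""
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)


def gradient(theta, data):
    """Derivative of mean_loss with respect to theta."""
    return sum(2.0 * (theta * x - y) * x for x, y in data) / len(data)


def train_with_history(train, val, steps, learning_rate):
    """Run gradient descent and record (t, training cost, validation cost) after each step."""
    theta = 0.0
    history = []
    for t in range(1, steps + 1):
        theta -= learning_rate * gradient(theta, train)  # parameter estimate after t steps
        history.append((t, mean_loss(theta, train), mean_loss(theta, val)))
    return theta, history


# Synthetic data (an assumption for this sketch): y = 2x + noise.
rng = random.Random(0)
data = []
for _ in range(300):
    x = rng.uniform(-3.0, 3.0)
    data.append((x, 2.0 * x + rng.gauss(0.0, 1.0)))
train, val = data[:200], data[200:]

theta_T, history = train_with_history(train, val, steps=50, learning_rate=0.05)
for t, train_cost, val_cost in history[::10]:
    print(f"t={t:2d}  train={train_cost:.3f}  val={val_cost:.3f}")
```

If the training cost fails to decrease or oscillates, the learning rate is a likely culprit. If the training cost keeps falling while the validation cost starts rising, the model has probably begun to overfit.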

Improving Model Performance#

  • Get more training data: If your model suffers from high variance (overfitting), getting more training data can help improve the model’s generalization. Check the learning curve: if the validation loss is still decreasing at the largest training set size, adding more data is likely to help.

  • Smaller set of features: If your model suffers from high variance, reducing the number of features can help by simplifying the model. Use feature selection techniques such as Recursive Feature Elimination (RFE), LASSO regularization, or feature importance from tree-based models to identify and remove less important features.

  • Get additional features: If your model suffers from high bias (underfitting), adding more features can help capture more complex patterns in the data. Domain knowledge, feature engineering, or feature extraction techniques like PCA can be useful to generate new features.

  • Try adding polynomial features: If your model suffers from high bias, adding polynomial features can help capture non-linear relationships between input features and the target variable. This can be done using techniques such as PolynomialFeatures in scikit-learn.

  • Try decreasing lambda (regularization strength): If your model suffers from high bias, decreasing the regularization strength (lambda) can allow the model to fit more complex patterns in the data. This can be done by tuning the hyperparameter controlling the regularization strength in your model.

  • Try increasing lambda: If your model suffers from high variance, increasing the regularization strength can help constrain the model and improve generalization. This can be done by tuning the hyperparameter controlling the regularization strength in your model.

To determine which approach to use, you should analyze the learning curve, validation curve, and the model’s performance metrics. If the model exhibits high bias, focus on techniques that increase model complexity (additional features, polynomial features, or decreasing lambda). If the model exhibits high variance, focus on techniques that simplify the model or improve generalization (more training data, smaller feature set, or increasing lambda). Perform a thorough hyperparameter search, such as grid search or random search, to find the optimal combination of parameters for your model.
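To illustrate how the regularization strength and a grid search interact, here is a minimal sketch. It assumes an L2-regularized (ridge) objective \(\frac{1}{m}\sum_i (\theta x^{(i)} - y^{(i)})^2 + \lambda \theta^2\) for a one-parameter linear model, a small synthetic dataset, and a hand-picked grid of \(\lambda\) values. None of these specifics come from the text above, and the function names are hypothetical.

```python
import random


def fit_ridge(data, lam):
    """Closed-form minimizer of (1/m) * sum((theta*x - y)^2) + lam * theta^2."""
    m = len(data)
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + m * lam)


def mean_loss(theta, data):
    """Unregularized average squared-error loss, used for scoring."""
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)


def grid_search(train, val, lambdas):
    """Fit one model per lambda; return all results and the one with lowest validation loss."""
    results = []
    for lam in lambdas:
        theta = fit_ridge(train, lam)
        results.append((lam, theta, mean_loss(theta, train), mean_loss(theta, val)))
    best = min(results, key=lambda r: r[3])  # select on validation loss only
    return results, best


# Small, noisy synthetic dataset (an assumption for this sketch): y = 2x + noise.
rng = random.Random(1)
data = []
for _ in range(60):
    x = rng.uniform(-3.0, 3.0)
    data.append((x, 2.0 * x + rng.gauss(0.0, 2.0)))
train, val = data[:20], data[20:]

results, best = grid_search(train, val, lambdas=[0.0, 0.01, 0.1, 1.0, 10.0])
for lam, theta, train_cost, val_cost in results:
    print(f"lambda={lam:5.2f}  theta={theta:.3f}  train={train_cost:.3f}  val={val_cost:.3f}")
print("selected lambda:", best[0])
```

Increasing \(\lambda\) shrinks \(\theta\) toward zero and can only raise the training loss, so the training loss alone would always favor \(\lambda = 0\). That is why the selection is made on the validation loss.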

References and Further Readings#