\[ \newcommand{\F}{\mathbb{F}} \newcommand{\R}{\mathbb{R}} \newcommand{\v}{\mathbf{v}} \newcommand{\a}{\mathbf{a}} \newcommand{\b}{\mathbf{b}} \newcommand{\c}{\mathbf{c}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\yhat}{\mathbf{\hat{y}}} \newcommand{\0}{\mathbf{0}} \newcommand{\1}{\mathbf{1}} \]

(Root) Mean Squared Error#

Mean Squared Error is a risk metric corresponding to the expected value of the squared (quadratic) error loss, also known as the \(\ell_2\)-norm loss.

Definition (Root Mean Squared Error)#

Given a dataset of \(n\) samples \((x_i, y_i)\), \(i = 1, \ldots, n\), where \(\hat{y}_i\) denotes the predicted value for the \(i\)-th sample, the mean squared error (MSE) is defined as:

\[ \textbf{MSE} = \dfrac{\sum_{i=1}^n \left(\hat{y}_i - y_i\right)^2}{n} \]

and the root mean squared error (RMSE) is its square root:

\[ \textbf{RMSE} = \sqrt{\dfrac{\sum_{i=1}^n \left(\hat{y}_i - y_i\right)^2}{n}} \]
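
As a quick worked example, take the four-sample data used in the implementation later in this section, \(y = (3, -0.5, 2, 7)\) and \(\hat{y} = (2.5, 0, 2, 8)\). The differences \(\hat{y}_i - y_i\) are \(-0.5, 0.5, 0, 1\), so:

\[ \textbf{MSE} = \dfrac{(-0.5)^2 + (0.5)^2 + 0^2 + 1^2}{4} = \dfrac{1.5}{4} = 0.375, \qquad \textbf{RMSE} = \sqrt{0.375} \approx 0.612 \]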

Estimator#

The MSE of an estimator \(\hat{\theta}\) with respect to an unknown parameter \(\theta\) is defined as:

\[ \textbf{MSE}(\hat{\theta}) = E_{\theta}\left[\left(\hat{\theta} - \theta \right)^2 \right] \]

The MSE can also be written as the sum of the variance and the squared bias of the estimator, so for an unbiased estimator the MSE reduces to the variance:

\[ \textbf{MSE}(\hat{\theta}) = \textbf{Var}_{\theta}\left(\hat{\theta} \right) + \textbf{Bias}\left(\hat{\theta}, \theta \right)^2 \]

A proof can be found in the Wikipedia article Mean_squared_error.
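
The decomposition can also be checked numerically. The sketch below is an illustration only: it assumes a normal population with a known variance and uses the divide-by-\(n\) variance estimator, which is biased, as \(\hat{\theta}\). The helper `decomposition_check` and the names `true_var`, `n` and `n_trials` are made up for this example.

```python
import numpy as np


def decomposition_check(n: int = 10, n_trials: int = 50_000, seed: int = 0):
    """Return (mse, variance, bias) of the divide-by-n variance estimator."""
    rng = np.random.default_rng(seed)
    true_var = 4.0  # assumed population variance, playing the role of theta

    # Each row is one sample of size n and yields one estimate theta_hat.
    samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(n_trials, n))
    theta_hat = samples.var(axis=1, ddof=0)  # ddof=0 divides by n (biased)

    mse = np.mean((theta_hat - true_var) ** 2)  # empirical E[(theta_hat - theta)^2]
    variance = theta_hat.var(ddof=0)            # empirical Var(theta_hat)
    bias = theta_hat.mean() - true_var          # empirical E[theta_hat] - theta
    return float(mse), float(variance), float(bias)


mse, variance, bias = decomposition_check()
print(f"MSE          = {mse:.4f}")
print(f"Var + Bias^2 = {variance + bias ** 2:.4f}")
print(f"Bias         = {bias:.4f}")
```

Because empirical moments are used on both sides, the first two printed values agree up to floating-point error. The bias comes out negative, reflecting that the divide-by-\(n\) estimator underestimates the variance on average.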

Theorem (Optimality)#

Among all constant predictions, the mean is the one that minimizes the mean squared error.

Proof: https://math.stackexchange.com/questions/2554243/understanding-the-mean-minimizes-the-mean-squared-error
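
A short sketch of the argument, for the sample version of the statement in which a single constant \(c\) is used as the prediction for every \(y_i\) (the symbols \(c\) and \(\bar{y}\) are introduced here for illustration):

\[ \dfrac{d}{dc} \left[ \dfrac{1}{n}\sum_{i=1}^n \left(y_i - c\right)^2 \right] = -\dfrac{2}{n}\sum_{i=1}^n \left(y_i - c\right) = 0 \iff c = \dfrac{1}{n}\sum_{i=1}^n y_i = \bar{y} \]

The second derivative with respect to \(c\) equals \(2 > 0\), so this stationary point is the unique minimizer.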

Implementation of MSE#

import numpy as np


def mean_squared_error_(
    y_true: np.ndarray, y_pred: np.ndarray, squared: bool = True
) -> float:
    """Mean squared error regression loss.

    Args:
        y_true (np.ndarray): Ground truth (correct) target values.
        y_pred (np.ndarray): Estimated target values.
        squared (bool): If True, returns MSE; if False, returns RMSE.

    Shape:
        y_true: (n_samples, )
        y_pred: (n_samples, )

    Returns:
        loss (float): The MSE if ``squared`` is True, otherwise the RMSE.

    Examples:
        >>> y_true = [3, -0.5, 2, 7]
        >>> y_pred = [2.5, 0.0, 2, 8]
        >>> mean_squared_error_(y_true, y_pred)
        0.375
        >>> mean_squared_error_(y_true, y_pred, squared=False)
        0.6123724356957945
    """
    # Accept lists as well as arrays, and compute in floating point.
    y_true = np.asarray(y_true, dtype=float).flatten()
    y_pred = np.asarray(y_pred, dtype=float).flatten()

    # Guard against silent broadcasting when the lengths differ.
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same number of samples.")

    loss = np.mean((y_true - y_pred) ** 2)

    # Cast to a built-in float so the return type matches the annotation.
    return float(loss if squared else np.sqrt(loss))

>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> mean_squared_error_(y_true, y_pred)
0.375
>>> mean_squared_error_(y_true, y_pred, squared=False)
0.6123724356957945

Probabilistic Interpretation#

We can also understand regression metrics through the lens of statistics. For further reading, one should understand the below topics:

In particular, one should have a basic knowledge of empirical risk minimization: MSE can be understood as the empirical risk (the average loss on an observed data set), which serves as an estimate of the true MSE, where the true risk refers to the average loss over the actual population distribution [1].
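
A small simulation can make the distinction concrete. This is only a sketch under stated assumptions: the data-generating process \(y = 2x + \varepsilon\) with unit-variance noise, the predictor \(f(x) = 2x\), and the helper `empirical_mse` are all hypothetical choices made for illustration. Under these assumptions the true risk of the predictor is the noise variance, 1.

```python
import numpy as np


def empirical_mse(n_samples: int, rng: np.random.Generator) -> float:
    """Average squared loss of the fixed predictor f(x) = 2x on a fresh sample."""
    x = rng.uniform(-1.0, 1.0, size=n_samples)
    noise = rng.normal(loc=0.0, scale=1.0, size=n_samples)  # noise variance = 1
    y = 2.0 * x + noise   # assumed data-generating process
    y_pred = 2.0 * x      # predictor equal to the true regression function
    return float(np.mean((y - y_pred) ** 2))


rng = np.random.default_rng(0)
small_sample_risk = empirical_mse(50, rng)
large_sample_risk = empirical_mse(200_000, rng)

print(f"empirical risk, n = 50:      {small_sample_risk:.3f}")
print(f"empirical risk, n = 200000:  {large_sample_risk:.3f}")
```

The empirical risk on the small sample can deviate noticeably from 1, while the value computed on the large sample should sit close to the true risk.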

MAE vs (R)MSE#

For convenience, we compare MAE with MSE and mention RMSE only for some of its special properties.

Robustness to Outliers#

A common rule of thumb says that MSE penalizes large errors more heavily than MAE does. Let us see a simple example:

  • \(y = 10\)

  • \(\hat{y}_{1} = 15\)

  • \(\hat{y}_{2} = 20\)

where \(\hat{y}_{1}\) and \(\hat{y}_{2}\) are both predictions made on the ground truth \(y = 10\).

Then we easily see that:

  • \(\textbf{MAE}(y, \hat{y}_{1}) = 5\)

  • \(\textbf{MAE}(y, \hat{y}_{2}) = 10\)

  • \(\textbf{MSE}(y, \hat{y}_{1}) = 25\)

  • \(\textbf{MSE}(y, \hat{y}_{2}) = 100\)

We note that \(\hat{y}_{1}\) is off by 5 and \(\hat{y}_{2}\) is off by 10 from the ground truth. Comparing the two, \(\hat{y}_{2}\) is off by exactly twice as much as \(\hat{y}_{1}\).

Now, if we use MAE, then by definition the loss of \(\hat{y}_{2}\) is also exactly twice that of \(\hat{y}_{1}\), but for MSE it is four times as large. We can then naively conclude: if being off by 10 is exactly twice as bad as being off by 5, one should use MAE; if you foresee that being off by 10 is more than twice as bad as being off by 5, then MSE is the better choice. [2]
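
The ratios above can be reproduced in a few lines. The helpers `absolute_error` and `squared_error` are hypothetical names used only for this sketch; with a single sample, MAE and MSE reduce to these per-sample losses.

```python
def absolute_error(y: float, y_hat: float) -> float:
    """Per-sample absolute error, which equals MAE for a single sample."""
    return abs(y - y_hat)


def squared_error(y: float, y_hat: float) -> float:
    """Per-sample squared error, which equals MSE for a single sample."""
    return (y - y_hat) ** 2


y = 10.0
y_hat_1, y_hat_2 = 15.0, 20.0

# How much worse is the second prediction than the first under each metric?
mae_ratio = absolute_error(y, y_hat_2) / absolute_error(y, y_hat_1)
mse_ratio = squared_error(y, y_hat_2) / squared_error(y, y_hat_1)

print(mae_ratio)  # 2.0
print(mse_ratio)  # 4.0
```

Doubling the error doubles the absolute loss but quadruples the squared loss, which is why a few large errors can dominate an MSE average.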

To put things in perspective, if you are predicting (.. fill in a good example).

Ease of Interpretation#

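MAE clearly wins here, as its interpretation is simple: it is expressed in the same units as the target variable, whereas MSE is in squared units. RMSE restores the original units, which is one of the properties that makes it worth reporting alongside MSE.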

References#