Concept#

Joint Expectation#

Definition 67 (Joint Expectation)

Let \(X\) and \(Y\) be two random variables with sample spaces \(\Omega_X\) and \(\Omega_Y\), respectively.

The joint expectation is

\[ \mathbb{E}[X Y]=\sum_{y \in \Omega_Y} \sum_{x \in \Omega_X} x y \cdot p_{X, Y}(x, y) \]

if \(X\) and \(Y\) are discrete, or

\[ \mathbb{E}[X Y]=\int_{y \in \Omega_Y} \int_{x \in \Omega_X} x y \cdot f_{X, Y}(x, y) d x d y \]

if \(X\) and \(Y\) are continuous. Joint expectation is also called correlation.
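To make the discrete formula concrete, here is a minimal sketch that tabulates a small joint PMF \(p_{X,Y}(x, y)\) (the probability values are made up purely for illustration) and evaluates the double sum for \(\mathbb{E}[XY]\):

import numpy as np

# Hypothetical joint PMF p_{X,Y}(x, y) on a 3 x 2 grid; rows index x, columns index y.
x_vals = np.array([0, 1, 2])
y_vals = np.array([0, 1])
p_xy = np.array([
    [0.10, 0.20],
    [0.25, 0.15],
    [0.20, 0.10],
])
assert np.isclose(p_xy.sum(), 1.0)  # a valid PMF sums to 1

# E[XY] = sum over x and y of x * y * p_{X,Y}(x, y)
joint_expectation = sum(
    x * y * p_xy[i, j]
    for i, x in enumerate(x_vals)
    for j, y in enumerate(y_vals)
)
print(joint_expectation)  # 1*1*0.15 + 2*1*0.10 = 0.35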

Definition 68 (Cosine Dot Product)

Let \(\boldsymbol{x} \in \mathbb{R}^N\) and \(\boldsymbol{y} \in \mathbb{R}^N\) be two vectors.

The cosine angle \(\cos \theta\) can be defined as

\[ \cos \theta=\frac{\boldsymbol{x}^T \boldsymbol{y}}{\|\boldsymbol{x}\|\|\boldsymbol{y}\|}, \]

where \(\|\boldsymbol{x}\|=\sqrt{\sum_{i=1}^N x_i^2}\) is the norm of the vector \(\boldsymbol{x}\), and \(\|\boldsymbol{y}\|=\sqrt{\sum_{i=1}^N y_i^2}\) is the norm of the vector \(\boldsymbol{y}\).

The inner product \(\boldsymbol{x}^T \boldsymbol{y}\) defines the degree of similarity/correlation between two vectors \(\boldsymbol{x}\) and \(\boldsymbol{y}\), where the cosine angle \(\cos \theta\) is the cosine of the angle between the two vectors \(\boldsymbol{x}\) and \(\boldsymbol{y}\).
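For concreteness, a minimal NumPy sketch of the cosine angle between two (arbitrarily chosen) vectors:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])  # y is a positive multiple of x, so the angle between them is 0

# cos(theta) = x^T y / (||x|| ||y||)
cos_theta = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
print(cos_theta)  # 1.0 up to floating-point error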

Theorem 20 (Cauchy-Schwarz Inequality)

Let \(X\) and \(Y\) be two random variables with sample spaces \(\Omega_X\) and \(\Omega_Y\), respectively.

Then we have,

\[ \mathbb{E}[XY]^2 \leq \mathbb{E}[X^2] \, \mathbb{E}[Y^2] \]

We can then view the joint expectation as the analogue of the dot product between the two random variables. See [Chan, 2021], section 5.2.1, pages 259-261.
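As a sanity check (not a proof), we can estimate both sides of the inequality by Monte Carlo; the pair of random variables below is an arbitrary illustrative choice:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)             # X ~ N(0, 1)
y = 0.7 * x + rng.normal(size=100_000)   # Y is correlated with X

lhs = np.mean(x * y) ** 2                # estimate of E[XY]^2
rhs = np.mean(x ** 2) * np.mean(y ** 2)  # estimate of E[X^2] E[Y^2]
print(lhs <= rhs)  # True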

Covariance and Correlation Coefficient#

Definition 69 (Covariance)

Let \(X\) and \(Y\) be two random variables with sample spaces \(\Omega_X\) and \(\Omega_Y\), respectively.

Then the covariance of \(X\) and \(Y\) is defined as,

\[ \operatorname{Cov}(X, Y)=\mathbb{E}[(X-\mu_X)(Y-\mu_Y)] \]

where \(\mu_X=\mathbb{E}[X]\) and \(\mu_Y=\mathbb{E}[Y]\) are the means of \(X\) and \(Y\) respectively.

Note that if \(X = Y\), then \(\operatorname{Cov}(X, Y)\) reduces to the variance of \(X\); the covariance is therefore a generalization of the variance.
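A minimal numerical sketch of this reduction, on an arbitrary simulated sample, compares \(\operatorname{Cov}(X, X)\) computed from the definition with the (biased) sample variance:

import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=1.0, scale=2.0, size=100_000)  # arbitrary Gaussian sample

cov_xx = np.mean((x - x.mean()) * (x - x.mean()))  # Cov(X, X) from the definition
var_x = np.var(x)                                  # biased sample variance of X
print(np.isclose(cov_xx, var_x))  # True: Cov(X, X) = Var(X)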

Theorem 21 (Covariance)

Let \(X\) and \(Y\) be two random variables with sample spaces \(\Omega_X\) and \(\Omega_Y\), respectively.

Then we have,

\[ \operatorname{Cov}(X, Y)=\mathbb{E}[X Y]-\mathbb{E}[X] \mathbb{E}[Y] \]

Proof. The proof is relatively straightforward; we simply expand the definition in Definition 69:

\[\begin{split} \begin{aligned} \operatorname{Cov}(X, Y) &=\mathbb{E}[(X-\mu_X)(Y-\mu_Y)] \\ &=\mathbb{E}[XY - X \mu_Y - Y \mu_X + \mu_X \mu_Y] \\ &=\mathbb{E}[XY] - \mathbb{E}[X \mu_Y] - \mathbb{E}[Y \mu_X] + \mathbb{E}[\mu_X \mu_Y] \\ &=\mathbb{E}[XY] - \mathbb{E}[X] \mathbb{E}[Y] - \mathbb{E}[Y] \mathbb{E}[X] + \mathbb{E}[\mu_X] \mathbb{E}[\mu_Y] \\ &=\mathbb{E}[XY] - \mathbb{E}[X] \mathbb{E}[Y] - \mathbb{E}[X] \mathbb{E}[Y] + \mathbb{E}[X] \mathbb{E}[Y] \\ &=\mathbb{E}[XY] - \mathbb{E}[X] \mathbb{E}[Y] \end{aligned} \end{split}\]
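The identity can also be checked numerically on simulated data (the particular pair below is arbitrary); the same algebra holds for sample averages, so both sides agree up to floating-point error:

import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=200_000)
y = 0.5 * x + rng.normal(size=200_000)

lhs = np.mean((x - x.mean()) * (y - y.mean()))  # Cov(X, Y) from Definition 69
rhs = np.mean(x * y) - x.mean() * y.mean()      # E[XY] - E[X] E[Y]
print(np.isclose(lhs, rhs))  # True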

Theorem 22 (Expectation and Variance of a Linear Combination)

Let \(X\) and \(Y\) be two random variables with sample spaces \(\Omega_X\) and \(\Omega_Y\), respectively.

Then we have,

\[ \mathbb{E}\left[\alpha X + \beta Y\right]=\alpha \mathbb{E}[X] + \beta \mathbb{E}[Y] \]

where \(\alpha\) and \(\beta\) are constants.

And,

\[ \operatorname{Var}(\alpha X + \beta Y)=\alpha^2 \operatorname{Var}(X) + \beta^2 \operatorname{Var}(Y) + 2 \alpha \beta \operatorname{Cov}(X, Y) \]
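A quick numerical check of the variance identity on simulated data (the constants and distributions below are arbitrary); the biased (\(1/N\)) covariance is used so that the identity also holds exactly at the sample level:

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500_000)
y = 0.3 * x + rng.normal(size=500_000)  # a correlated pair
alpha, beta = 2.0, -3.0                 # arbitrary constants

lhs = np.var(alpha * x + beta * y)
rhs = (alpha ** 2 * np.var(x) + beta ** 2 * np.var(y)
       + 2 * alpha * beta * np.cov(x, y, bias=True)[0, 1])
print(np.allclose(lhs, rhs))  # True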

Property 23 (Covariance)

For any two random variables \(X\) and \(Y\) with sample spaces \(\Omega_X\) and \(\Omega_Y\) respectively, we have the following properties:

  1. \(\operatorname{Cov}(X, Y)=\operatorname{Cov}(Y, X)\)

  2. \(\operatorname{Cov}(X, Y)=0\) if \(X\) and \(Y\) are independent

  3. \(\operatorname{Cov}(X, X)=\operatorname{Var}(X)\)

Having defined the covariance, we can now formally define the correlation coefficient of \(X\) and \(Y\) below. We can treat the correlation coefficient \(\rho\) as the cosine of the angle between the centered random variables \(X - \mu_X\) and \(Y - \mu_Y\) [Chan, 2021]. To fully appreciate why the correlation coefficient is defined as this cosine, see the derivation in [Chan, 2021], section 5.2.1, pages 259-261.

Definition 70 (Correlation Coefficient)

Let \(X\) and \(Y\) be two random variables with sample spaces \(\Omega_X\) and \(\Omega_Y\), respectively.

Then the correlation coefficient of \(X\) and \(Y\) is defined as,

\[\begin{split} \begin{aligned} \rho(X, Y) &= \cos \theta \\ &= \frac{\mathbb{E}[(X-\mu_X)(Y-\mu_Y)]}{\sqrt{\mathbb{E}[(X-\mu_X)^2] \mathbb{E}[(Y-\mu_Y)^2]}} \\ &= \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X) \operatorname{Var}(Y)}} \end{aligned} \end{split}\]
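A minimal sketch that computes \(\rho\) from the covariance and the variances and cross-checks against np.corrcoef; the simulated pair below is arbitrary, with true correlation \(2/\sqrt{5} \approx 0.894\):

import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=100_000)
y = 2.0 * x + rng.normal(size=100_000)  # Cov(X, Y) = 2, Var(X) = 1, Var(Y) = 5

cov_xy = np.cov(x, y, bias=True)[0, 1]
rho = cov_xy / np.sqrt(np.var(x) * np.var(y))  # Definition 70
print(rho)                                     # roughly 0.894
print(np.corrcoef(x, y)[0, 1])                 # same value from NumPy directly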

Property 24 (Correlation Coefficient)

  1. \(-1 \leq \rho(X, Y) \leq 1\), an immediate consequence of the definition of cosine angle.

  2. If \(\rho(X, Y)=1\), then \(X\) and \(Y\) are perfectly positively correlated, in other words, \(Y = \alpha X + \beta\) for some constants \(\alpha\) and \(\beta\), \(\alpha > 0\).

  3. If \(\rho(X, Y)=-1\), then \(X\) and \(Y\) are perfectly negatively correlated, in other words, \(Y = \alpha X + \beta\) for some constants \(\alpha\) and \(\beta\), \(\alpha < 0\).

  4. If \(\rho(X, Y)=0\), then \(X\) and \(Y\) are uncorrelated; in linear algebra terms, the centered random variables \(X - \mu_X\) and \(Y - \mu_Y\) are orthogonal. Note that being uncorrelated does not, by itself, imply independence.

  5. \(\rho(\alpha X + \beta, \gamma Y + \delta) = \rho(X, Y)\) for constants \(\alpha > 0\) and \(\gamma > 0\) (and any \(\beta\), \(\delta\)); if \(\alpha \gamma < 0\), the magnitude is unchanged but the sign flips. A numerical check of this property follows the list.
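Here is the numerical check of property 5 mentioned above, a minimal sketch with arbitrary positive scaling constants and simulated data:

import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=100_000)
y = 0.8 * x + rng.normal(size=100_000)

rho_xy = np.corrcoef(x, y)[0, 1]
rho_affine = np.corrcoef(3.0 * x + 1.0, 0.5 * y - 2.0)[0, 1]  # alpha = 3 > 0, gamma = 0.5 > 0
print(np.isclose(rho_xy, rho_affine))  # True: rho is invariant to positive affine maps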

Independence and Correlation Coefficient#

Theorem 23 (Independence and Joint Expectation)

Let \(X\) and \(Y\) be two independent random variables with sample spaces \(\Omega_X\) and \(\Omega_Y\), respectively.

Then we have,

\[ \mathbb{E}[XY] = \mathbb{E}[X] \, \mathbb{E}[Y] \]

Theorem 24 (Independence and Covariance)

Let \(X\) and \(Y\) be two random variables with sample spaces \(\Omega_X\) and \(\Omega_Y\), respectively.

Let the following two statements be:

  1. \(X\) and \(Y\) are independent;

  2. \(\operatorname{Cov}(X, Y)=0\).

Then statement 1 implies statement 2, but statement 2 does not imply statement 1. Independence is therefore a stronger condition than being uncorrelated [Chan, 2021]. A classic counterexample is sketched after the bullets below.

In other words:

  • Independence \(\implies\) Uncorrelated;

  • Uncorrelated \(\not\implies\) Independence.
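The classic counterexample, sketched below with simulated data: for \(X\) standard normal and \(Y = X^2\), the two are uncorrelated (in the limit) even though \(Y\) is a deterministic function of \(X\), hence clearly not independent:

import numpy as np

rng = np.random.default_rng(11)
x = rng.normal(size=500_000)   # X ~ N(0, 1)
y = x ** 2                     # Y is completely determined by X

print(np.corrcoef(x, y)[0, 1])  # approximately 0: uncorrelated but not independent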

Empirical (Sample) Correlation Coefficient#

Everything defined previously is at the population level, but we can also estimate the correlation coefficient from a sample.

Theorem 25 (Empirical Correlation Coefficient)

Given a dataset of \(N\) samples with a single feature (\(D=1\)) and a target variable \(Y\),

\[ \mathcal{S}_{\{x, y\}} \overset{\mathbf{def}}{=} \left\{ \left(x^{(1)}, y^{(1)}\right), \ldots, \left(x^{(N)}, y^{(N)}\right) \right\} \]

where \(x^{(n)}\) is the \(n\)-th sample and \(y^{(n)}\) is the \(n\)-th target variable.

Then the empirical correlation coefficient of \(X\) and \(Y\) is defined as,

\[ \hat{\rho}\left(\mathcal{S}_{\{x, y\}}\right) = \frac{\sum_{n=1}^N (x^{(n)} - \bar{x})(y^{(n)} - \bar{y})}{\sqrt{\sum_{n=1}^N (x^{(n)} - \bar{x})^2 \sum_{n=1}^N (y^{(n)} - \bar{y})^2}} \]

where \(\bar{x} = \frac{1}{N} \sum_{n=1}^N x^{(n)}\) and \(\bar{y} = \frac{1}{N} \sum_{n=1}^N y^{(n)}\) are the sample means of \(X\) and \(Y\) respectively.

As \(N \rightarrow \infty\), \(\hat{\rho}\left(\mathcal{S}_{\{x, y\}}\right) \rightarrow \rho(X, Y)\).
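A minimal sketch that computes \(\hat{\rho}\) directly from the formula and compares it with scipy.stats.pearsonr (the simulated data below are arbitrary):

import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(2023)
x = rng.normal(size=1_000)
y = 0.6 * x + rng.normal(size=1_000)

# Empirical correlation coefficient from the definition.
x_bar, y_bar = x.mean(), y.mean()
rho_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sqrt(
    np.sum((x - x_bar) ** 2) * np.sum((y - y_bar) ** 2)
)
print(rho_hat)

# The same quantity from scipy.
r, _ = stats.pearsonr(x, y)
print(r)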

To generate some plots of correlated data, we briefly introduce the covariance matrix in its \(2 \times 2\) form; the same construction scales to higher dimensions, which we will cover later.

Definition 71 (Covariance Matrix (2D))

Let \(X\) and \(Y\) be two random variables with sample spaces \(\Omega_X\) and \(\Omega_Y\), respectively, and collect them into the random vector \(\mathbf{X} = [X, Y]^T\).

Then the covariance matrix of \(\mathbf{X}\) is defined as,

\[\begin{split} \begin{aligned} \boldsymbol{\Sigma} &\overset{\mathrm{def}}{=} \operatorname{Cov}(\mathbf{X}) \\ &= \begin{bmatrix} \operatorname{Cov}(X, X) & \operatorname{Cov}(X, Y) \\ \operatorname{Cov}(Y, X) & \operatorname{Cov}(Y, Y) \end{bmatrix} \end{aligned} \end{split}\]
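In practice, the empirical \(2 \times 2\) covariance matrix can be obtained with np.cov; below is a minimal sketch with an arbitrary simulated pair whose population covariance matrix is \(\begin{bmatrix} 1 & 0.5 \\ 0.5 & 1.25 \end{bmatrix}\):

import numpy as np

rng = np.random.default_rng(9)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)

# Rows are variables, columns are observations; np.cov returns the 2 x 2 matrix.
cov_matrix = np.cov(np.vstack([x, y]))
print(cov_matrix)  # diagonal ~ [1, 1.25], off-diagonal ~ 0.5 (and symmetric)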

In addition, we define the 2D Gaussian distribution, characterized by its mean vector \(\boldsymbol{\mu}\) and covariance matrix \(\boldsymbol{\Sigma}\).

Definition 72 (Multivariate Gaussian Distribution (2D))

Let \(\boldsymbol{\mu}\) and \(\boldsymbol{\Sigma}\) be the mean vector and covariance matrix of a 2D Gaussian distribution respectively.

Then the probability density function of the 2D Gaussian distribution \(\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\) is defined as,

\[ f_{\mathbf{X}}(\mathbf{x}) = \frac{1}{(2 \pi)^{D/2} \sqrt{\det{\boldsymbol{\Sigma}}}} \exp \left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right) \]

where \(D = 2\) is the dimension of \(\mathbf{x}\).
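As a sanity check of the density formula, we can evaluate it at a single point by hand and compare it with scipy.stats.multivariate_normal.pdf; the mean, covariance, and query point below are arbitrary illustrative values:

import numpy as np
import scipy.stats as stats

mu = np.array([0.0, 0.0])
sigma = np.array([[2.0, 1.0], [1.0, 2.0]])
point = np.array([0.5, -1.0])
D = 2

# Density from the formula in Definition 72.
diff = point - mu
pdf_manual = np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / (
    (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(sigma))
)

# Density from scipy for comparison.
pdf_scipy = stats.multivariate_normal.pdf(point, mean=mu, cov=sigma)
print(np.isclose(pdf_manual, pdf_scipy))  # True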

We then generate some data from a 2D Gaussian distribution with the following parameters:

  • A bivariate Gaussian distribution with a correlation coefficient of \(\rho = 0\) can be generated by mean vector \(\boldsymbol{\mu} = [0, 0]^T\) and covariance matrix \(\boldsymbol{\Sigma} = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}\).

  • A bivariate Gaussian distribution with a correlation coefficient of \(\rho = 0.5\) can be generated by mean vector \(\boldsymbol{\mu} = [0, 0]^T\) and covariance matrix \(\boldsymbol{\Sigma} = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}\).

  • A bivariate Gaussian distribution with a correlation coefficient of \(\rho = 1\) can be generated by mean vector \(\boldsymbol{\mu} = [0, 0]^T\) and covariance matrix \(\boldsymbol{\Sigma} = \begin{bmatrix} 2 & 2 \\ 2 & 2 \end{bmatrix}\).

import sys
from pathlib import Path

# Make the project's utils module importable (project-specific path setup).
parent_dir = str(Path().resolve().parents[1])
sys.path.append(parent_dir)

from utils import use_svg_display

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

use_svg_display()

# Common mean vector and the three covariance matrices from the bullets above
# (true rho = 0, 0.5, and 1 respectively).
mean = [0, 0]
sigma_1 = [[2, 0], [0, 2]]
sigma_2 = [[2, 1], [1, 2]]
sigma_3 = [[2, 2], [2, 2]]

# Draw 1000 samples from each bivariate Gaussian.
x1 = stats.multivariate_normal.rvs(mean, sigma_1, 1000)
x2 = stats.multivariate_normal.rvs(mean, sigma_2, 1000)
x3 = stats.multivariate_normal.rvs(mean, sigma_3, 1000)

# Empirical (sample) correlation coefficient of each sample.
rho_1, _ = stats.pearsonr(x1[:, 0], x1[:, 1])
rho_2, _ = stats.pearsonr(x2[:, 0], x2[:, 1])
rho_3, _ = stats.pearsonr(x3[:, 0], x3[:, 1])

# Scatter plot of each sample with its empirical correlation in the title.
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, x, rho in zip(axes, [x1, x2, x3], [rho_1, rho_2, rho_3]):
    ax.scatter(x[:, 0], x[:, 1], color="blue", s=10)
    ax.set_title(f"Empirical ρ = {rho:.2f}")
    ax.set_xlim(-5, 5)
    ax.set_ylim(-5, 5)

plt.show()
Figure: side-by-side scatter plots of the three simulated samples, with panel titles reporting the empirical correlation coefficients (approximately 0, 0.5, and 1).

Notice that the empirical correlation coefficient \(\hat{\rho}(\mathcal{S}_{\{x, y\}})\) is close to the true correlation coefficient \(\rho(X, Y)\) when \(N\) is large.