Concept
Joint Expectation
(Joint Expectation)
Let \(X\) and \(Y\) be two random variables with sample space \(\Omega_X\) and \(\Omega_Y\) respectively.
The joint expectation is

\[
\mathbb{E}[XY] = \sum_{y \in \Omega_Y} \sum_{x \in \Omega_X} x y \, p_{X, Y}(x, y)
\]

if \(X\) and \(Y\) are discrete, or

\[
\mathbb{E}[XY] = \int_{\Omega_Y} \int_{\Omega_X} x y \, f_{X, Y}(x, y) \, \mathrm{d}x \, \mathrm{d}y
\]

if \(X\) and \(Y\) are continuous, where \(p_{X, Y}\) is the joint PMF and \(f_{X, Y}\) is the joint PDF. Joint expectation is also called correlation.
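As a quick sanity check of the discrete case, the sketch below (the PMF values are made up for the example) computes \(\mathbb{E}[XY]\) by directly summing \(x y \, p_{X, Y}(x, y)\):

import numpy as np

# A made-up joint PMF over X in {0, 1} (rows) and Y in {0, 1, 2} (columns).
# Entries are non-negative and sum to 1.
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.15, 0.25, 0.20]])
x_vals = np.array([0, 1])
y_vals = np.array([0, 1, 2])

# E[XY] = sum over x and y of x * y * p_{X,Y}(x, y)
joint_expectation = sum(
    x * y * p_xy[i, j]
    for i, x in enumerate(x_vals)
    for j, y in enumerate(y_vals)
)
print(joint_expectation)  # 1*1*0.25 + 1*2*0.20 = 0.65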
(Cosine Dot Product)
Let \(\boldsymbol{x} \in \mathbb{R}^N\) and \(\boldsymbol{y} \in \mathbb{R}^N\) be two vectors.
The cosine angle \(\cos \theta\) between them can be defined as

\[
\cos \theta = \frac{\boldsymbol{x}^T \boldsymbol{y}}{\|\boldsymbol{x}\| \|\boldsymbol{y}\|},
\]
where \(\|\boldsymbol{x}\|=\sqrt{\sum_{i=1}^N x_i^2}\) is the norm of the vector \(\boldsymbol{x}\), and \(\|\boldsymbol{y}\|=\sqrt{\sum_{i=1}^N y_i^2}\) is the norm of the vector \(\boldsymbol{y}\).
The inner product \(\boldsymbol{x}^T \boldsymbol{y}\) defines the degree of similarity/correlation between two vectors \(\boldsymbol{x}\) and \(\boldsymbol{y}\), where the cosine angle \(\cos \theta\) is the cosine of the angle between the two vectors \(\boldsymbol{x}\) and \(\boldsymbol{y}\).
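A minimal numerical sketch of this (the vectors are arbitrary examples), computing the cosine angle from the inner product and the norms defined above:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 1.0, 0.5])

# cos(theta) = x^T y / (||x|| * ||y||)
cos_theta = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
print(cos_theta)  # 1.0: same direction, 0.0: orthogonal, -1.0: opposite direction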
(Cauchy-Schwarz Inequality)
Let \(X\) and \(Y\) be two random variables with sample space \(\Omega_X\) and \(\Omega_Y\) respectively.
Then we have

\[
\mathbb{E}[XY]^2 \leq \mathbb{E}\left[X^2\right] \mathbb{E}\left[Y^2\right].
\]
We can then view the joint expectation as the cosine dot product between the two random variables. See [Chan, 2021] section 5.2.1, page 259-261.
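As an illustration of the inequality (with two arbitrary simulated random variables), we can check \(\mathbb{E}[XY]^2 \leq \mathbb{E}[X^2] \mathbb{E}[Y^2]\) using Monte Carlo estimates of the expectations:

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Two correlated random variables: Y depends linearly on X plus noise.
x = rng.normal(0, 1, n)
y = 0.7 * x + rng.normal(0, 1, n)

lhs = np.mean(x * y) ** 2             # E[XY]^2 (Monte Carlo estimate)
rhs = np.mean(x**2) * np.mean(y**2)   # E[X^2] * E[Y^2]
print(lhs <= rhs)  # True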
Covariance and Correlation Coefficient
(Covariance)
Let \(X\) and \(Y\) be two random variables with sample space \(\Omega_X\) and \(\Omega_Y\) respectively.
Then the covariance of \(X\) and \(Y\) is defined as

\[
\operatorname{Cov}(X, Y) = \mathbb{E}\left[(X - \mu_X)(Y - \mu_Y)\right],
\]
where \(\mu_X=\mathbb{E}[X]\) and \(\mu_Y=\mathbb{E}[Y]\) are the mean of \(X\) and \(Y\) respectively.
Note that if \(X = Y\), then \(\operatorname{Cov}(X, Y)\) reduces to the variance of \(X\). In this sense, the covariance is a generalization of the variance.
(Covariance)
Let \(X\) and \(Y\) be two random variables with sample space \(\Omega_X\) and \(\Omega_Y\) respectively.
Then we have

\[
\operatorname{Cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X] \mathbb{E}[Y].
\]
Proof. The proof is relatively straightforward; we simply expand Definition 69:

\[
\begin{aligned}
\operatorname{Cov}(X, Y) &= \mathbb{E}\left[(X - \mu_X)(Y - \mu_Y)\right] \\
&= \mathbb{E}[XY] - \mu_X \mathbb{E}[Y] - \mu_Y \mathbb{E}[X] + \mu_X \mu_Y \\
&= \mathbb{E}[XY] - \mathbb{E}[X] \mathbb{E}[Y],
\end{aligned}
\]

where the last step uses \(\mu_X = \mathbb{E}[X]\) and \(\mu_Y = \mathbb{E}[Y]\).
(Linearity of Covariance)
Let \(X\) and \(Y\) be two random variables with sample space \(\Omega_X\) and \(\Omega_Y\) respectively.
Then we have

\[
\operatorname{Cov}(\alpha X + \beta, Y) = \alpha \operatorname{Cov}(X, Y),
\]

where \(\alpha\) and \(\beta\) are constants. And,

\[
\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2 \operatorname{Cov}(X, Y).
\]
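A quick Monte Carlo sketch of these two identities (the distributions, the constants \(\alpha = 2\), \(\beta = 3\), and the sample size are arbitrary choices for illustration):

import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000
alpha, beta = 2.0, 3.0

x = rng.normal(1.0, 2.0, n)
y = 0.5 * x + rng.normal(0.0, 1.0, n)  # make X and Y correlated

cov_xy = np.cov(x, y)[0, 1]

# Cov(alpha * X + beta, Y) should be approximately alpha * Cov(X, Y)
print(np.cov(alpha * x + beta, y)[0, 1], alpha * cov_xy)

# Var(X + Y) should be approximately Var(X) + Var(Y) + 2 * Cov(X, Y)
print(np.var(x + y, ddof=1), np.var(x, ddof=1) + np.var(y, ddof=1) + 2 * cov_xy)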
(Covariance)
For any two random variables \(X\) and \(Y\) with sample space \(\Omega_X\) and \(\Omega_Y\) respectively, we have the following properties:
\(\operatorname{Cov}(X, Y)=\operatorname{Cov}(Y, X)\)
\(\operatorname{Cov}(X, Y)=0\) if \(X\) and \(Y\) are independent
\(\operatorname{Cov}(X, X)=\operatorname{Var}(X)\)
After we have defined the covariance, we can formally define the correlation coefficient of \(X\) and \(Y\) below. We can treat the correlation coefficient \(\rho\) as the cosine of the angle between the centered (mean-subtracted) random variables \(X\) and \(Y\) [Chan, 2021]. To fully appreciate why the correlation coefficient is defined as a cosine angle, see the derivation in [Chan, 2021], section 5.2.1, pages 259-261.
(Correlation Coefficient)
Let \(X\) and \(Y\) be two random variables with sample space \(\Omega_X\) and \(\Omega_Y\) respectively.
Then the correlation coefficient of \(X\) and \(Y\) is defined as

\[
\rho(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{\mathbb{E}\left[(X - \mu_X)(Y - \mu_Y)\right]}{\sqrt{\operatorname{Var}(X)} \sqrt{\operatorname{Var}(Y)}},
\]

where \(\sigma_X\) and \(\sigma_Y\) are the standard deviations of \(X\) and \(Y\) respectively.
(Correlation Coefficient)
\(-1 \leq \rho(X, Y) \leq 1\), an immediate consequence of the definition of cosine angle.
If \(\rho(X, Y)=1\), then \(X\) and \(Y\) are perfectly positively correlated, in other words, \(Y = \alpha X + \beta\) for some constants \(\alpha\) and \(\beta\), \(\alpha > 0\).
If \(\rho(X, Y)=-1\), then \(X\) and \(Y\) are perfectly negatively correlated, in other words, \(Y = \alpha X + \beta\) for some constants \(\alpha\) and \(\beta\), \(\alpha < 0\).
If \(\rho(X, Y)=0\), then \(X\) and \(Y\) are uncorrelated; in linear algebra lingo, the centered random variables are orthogonal, meaning there is no linear relationship between \(X\) and \(Y\) (this does not, by itself, imply independence; see the next section).
\(\rho(\alpha X + \beta, \gamma Y + \delta) = \rho(X, Y)\) for constants \(\alpha\), \(\beta\), \(\gamma\), and \(\delta\) with \(\alpha \gamma > 0\); if \(\alpha \gamma < 0\), the sign flips, i.e. \(\rho(\alpha X + \beta, \gamma Y + \delta) = -\rho(X, Y)\). A numerical check follows below.
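A small numerical check of this affine-invariance property (the simulated sample and the constants are arbitrary choices for illustration):

import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(7)
x = rng.normal(size=10_000)
y = 0.8 * x + rng.normal(size=10_000)

rho_xy, _ = stats.pearsonr(x, y)
rho_affine, _ = stats.pearsonr(3.0 * x + 1.0, 5.0 * y - 2.0)    # alpha * gamma > 0
rho_flipped, _ = stats.pearsonr(-3.0 * x + 1.0, 5.0 * y - 2.0)  # alpha * gamma < 0 flips the sign

print(rho_xy, rho_affine, rho_flipped)  # rho_affine ≈ rho_xy, rho_flipped ≈ -rho_xy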
Independence and Correlation Coefficient
(Independence and Joint Expectation)
Let \(X\) and \(Y\) be two random variables with sample space \(\Omega_X\) and \(\Omega_Y\) respectively.
If \(X\) and \(Y\) are independent, then

\[
\mathbb{E}[XY] = \mathbb{E}[X] \mathbb{E}[Y].
\]
(Independence and Covariance)
Let \(X\) and \(Y\) be two random variables with sample space \(\Omega_X\) and \(\Omega_Y\) respectively.
Let the following two statements be:
\(X\) and \(Y\) are independent;
\(\operatorname{Cov}(X, Y)=0\).
Then statement 1 implies statement 2, but statement 2 does not imply statement 1. Independence is therefore a stronger condition than uncorrelatedness [Chan, 2021]; a classic counterexample is sketched after the summary below.
In other words:
Independence \(\implies\) Uncorrelated;
Uncorrelated \(\not\implies\) Independence.
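As the counterexample, let \(X \sim \mathcal{N}(0, 1)\) and \(Y = X^2\). Then \(Y\) is completely determined by \(X\) (so the two are clearly not independent), yet \(\operatorname{Cov}(X, Y) = \mathbb{E}[X^3] - \mathbb{E}[X] \mathbb{E}[X^2] = 0\), so they are uncorrelated. A minimal simulation sketch of this counterexample:

import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = x**2  # Y is a deterministic function of X, hence not independent of X

rho, _ = stats.pearsonr(x, y)
print(rho)  # close to 0: uncorrelated despite the strong (nonlinear) dependence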
Empirical (Sample) Correlation Coefficient
Everything defined previously is a population quantity, but we can also estimate the correlation coefficient from a sample.
(Empirical Correlation Coefficient)
Given a dataset \(\mathcal{S}_{\{x, y\}} = \left\{\left(x^{(n)}, y^{(n)}\right)\right\}_{n=1}^N\) of \(N\) samples with \(D=1\) feature and a target variable \(Y\),
where \(x^{(n)}\) is the \(n\)-th sample and \(y^{(n)}\) is the \(n\)-th target variable.
Then the empirical correlation coefficient of \(X\) and \(Y\) is defined as

\[
\hat{\rho}\left(\mathcal{S}_{\{x, y\}}\right) = \frac{\sum_{n=1}^N \left(x^{(n)} - \bar{x}\right)\left(y^{(n)} - \bar{y}\right)}{\sqrt{\sum_{n=1}^N \left(x^{(n)} - \bar{x}\right)^2} \sqrt{\sum_{n=1}^N \left(y^{(n)} - \bar{y}\right)^2}},
\]
where \(\bar{x} = \frac{1}{N} \sum_{n=1}^N x^{(n)}\) and \(\bar{y} = \frac{1}{N} \sum_{n=1}^N y^{(n)}\) are the sample mean of \(X\) and \(Y\) respectively.
As \(N \rightarrow \infty\), \(\hat{\rho}\left(\mathcal{S}_{\{x, y\}}\right) \rightarrow \rho(X, Y)\).
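A minimal sketch of this formula in code (the helper name empirical_corrcoef is ours, and the simulated data is arbitrary), compared against scipy.stats.pearsonr:

import numpy as np
import scipy.stats as stats

def empirical_corrcoef(x: np.ndarray, y: np.ndarray) -> float:
    """Empirical (sample) correlation coefficient of two 1D arrays."""
    x_bar, y_bar = x.mean(), y.mean()
    numerator = np.sum((x - x_bar) * (y - y_bar))
    denominator = np.sqrt(np.sum((x - x_bar) ** 2)) * np.sqrt(np.sum((y - y_bar) ** 2))
    return numerator / denominator

rng = np.random.default_rng(1)
x = rng.normal(size=1_000)
y = 2.0 * x + rng.normal(size=1_000)

print(empirical_corrcoef(x, y))
print(stats.pearsonr(x, y)[0])  # should agree with the value above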
In order to generate some plots of correlation, we prematurely introduce the covariance matrix as a \(2 \times 2\) matrix; it scales to higher dimensions as well, which we will learn later.
(Covariance Matrix (2D))
Let \(X\) and \(Y\) be two random variables with sample space \(\Omega_X\) and \(\Omega_Y\) respectively.
Then the covariance matrix of \(X\) and \(Y\) is defined as

\[
\boldsymbol{\Sigma} = \begin{bmatrix} \operatorname{Var}(X) & \operatorname{Cov}(X, Y) \\ \operatorname{Cov}(Y, X) & \operatorname{Var}(Y) \end{bmatrix}.
\]
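For a sample, this \(2 \times 2\) matrix can be estimated with np.cov, which places the (sample) variances on the diagonal and the covariances off the diagonal; the simulated data below is just for illustration:

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=5_000)
y = 0.5 * x + rng.normal(size=5_000)

sigma_hat = np.cov(x, y)  # 2x2 matrix: [[Var(X), Cov(X, Y)], [Cov(Y, X), Var(Y)]]
print(sigma_hat)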
In addition, we define a 2D Gaussian distribution characterized by its mean vector \(\boldsymbol{\mu}\) and covariance matrix \(\boldsymbol{\Sigma}\).
(Multivariate Gaussian Distribution (2D))
Let \(\boldsymbol{\mu}\) and \(\boldsymbol{\Sigma}\) be the mean vector and covariance matrix of a 2D Gaussian distribution respectively.
Then the multivariate Gaussian distribution (in 2D) is defined by the density

\[
f_{\boldsymbol{X}}(\boldsymbol{x}) = \frac{1}{2 \pi \sqrt{|\boldsymbol{\Sigma}|}} \exp\left(-\frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x} - \boldsymbol{\mu})\right).
\]
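To connect the formula with code, the sketch below (the evaluation point and the parameters are arbitrary) evaluates the 2D Gaussian density by hand and compares it with scipy.stats.multivariate_normal.pdf:

import numpy as np
import scipy.stats as stats

mu = np.array([0.0, 0.0])
sigma = np.array([[2.0, 1.0], [1.0, 2.0]])
x = np.array([0.5, -1.0])  # point at which to evaluate the density

# Manual evaluation of the 2D Gaussian pdf
diff = x - mu
manual = np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / (
    2 * np.pi * np.sqrt(np.linalg.det(sigma))
)

print(manual)
print(stats.multivariate_normal.pdf(x, mean=mu, cov=sigma))  # should match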
We then generate some data from a 2D Gaussian distribution with the following parameters:
A bivariate Gaussian distribution with a correlation coefficient of \(\rho = 0\) can be generated by mean vector \(\boldsymbol{\mu} = [0, 0]^T\) and covariance matrix \(\boldsymbol{\Sigma} = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}\).
A bivariate Gaussian distribution with a correlation coefficient of \(\rho = 0.5\) can be generated by mean vector \(\boldsymbol{\mu} = [0, 0]^T\) and covariance matrix \(\boldsymbol{\Sigma} = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}\).
A bivariate Gaussian distribution with a correlation coefficient of \(\rho = 1\) can be generated by mean vector \(\boldsymbol{\mu} = [0, 0]^T\) and covariance matrix \(\boldsymbol{\Sigma} = \begin{bmatrix} 2 & 2 \\ 2 & 2 \end{bmatrix}\).
import sys
from pathlib import Path

# Make the project root importable so that `utils` can be found.
parent_dir = str(Path().resolve().parents[1])
sys.path.append(parent_dir)

from utils import use_svg_display

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

use_svg_display()

# Three bivariate Gaussians with the same mean but different covariance matrices,
# corresponding to rho = 0, rho = 0.5, and rho = 1 respectively.
mean = [0, 0]
sigma_1 = [[2, 0], [0, 2]]
sigma_2 = [[2, 1], [1, 2]]
sigma_3 = [[2, 2], [2, 2]]

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Draw 1000 samples from each distribution.
x1 = stats.multivariate_normal.rvs(mean, sigma_1, 1000)
x2 = stats.multivariate_normal.rvs(mean, sigma_2, 1000)
x3 = stats.multivariate_normal.rvs(mean, sigma_3, 1000)

# Empirical (Pearson) correlation coefficient of each sample.
rho_1, _ = stats.pearsonr(x1[:, 0], x1[:, 1])
rho_2, _ = stats.pearsonr(x2[:, 0], x2[:, 1])
rho_3, _ = stats.pearsonr(x3[:, 0], x3[:, 1])

# Scatter plot of each sample with its empirical correlation in the title.
for ax, x, rho in zip(axes, [x1, x2, x3], [rho_1, rho_2, rho_3]):
    ax.scatter(x[:, 0], x[:, 1], color="blue", s=10)
    ax.set_title(f"Empirical ρ = {rho:.2f}")
    ax.set_xlim(-5, 5)
    ax.set_ylim(-5, 5)

plt.show()
Notice that the empirical correlation coefficient \(\hat{\rho}(\mathcal{S}_{\{x, y\}})\) is close to the true correlation coefficient \(\rho(X, Y)\) when \(N\) is large.