
Covariance and Correlation Coefficient

We define the covariance of any two random variables \(X\) and \(Y\), written \(Cov(X,Y)\), as: \[\begin{align} Cov(X,Y) &= E[(X-\mu_X)(Y-\mu_Y)]\\ &= E(XY-X\mu_Y-Y\mu_X+\mu_X\mu_Y)\\ &= E(XY)-\mu_Y E(X)-\mu_X E(Y)+\mu_X\mu_Y\\ &= E(XY) - \mu_X\mu_Y\\ &= E(XY) - E(X)E(Y) \end{align}\] If \(X\) and \(Y\) are independent random variables, then \[\begin{align} E(XY)&=\int\int xy\cdot f_{X,Y}(x,y)\,dx\,dy\\ &=\int\int xy\cdot f_X(x)f_Y(y)\,dx\,dy\\ &=\int x\cdot f_X(x)\,dx\int y\cdot f_Y(y)\,dy\\ &=E(X)E(Y), \end{align}\] so \(Cov(X,Y) = E(XY) - E(X)E(Y)=0\).
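As a quick numerical sanity check (an illustrative addition, not part of the derivation; the distributions and variable names are arbitrary choices), we can verify in NumPy that \(E(XY)-E(X)E(Y)\) matches the usual covariance estimate, and that it is near zero for independently generated samples:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(loc=2.0, scale=1.5, size=n)   # X
y = rng.normal(loc=-1.0, scale=0.5, size=n)  # Y, generated independently of X

# Cov(X, Y) = E(XY) - E(X)E(Y); for independent X, Y this should be near 0
cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)
print(cov_xy)                      # close to 0 (sampling noise only)
print(np.cov(x, y, ddof=1)[0, 1])  # NumPy's sample covariance agrees
```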

The variance of a linear combination \(aX + bY\) of two random variables is: \[\begin{align} Var(aX + bY) &= E[(aX + bY)^2]-(E(aX + bY))^2\\ &=E[(aX + bY)^2]-(a\mu_X+b\mu_Y)^2\\ &=E(a^2X^2+2abXY+b^2Y^2)-a^2\mu_X^2-2ab\mu_X\mu_Y-b^2\mu_Y^2\\ &=a^2(E(X^2)-\mu_X^2)+b^2(E(Y^2)-\mu_Y^2)+2ab(E(XY)-\mu_X\mu_Y)\\ &=a^2Var(X)+b^2Var(Y)+2abCov(X,Y) \end{align}\]
Then the variance of a linear combination of \(n\) random variables \(W_1,W_2,\ldots,W_n\) is: \[Var(\sum_{i=1}^{n}a_iW_i)=\sum_{i=1}^{n}a_i^2Var(W_i)+2\sum_{i<j}a_ia_jCov(W_i,W_j)\]
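A minimal NumPy sketch of the two-variable identity (again an illustrative addition; the correlated pair below is constructed only so the covariance term is nonzero):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
a, b = 2.0, -3.0

# Build a correlated pair (X, Y) so the covariance term actually matters
x = rng.normal(size=n)
y = 0.6 * x + rng.normal(scale=0.8, size=n)

lhs = np.var(a * x + b * y, ddof=1)
rhs = (a**2 * np.var(x, ddof=1)
       + b**2 * np.var(y, ddof=1)
       + 2 * a * b * np.cov(x, y, ddof=1)[0, 1])
print(lhs, rhs)  # the two numbers agree up to sampling noise
```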

The correlation coefficient of any two random variables \(X\) and \(Y\) is denoted by \(\rho(X,Y)\) and given by: \(\rho(X,Y)=\frac{Cov(X,Y)}{\sigma_X\sigma_Y}=Cov(\frac{X-\mu_X}{\sigma_X}, \frac{Y-\mu_Y}{\sigma_Y})\), i.e. the covariance of the standardized variables.
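The three expressions can be checked against each other numerically (illustrative sketch; the simulated data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=50_000)
y = 0.4 * x + rng.normal(scale=1.2, size=50_000)

# rho = Cov(X, Y) / (sigma_X * sigma_Y)
rho = np.cov(x, y, ddof=1)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))

# ... which equals the covariance of the standardized variables
zx = (x - x.mean()) / np.std(x, ddof=1)
zy = (y - y.mean()) / np.std(y, ddof=1)
print(rho, np.cov(zx, zy, ddof=1)[0, 1], np.corrcoef(x, y)[0, 1])  # all agree
```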

For any two random variables \(X_1\) and \(X_2\), if we write \(Cov(X_1,X_2)\) as \(\sigma_{12}\) and \(Var(X_i)\) as \(\sigma_{ii}\), then \[\begin{align} Cov(aX_1, bX_2)&=E[(aX_1-a\mu_1)(bX_2-b\mu_2)]\\ &=abE[(X_1-\mu_1)(X_2-\mu_2)]\\ &=abCov(X_1,X_2)\\ &=ab\sigma_{12} \end{align}\] And \[Var(aX_1 + bX_2)=a^2\sigma_{11}+b^2\sigma_{22}+2ab\sigma_{12}\]

We can use vector and matrix notation to write \(aX_1 + bX_2\) as: \[\mathbf c^T\mathbf X=\begin{bmatrix} a&b \end{bmatrix}\begin{bmatrix} X_1\\ X_2 \end{bmatrix}\] and to denote the variance-covariance matrix as \[\mathbf \Sigma=\begin{bmatrix} \sigma_{11}&\sigma_{12}\\ \sigma_{12}&\sigma_{22} \end{bmatrix}\] Then \[\begin{align} Var(aX_1 + bX_2)&=Var(\mathbf c^T\mathbf X)\\ &=a^2\sigma_{11}+b^2\sigma_{22}+2ab\sigma_{12}\\ &=\mathbf c^T\mathbf \Sigma\mathbf c \end{align}\]
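The quadratic form \(\mathbf c^T\mathbf \Sigma\mathbf c\) is easy to check on simulated data (illustrative addition; the particular pair of variables is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(scale=0.9, size=n)

c = np.array([2.0, -1.0])                    # c = [a, b]
Sigma = np.cov(np.vstack([x1, x2]), ddof=1)  # 2x2 variance-covariance matrix

print(c @ Sigma @ c)                          # c^T Sigma c
print(np.var(c[0] * x1 + c[1] * x2, ddof=1))  # direct variance of a*X1 + b*X2
```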

A vector of \(p\) random variables \[\mathbf X=\begin{bmatrix} X_{1}\\ X_{2}\\ \vdots\\ X_{p}\\ \end{bmatrix}\] has mean vector \[E(\mathbf X)=\boldsymbol\mu_{\mathbf X}=\begin{bmatrix} \mu_{1}\\ \mu_{2}\\ \vdots\\ \mu_{p}\\ \end{bmatrix}\] and covariance matrix \[\mathbf \Sigma_{\mathbf X}=\begin{bmatrix} \sigma_{11}&\sigma_{12}&\cdots&\sigma_{1p}\\ \sigma_{12}&\sigma_{22}&\cdots&\sigma_{2p}\\ \vdots&\vdots&\ddots&\vdots\\ \sigma_{1p}&\sigma_{2p}&\cdots&\sigma_{pp}\\ \end{bmatrix} \] In general, \(q\) linear combinations of the \(p\) random variables \(X_1,\ldots,X_p\): \[\begin{align} Z_1&=c_{11}X_1+c_{12}X_2+\cdots+c_{1p}X_p\\ Z_2&=c_{21}X_1+c_{22}X_2+\cdots+c_{2p}X_p\\ &\vdots\\ Z_q&=c_{q1}X_1+c_{q2}X_2+\cdots+c_{qp}X_p\\ \end{align}\] or \[\mathbf Z=\begin{bmatrix} Z_1\\ Z_2\\ \vdots\\ Z_q \end{bmatrix}=\begin{bmatrix} c_{11}&c_{12}&\cdots&c_{1p}\\ c_{21}&c_{22}&\cdots&c_{2p}\\ \vdots&\vdots&\ddots&\vdots\\ c_{q1}&c_{q2}&\cdots&c_{qp} \end{bmatrix}\begin{bmatrix} X_1\\ X_2\\ \vdots\\ X_p \end{bmatrix}=\mathbf C \mathbf X \] The linear combinations \(\mathbf Z=\mathbf C \mathbf X\) have:
\(\boldsymbol\mu_{\mathbf Z}=E(\mathbf Z)=E(\mathbf C \mathbf X)=\mathbf C \boldsymbol\mu_{\mathbf X}\)
\(\mathbf \Sigma_{\mathbf Z}=Cov(\mathbf Z)=Cov(\mathbf C\mathbf X)=\mathbf C\mathbf \Sigma_{\mathbf X}\mathbf C^T\)
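Both identities can be verified by simulation (illustrative sketch; the chosen \(\boldsymbol\mu_{\mathbf X}\), \(\mathbf\Sigma_{\mathbf X}\), and \(\mathbf C\) below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

# A known 3-variable distribution (p = 3)
mu_X = np.array([1.0, -2.0, 0.5])
Sigma_X = np.array([[2.0, 0.3, 0.1],
                    [0.3, 1.0, -0.2],
                    [0.1, -0.2, 0.5]])
X = rng.multivariate_normal(mu_X, Sigma_X, size=n)   # n x p sample

# q = 2 linear combinations, Z = C X
C = np.array([[1.0, 2.0, -1.0],
              [0.5, 0.0, 3.0]])
Z = X @ C.T                                          # each row is C x_j

print(Z.mean(axis=0), C @ mu_X)                      # mu_Z    vs  C mu_X
print(np.cov(Z.T, ddof=1))                           # Sigma_Z
print(C @ Sigma_X @ C.T)                             # C Sigma_X C^T
```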

If we collect \(n\) sets of measurements on \(p\) variables and treat the measurements as random variables, the random sample can be written as: \[\mathbf X=\begin{bmatrix} X_{11}&X_{12}&\cdots&X_{1p}\\ X_{21}&X_{22}&\cdots&X_{2p}\\ \vdots&\vdots&\ddots&\vdots\\ X_{n1}&X_{n2}&\cdots&X_{np} \end{bmatrix}=\begin{bmatrix} \mathbf X_1^T\\ \mathbf X_2^T\\ \vdots\\ \mathbf X_n^T \end{bmatrix} \] where each row \(\mathbf X_j^T\), the \(j^{th}\) set of measurements on the \(p\) variables, is a random vector, and the rows represent independent observations from a common joint distribution with density function \(f(\mathbf x)=f(x_1,x_2,\ldots,x_p)\). The sample mean vector is then \(\mathbf {\overline X}=\frac{1}{n}\displaystyle\sum_{j=1}^n\mathbf X_j\)
\[\begin{align} E(\mathbf {\overline X})&=E(\frac{1}{n}\displaystyle\sum_{j=1}^n\mathbf X_j)\\ &=E(\frac{1}{n}\mathbf X_1)+\ldots +E(\frac{1}{n}\mathbf X_n)\\ &=\frac{1}{n}\boldsymbol\mu+\ldots +\frac{1}{n}\boldsymbol\mu\\ &=\boldsymbol\mu \end{align}\] where \(\boldsymbol\mu\) is the population mean vector, so \(\mathbf {\overline X}\) is an unbiased estimator of \(\boldsymbol\mu\).
\[\begin{align} (\mathbf {\overline X}-\boldsymbol\mu)(\mathbf {\overline X}-\boldsymbol\mu)^T&=(\frac{1}{n}\displaystyle\sum_{j=1}^n\mathbf X_j-\boldsymbol\mu)(\frac{1}{n}\displaystyle\sum_{j=1}^n\mathbf X_j-\boldsymbol\mu)^T\\ &=\Biggl(\frac{1}{n}\displaystyle\sum_{j=1}^n(\mathbf X_j-\boldsymbol\mu)\Biggr)\Biggl(\frac{1}{n}\displaystyle\sum_{k=1}^n(\mathbf X_k-\boldsymbol\mu)\Biggr)^T\\ &=\frac{1}{n^2}\sum_{j=1}^n\sum_{k=1}^n(\mathbf X_j-\boldsymbol\mu)(\mathbf X_k-\boldsymbol\mu)^T\\ \end{align}\] Taking expectations, \[Cov(\mathbf {\overline X})=E\Bigl((\mathbf {\overline X}-\boldsymbol\mu)(\mathbf {\overline X}-\boldsymbol\mu)^T\Bigr)=\frac{1}{n^2}\sum_{j=1}^n\sum_{k=1}^nE\Bigl((\mathbf X_j-\boldsymbol\mu)(\mathbf X_k-\boldsymbol\mu)^T\Bigr) \] For \(j\ne k\), \(\mathbf X_j\) and \(\mathbf X_k\) are independent, so \(E\Bigl((\mathbf X_j-\boldsymbol\mu)(\mathbf X_k-\boldsymbol\mu)^T\Bigr)=\mathbf 0\), and therefore \[Cov(\mathbf {\overline X})=\frac{1}{n^2}\sum_{j=1}^nE\Bigl((\mathbf X_j-\boldsymbol\mu)(\mathbf X_j-\boldsymbol\mu)^T\Bigr)=\frac{1}{n^2}\sum_{j=1}^n\mathbf\Sigma=\frac{1}{n}\mathbf\Sigma\] where \(\mathbf\Sigma\) is the population variance–covariance matrix.
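The \(\frac{1}{n}\mathbf\Sigma\) result can be seen empirically by drawing many replicate samples and looking at the spread of their sample mean vectors (illustrative sketch; the population parameters and replicate counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
p, n, reps = 2, 50, 20_000

mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.4],
                  [0.4, 2.0]])

# For each replicate, draw a sample of size n and record its mean vector
xbars = np.array([rng.multivariate_normal(mu, Sigma, size=n).mean(axis=0)
                  for _ in range(reps)])

print(np.cov(xbars.T, ddof=1))  # empirical Cov(Xbar)
print(Sigma / n)                # theoretical Sigma / n
```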

To obtain the expected value of the sample variance–covariance matrix \(\mathbf S_n=\frac{1}{n}\displaystyle\sum_{j=1}^n(\mathbf X_j-\mathbf{\overline X})(\mathbf X_j-\mathbf{\overline X})^T\): \[\begin{align} E(n\mathbf S_n)=E\sum_{j=1}^n(\mathbf X_j-\mathbf{\overline X})(\mathbf X_j-\mathbf{\overline X})^T&=E\sum_{j=1}^n(\mathbf X_j-\mathbf{\overline X})(\mathbf X_j^T-\mathbf{\overline X}^T)\\ &=E\sum_{j=1}^n\Bigl((\mathbf X_j-\mathbf{\overline X})\mathbf X_j^T-(\mathbf X_j-\mathbf{\overline X})\mathbf{\overline X}^T\Bigr)\\ &=E\Biggl(\sum_{j=1}^n(\mathbf X_j-\mathbf{\overline X})\mathbf X_j^T-\sum_{j=1}^n(\mathbf X_j-\mathbf{\overline X})\mathbf{\overline X}^T\Biggr)\\ &=E\sum_{j=1}^n(\mathbf X_j-\mathbf{\overline X})\mathbf X_j^T\qquad\text{since }\sum_{j=1}^n(\mathbf X_j-\mathbf{\overline X})=\mathbf 0\\ &=E\Bigl(\sum_{j=1}^n\mathbf X_j\mathbf X_j^T-\sum_{j=1}^n\mathbf{\overline X}\mathbf X_j^T\Bigr)\\ &=E\Bigl(\sum_{j=1}^n\mathbf X_j\mathbf X_j^T-n\mathbf{\overline X}\mathbf{\overline X}^T\Bigr)\\ &=\sum_{j=1}^nE(\mathbf X_j\mathbf X_j^T)-nE(\mathbf{\overline X}\mathbf{\overline X}^T)\\ &=\sum_{j=1}^n(\mathbf\Sigma+\boldsymbol\mu\boldsymbol\mu^T)-n(\frac{1}{n}\mathbf\Sigma+\boldsymbol\mu\boldsymbol\mu^T)\\ &=(n-1)\mathbf\Sigma \end{align}\] Then \(E(\mathbf S_n)=\frac{n-1}{n}\mathbf\Sigma\), so the unbiased sample variance–covariance matrix \[\underset{(p,p)}{\mathbf S}=\frac{n}{n-1}\mathbf S_n=\frac{1}{n-1}\sum_{j=1}^n(\mathbf X_j-\mathbf{\overline X})(\mathbf X_j-\mathbf{\overline X})^T\] is an unbiased estimator of the population variance–covariance matrix \(\mathbf \Sigma\). Here \(\underset{(p,p)}{\mathbf S}\) has \((i,k)^{th}\) entry \[s_{ik}=\frac{1}{n-1}\sum_{j=1}^{n}(X_{ji}-\overline X_i)(X_{jk}-\overline X_k),\quad 1\le i,k\le p\] and contains \(p\) variances and \(\frac{1}{2}(p^2-p)\) distinct covariances.
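A short simulation (illustrative addition; population parameters are arbitrary) makes the bias of the divisor-\(n\) estimator visible next to the unbiased divisor-\((n-1)\) version:

```python
import numpy as np

rng = np.random.default_rng(6)
p, n, reps = 2, 10, 50_000

mu = np.zeros(p)
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])

Sn_sum = np.zeros((p, p))
S_sum = np.zeros((p, p))
for _ in range(reps):
    X = rng.multivariate_normal(mu, Sigma, size=n)
    d = X - X.mean(axis=0)          # deviations from the sample mean
    Sn_sum += d.T @ d / n           # divisor n     -> S_n
    S_sum += d.T @ d / (n - 1)      # divisor n - 1 -> S

print(Sn_sum / reps)                # ~ (n-1)/n * Sigma  (biased)
print(S_sum / reps)                 # ~ Sigma            (unbiased)
```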

We can also use matrix operations on the data matrix \[\mathbf X=\begin{bmatrix} x_{11}&x_{12}&\cdots&x_{1p}\\ x_{21}&x_{22}&\cdots&x_{2p}\\ \vdots&\vdots&\ddots&\vdots\\ x_{n1}&x_{n2}&\cdots&x_{np} \end{bmatrix}\] to calculate \(\overline {\mathbf X}\) and \(\mathbf S\): let \[\mathbf 1=\begin{bmatrix} 1\\ 1\\ \vdots\\ 1 \end{bmatrix}\] then \[\overline {\mathbf X}=\frac{1}{n}\mathbf X^T\mathbf 1\] and \[\mathbf 1\overline {\mathbf X}^T=\mathbf 1(\frac{1}{n}\mathbf X^T\mathbf 1)^T=\frac{1}{n}\mathbf 1\mathbf 1^T\mathbf X=\begin{bmatrix} \bar{x_1}&\bar{x_2}&\cdots&\bar{x_p}\\ \bar{x_1}&\bar{x_2}&\cdots&\bar{x_p}\\ \vdots&\vdots&\ddots&\vdots\\ \bar{x_1}&\bar{x_2}&\cdots&\bar{x_p} \end{bmatrix}\] Subtracting this result from \(\mathbf X\) produces the matrix of deviations: \[\mathbf X-\frac{1}{n}\mathbf 1\mathbf 1^T\mathbf X=\begin{bmatrix} x_{11}-\bar{x_1}&x_{12}-\bar{x_2}&\cdots&x_{1p}-\bar{x_p}\\ x_{21}-\bar{x_1}&x_{22}-\bar{x_2}&\cdots&x_{2p}-\bar{x_p}\\ \vdots&\vdots&\ddots&\vdots\\ x_{n1}-\bar{x_1}&x_{n2}-\bar{x_2}&\cdots&x_{np}-\bar{x_p} \end{bmatrix}\] Now the matrix \((n-1)\mathbf S\) of sums of squares and cross-products is: \[\begin{align} (n-1)\mathbf S&=(\mathbf X-\frac{1}{n}\mathbf 1\mathbf 1^T\mathbf X)^T(\mathbf X-\frac{1}{n}\mathbf 1\mathbf 1^T\mathbf X)\\ &=\Bigl((\mathbf I-\frac{1}{n}\mathbf 1\mathbf 1^T)\mathbf X\Bigr)^T\Bigl((\mathbf I-\frac{1}{n}\mathbf 1\mathbf 1^T)\mathbf X\Bigr)\\ &=\mathbf X^T(\mathbf I-\frac{1}{n}\mathbf 1\mathbf 1^T)^T(\mathbf I-\frac{1}{n}\mathbf 1\mathbf 1^T)\mathbf X\\ &=\mathbf X^T(\mathbf I-\frac{1}{n}\mathbf 1\mathbf 1^T)\mathbf X\\ \end{align}\] since the centering matrix \(\mathbf I-\frac{1}{n}\mathbf 1\mathbf 1^T\) is symmetric and idempotent. Then \[\mathbf S=\frac{1}{n-1}\mathbf X^T(\mathbf I-\frac{1}{n}\mathbf 1\mathbf 1^T)\mathbf X\]
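The centering-matrix formula is easy to cross-check against NumPy's built-in estimator (illustrative sketch on arbitrary simulated data):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 500, 3
X = rng.normal(size=(n, p))                 # data matrix, rows = observations

one = np.ones((n, 1))
H = np.eye(n) - one @ one.T / n             # centering matrix I - (1/n) 1 1^T

xbar = X.T @ one / n                        # (1/n) X^T 1, the sample mean vector
S = X.T @ H @ X / (n - 1)                   # S = 1/(n-1) X^T (I - (1/n) 1 1^T) X

print(np.allclose(xbar.ravel(), X.mean(axis=0)))  # True
print(np.allclose(S, np.cov(X.T, ddof=1)))        # True
```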

The \(p\times p\) sample standard deviation matrix is \[ \underset{(p\times p)}{\mathbf D^{\frac{1}{2}}}=\begin{bmatrix} \sqrt{s_{11}}&0&\cdots&0\\ 0&\sqrt{s_{22}}&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\cdots&\sqrt{s_{pp}} \end{bmatrix} \] and its inverse is \[ \underset{(p\times p)}{\mathbf D^{-\frac{1}{2}}}=\begin{bmatrix} \frac{1}{\sqrt{s_{11}}}&0&\cdots&0\\ 0&\frac{1}{\sqrt{s_{22}}}&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\cdots&\frac{1}{\sqrt{s_{pp}}} \end{bmatrix} \] Since \[ \underset{(p\times p)}{\mathbf S}=\begin{bmatrix} s_{11}&s_{12}&\cdots&s_{1p}\\ s_{12}&s_{22}&\cdots&s_{2p}\\ \vdots&\vdots&\ddots&\vdots\\ s_{1p}&s_{2p}&\cdots&s_{pp} \end{bmatrix} \] and the sample correlation matrix is \[ \underset{(p\times p)}{\mathbf R}=\begin{bmatrix} \frac{s_{11}}{\sqrt{s_{11}}\sqrt{s_{11}}}&\frac{s_{12}}{\sqrt{s_{11}}\sqrt{s_{22}}}&\cdots&\frac{s_{1p}}{\sqrt{s_{11}}\sqrt{s_{pp}}}\\ \frac{s_{12}}{\sqrt{s_{11}}\sqrt{s_{22}}}&\frac{s_{22}}{\sqrt{s_{22}}\sqrt{s_{22}}}&\cdots&\frac{s_{2p}}{\sqrt{s_{22}}\sqrt{s_{pp}}}\\ \vdots&\vdots&\ddots&\vdots\\ \frac{s_{1p}}{\sqrt{s_{11}}\sqrt{s_{pp}}}&\frac{s_{2p}}{\sqrt{s_{22}}\sqrt{s_{pp}}}&\cdots&\frac{s_{pp}}{\sqrt{s_{pp}}\sqrt{s_{pp}}}\\ \end{bmatrix}=\begin{bmatrix} 1&r_{12}&\cdots&r_{1p}\\ r_{12}&1&\cdots&r_{2p}\\ \vdots&\vdots&\ddots&\vdots\\ r_{1p}&r_{2p}&\cdots&1\\ \end{bmatrix} \] we have \[\underset{(p\times p)}{\mathbf R}=\underset{(p\times p)}{\mathbf D^{-\frac{1}{2}}}\underset{(p\times p)}{\mathbf S}\underset{(p\times p)}{\mathbf D^{-\frac{1}{2}}}\] and \[\mathbf S=\mathbf D^{\frac{1}{2}}\mathbf R\mathbf D^{\frac{1}{2}}\]
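Finally, a small NumPy check (illustrative addition on arbitrary simulated data) that \(\mathbf D^{-\frac{1}{2}}\mathbf S\mathbf D^{-\frac{1}{2}}\) reproduces the sample correlation matrix and that the relation inverts back to \(\mathbf S\):

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(400, 3))
X[:, 1] += 0.7 * X[:, 0]                    # introduce some correlation

S = np.cov(X.T, ddof=1)                     # sample variance-covariance matrix
D_sqrt = np.diag(np.sqrt(np.diag(S)))       # D^(1/2)
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(S)))

R = D_inv_sqrt @ S @ D_inv_sqrt             # R = D^(-1/2) S D^(-1/2)
print(np.allclose(R, np.corrcoef(X.T)))     # True
print(np.allclose(S, D_sqrt @ R @ D_sqrt))  # True: S = D^(1/2) R D^(1/2)
```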