
The Multivariate Normal Density

The univariate normal pdf is: \[f_X(x)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}, \quad -\infty<x<+\infty\] The term \((\frac{x-\mu}{\sigma})^2=(x-\mu)(\sigma^2)^{-1}(x-\mu)\) measures the square of the distance from \(x\) to \(\mu\) in standard deviation units. It generalizes, for a \(p\times 1\) vector \(\mathbf x\) of observations on several variables, to \((\mathbf x-\boldsymbol \mu)^T\mathbf \Sigma^{-1}(\mathbf x-\boldsymbol \mu)\), the square of the multivariate generalized distance from \(\mathbf x\) to \(\boldsymbol \mu\), where the \(p\times p\) matrix \(\mathbf \Sigma\) is the variance–covariance matrix of \(\mathbf X\). Because the normalizing constant \((2\pi)^{-p/2}|\boldsymbol\Sigma|^{-1/2}\) makes the volume under the surface of the density equal to unity for any \(p\), the \(p\)-dimensional normal density for the random vector \(\mathbf X^T=[X_1,X_2,\cdots,X_p]\) has the form \[ f_X(\mathbf x)=\frac{1}{(2\pi)^{p/2}|\mathbf\Sigma|^{1/2}}e^{-\frac{1}{2}(\mathbf x-\boldsymbol \mu)^T\mathbf\Sigma^{-1}(\mathbf x-\boldsymbol \mu)}\] which is denoted by \(N_p(\boldsymbol \mu, \mathbf\Sigma)\).
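As a quick numeric check, the \(N_p(\boldsymbol\mu,\mathbf\Sigma)\) density formula above can be evaluated directly and compared with SciPy's implementation (the mean, covariance, and evaluation point below are arbitrary assumed values):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Assumed example parameters and evaluation point.
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.1],
                  [0.3, 1.0, 0.2],
                  [0.1, 0.2, 1.5]])
x = np.array([0.8, -1.5, 0.0])
p = len(mu)

diff = x - mu
# squared generalized distance (x - mu)^T Sigma^{-1} (x - mu)
d2 = diff @ np.linalg.solve(Sigma, diff)
# density from the formula: exp(-d2/2) / ((2*pi)^{p/2} |Sigma|^{1/2})
density = np.exp(-0.5 * d2) / ((2 * np.pi) ** (p / 2) * np.linalg.det(Sigma) ** 0.5)

reference = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
```

The two values agree to floating-point precision.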

For the bivariate normal distribution, \(X_1\) and \(X_2\) are random variables with joint pdf \[\begin{align} f_{X_1,X_2}(x_1, x_2)&=\frac{1}{2\pi\sqrt{\sigma_{11}\sigma_{22}}\sqrt{1-\rho_{12}^2}}exp\Biggl[-\frac{1}{2}\frac{1}{1-\rho_{12}^2}\Bigl[(\frac{x_1-\mu_1}{\sqrt{\sigma_{11}}})^2-2\rho_{12}(\frac{x_1-\mu_1}{\sqrt{\sigma_{11}}})(\frac{x_2-\mu_2}{\sqrt{\sigma_{22}}})+(\frac{x_2-\mu_2}{\sqrt{\sigma_{22}}})^2\Bigr]\Biggr] \end{align}\] where \(\rho_{12}=\frac{Cov(X_1,X_2)}{\sqrt{\sigma_{11}}\sqrt{\sigma_{22}}}=\frac{\sigma_{12}}{\sqrt{\sigma_{11}}\sqrt{\sigma_{22}}}\) is the correlation coefficient between \(X_1\) and \(X_2\). The inverse of the covariance matrix \[\mathbf \Sigma=\begin{bmatrix} \sigma_{11}&\sigma_{12}\\ \sigma_{12}&\sigma_{22} \end{bmatrix} \] is \[\mathbf \Sigma^{-1}=\frac{1}{\sigma_{11}\sigma_{22}-\sigma_{12}^2}\begin{bmatrix} \sigma_{22}&-\sigma_{12}\\ -\sigma_{12}&\sigma_{11} \end{bmatrix} \] and because \(\sigma_{11}\sigma_{22}-\sigma_{12}^2=\sigma_{11}\sigma_{22}(1-\rho_{12}^2)\), the squared distance becomes \[\begin{align} (\mathbf x-\boldsymbol \mu)^T\mathbf \Sigma^{-1}(\mathbf x-\boldsymbol \mu)&=\begin{bmatrix} x_1-\mu_1&x_2-\mu_2 \end{bmatrix}\frac{1}{\sigma_{11}\sigma_{22}(1-\rho_{12}^2)}\begin{bmatrix} \sigma_{22}&-\sigma_{12}\\ -\sigma_{12}&\sigma_{11} \end{bmatrix}\begin{bmatrix} x_1-\mu_1\\ x_2-\mu_2 \end{bmatrix}\\ &=\begin{bmatrix} x_1-\mu_1&x_2-\mu_2 \end{bmatrix}\frac{1}{\sigma_{11}\sigma_{22}(1-\rho_{12}^2)}\begin{bmatrix} \sigma_{22}&-\rho_{12}\sqrt{\sigma_{11}}\sqrt{\sigma_{22}}\\ -\rho_{12}\sqrt{\sigma_{11}}\sqrt{\sigma_{22}}&\sigma_{11} \end{bmatrix}\begin{bmatrix} x_1-\mu_1\\ x_2-\mu_2 \end{bmatrix}\\ &=\frac{\sigma_{22}(x_1-\mu_1)^2+\sigma_{11}(x_2-\mu_2)^2-2\rho_{12}\sqrt{\sigma_{11}}\sqrt{\sigma_{22}}(x_1-\mu_1)(x_2-\mu_2)}{\sigma_{11}\sigma_{22}(1-\rho_{12}^2)}\\ 
&=\frac{1}{(1-\rho_{12}^2)}\Biggl[\Biggl(\frac{x_1-\mu_1}{\sqrt{\sigma_{11}}}\Biggr)^2+\Biggl(\frac{x_2-\mu_2}{\sqrt{\sigma_{22}}}\Biggr)^2-2\rho_{12}\Biggl(\frac{x_1-\mu_1}{\sqrt{\sigma_{11}}}\Biggr)\Biggl(\frac{x_2-\mu_2}{\sqrt{\sigma_{22}}}\Biggr)\Biggr] \end{align}\] Since \(|\mathbf \Sigma|=\sigma_{11}\sigma_{22}-\sigma_{12}^2=\sigma_{11}\sigma_{22}(1-\rho_{12}^2)\), the compact general form of the bivariate normal density is \[f_{X_1,X_2}(x_1, x_2)=f(\mathbf x)=\frac{1}{2\pi|\mathbf\Sigma|^{1/2}}e^{-\frac{1}{2}(\mathbf x-\boldsymbol \mu)^T(\mathbf\Sigma)^{-1}(\mathbf x-\boldsymbol \mu)}\]
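A small sanity check (with assumed parameter values) that the expanded bivariate form and the compact matrix form give the same density:

```python
import numpy as np

# Assumed example parameters.
mu1, mu2 = 1.0, 2.0
s11, s22, rho = 2.0, 3.0, 0.6
s12 = rho * np.sqrt(s11 * s22)
Sigma = np.array([[s11, s12], [s12, s22]])
x1, x2 = 0.5, 2.5

# Expanded form in standardized coordinates.
z1 = (x1 - mu1) / np.sqrt(s11)
z2 = (x2 - mu2) / np.sqrt(s22)
expanded = (1 / (2 * np.pi * np.sqrt(s11 * s22) * np.sqrt(1 - rho**2))
            * np.exp(-(z1**2 - 2 * rho * z1 * z2 + z2**2) / (2 * (1 - rho**2))))

# Compact matrix form 1/(2*pi*|Sigma|^{1/2}) * exp(-d2/2).
diff = np.array([x1 - mu1, x2 - mu2])
d2 = diff @ np.linalg.solve(Sigma, diff)
compact = np.exp(-0.5 * d2) / (2 * np.pi * np.sqrt(np.linalg.det(Sigma)))
```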

The set \(\{ \text{all } \mathbf x \text{ such that } (\mathbf x-\boldsymbol \mu)^T\mathbf \Sigma^{-1}(\mathbf x-\boldsymbol \mu)=c^2 \}\) is called a constant probability density contour.

If \((\lambda, \mathbf e), \mathbf e\ne\mathbf 0\), is an eigenvalue–eigenvector pair of the positive definite matrix \(\mathbf \Sigma\) (with \(\mathbf e\) of unit length), then \(0<\mathbf e^T\mathbf\Sigma\mathbf e=\mathbf e^T\lambda\mathbf e=\lambda\mathbf e^T\mathbf e=\lambda\), and \(\mathbf e=\mathbf \Sigma^{-1}\mathbf \Sigma\mathbf e=\mathbf \Sigma^{-1}\lambda\mathbf e\Rightarrow\frac{1}{\lambda}\mathbf e=\mathbf\Sigma^{-1}\mathbf e\), which shows \((\frac{1}{\lambda}, \mathbf e)\) is an eigenvalue–eigenvector pair of \(\mathbf \Sigma^{-1}\). The matrix \(\mathbf \Sigma^{-1}\) is also positive definite, because \[\begin{align} \mathbf x^T\mathbf \Sigma^{-1}\mathbf x&=\mathbf x^T\Biggl(\displaystyle\sum_{i=1}^{p}\frac{1}{\lambda_i}\mathbf e_i\mathbf e_i^T\Biggr)\mathbf x\\ &=\displaystyle\sum_{i=1}^{p}\frac{1}{\lambda_i}\mathbf x^T(\mathbf e_i\mathbf e_i^T)\mathbf x\\ &=\displaystyle\sum_{i=1}^{p}\frac{1}{\lambda_i}(\mathbf x^T\mathbf e_i)^2\ge0 \end{align}\] with equality only when \(\mathbf x=\mathbf 0\), since the orthonormal \(\mathbf e_i\) span the whole space. The contours \((\mathbf x-\boldsymbol \mu)^T\mathbf \Sigma^{-1}(\mathbf x-\boldsymbol \mu)=c^2\) are therefore ellipsoids centered at \(\boldsymbol \mu\) whose axes are \(\pm c\sqrt{\lambda_i}\mathbf e_i\).
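These eigenvalue facts are easy to confirm numerically. The sketch below (with an assumed \(2\times 2\) \(\mathbf\Sigma\)) checks that \(\mathbf\Sigma^{-1}\) has eigenpairs \((1/\lambda_i, \mathbf e_i)\) and that the point \(\boldsymbol\mu+\sqrt{\lambda_1}\mathbf e_1\) lies on the \(c=1\) contour:

```python
import numpy as np

# Assumed example covariance matrix.
Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])
lam, E = np.linalg.eigh(Sigma)           # columns of E are unit eigenvectors e_i
Sigma_inv = np.linalg.inv(Sigma)

# Sigma^{-1} e_i should equal (1/lam_i) e_i for each i.
resid = Sigma_inv @ E - E / lam

# A point on the c = 1 contour along the first axis: mu + sqrt(lam_1) e_1.
mu = np.zeros(2)
pt = mu + np.sqrt(lam[0]) * E[:, 0]
c2 = (pt - mu) @ Sigma_inv @ (pt - mu)   # should equal 1
```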

  • If \(\mathbf X\) is distributed as \(N_p(\boldsymbol \mu, \mathbf \Sigma)\) with \(|\mathbf\Sigma|>0\), then \((\mathbf X-\boldsymbol \mu)^T\mathbf \Sigma^{-1}(\mathbf X-\boldsymbol \mu)\) is distributed as \(\chi_p^2\), because \[\begin{align} (\mathbf X-\boldsymbol \mu)^T\mathbf \Sigma^{-1}(\mathbf X-\boldsymbol \mu)&=(\mathbf X-\boldsymbol \mu)^T\sum_{i=1}^{p}\frac{1}{\lambda_i}\mathbf e_i\mathbf e_i^T(\mathbf X-\boldsymbol \mu)\\ &=\sum_{i=1}^{p}\frac{1}{\lambda_i}(\mathbf X-\boldsymbol \mu)^T\mathbf e_i\mathbf e_i^T(\mathbf X-\boldsymbol \mu)\\ &=\sum_{i=1}^{p}\frac{1}{\lambda_i}(\mathbf e_i^T(\mathbf X-\boldsymbol \mu))^2\\ &=\sum_{i=1}^{p}\Bigl(\frac{1}{\sqrt{\lambda_i}}\mathbf e_i^T(\mathbf X-\boldsymbol \mu)\Bigr)^2\\ &=\sum_{i=1}^{p}Z_i^2 \end{align}\] If we denote \(\mathbf Z=\mathbf A(\mathbf X-\boldsymbol \mu)=\begin{bmatrix} \frac{1}{\sqrt{\lambda_1}}\mathbf e_1^T\\ \frac{1}{\sqrt{\lambda_2}}\mathbf e_2^T\\ \vdots\\ \frac{1}{\sqrt{\lambda_p}}\mathbf e_p^T\\ \end{bmatrix}(\mathbf X-\boldsymbol \mu)\), then \(\mathbf Z\) is distributed as \(N_p(\mathbf 0, \mathbf A\mathbf \Sigma\mathbf A^T)\), where \(\mathbf A\mathbf \Sigma\mathbf A^T=\begin{bmatrix} \frac{1}{\sqrt{\lambda_1}}\mathbf e_1^T\\ \frac{1}{\sqrt{\lambda_2}}\mathbf e_2^T\\ \vdots\\ \frac{1}{\sqrt{\lambda_p}}\mathbf e_p^T\\ \end{bmatrix}\Biggl[\displaystyle\sum_{i=1}^{p}\lambda_i\mathbf e_i\mathbf e_i^T\Biggr]\begin{bmatrix} \frac{1}{\sqrt{\lambda_1}}\mathbf e_1& \frac{1}{\sqrt{\lambda_2}}\mathbf e_2& \cdots& \frac{1}{\sqrt{\lambda_p}}\mathbf e_p \end{bmatrix}=\mathbf I\), so the \(Z_i\) are independent standard normal variables.
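The whitening construction can be checked numerically: building \(\mathbf A\) from the eigendecomposition of an assumed \(\mathbf\Sigma\) gives \(\mathbf A\mathbf\Sigma\mathbf A^T=\mathbf I\).

```python
import numpy as np

# Assumed example covariance matrix.
Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 3.0, 0.2],
                  [0.5, 0.2, 2.0]])
lam, E = np.linalg.eigh(Sigma)    # lam[i], E[:, i] are eigenpairs
A = (E / np.sqrt(lam)).T          # row i of A is e_i^T / sqrt(lam_i)
I_hat = A @ Sigma @ A.T           # should be the identity matrix
```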

  • When \(\mathbf X\) is distributed as \(N_p(\boldsymbol \mu, \mathbf \Sigma)\), \((\mathbf X-\boldsymbol \mu)^T\mathbf \Sigma^{-1}(\mathbf X-\boldsymbol \mu)\) is the squared statistical distance from \(\mathbf X\) to \(\boldsymbol \mu\).

  • If \(\mathbf X\) is distributed as \(N_p(\boldsymbol \mu, \mathbf \Sigma)\), any linear combination of the components of \(\mathbf X\), \(\mathbf a^T\mathbf X=a_1X_1+a_2X_2+\cdots+a_pX_p\), is also normally distributed, as \(N(\mathbf a^T\boldsymbol \mu, \mathbf a^T\mathbf \Sigma\mathbf a)\), where \[\begin{align} \mathbf a^T\mathbf \Sigma\mathbf a&=\begin{bmatrix} a_{1}&a_{2}&\cdots&a_{p} \end{bmatrix}\begin{bmatrix} \sigma_{11}&\sigma_{12}&\cdots&\sigma_{1p}\\ \sigma_{12}&\sigma_{22}&\cdots&\sigma_{2p}\\ \vdots&\vdots&\ddots&\vdots\\ \sigma_{1p}&\sigma_{2p}&\cdots&\sigma_{pp}\\ \end{bmatrix}\begin{bmatrix} a_{1}\\ a_{2}\\ \vdots\\ a_{p} \end{bmatrix}\\ \end{align}\]

  • If \(\mathbf X\) is distributed as \(N_p(\boldsymbol \mu, \mathbf \Sigma)\), the \(q\) linear combinations of the components of \(\mathbf X\), \[\underset{(q\times p)}{\mathbf A}\underset{(p\times 1)}{\mathbf X}=\begin{bmatrix} a_{11}X_1+a_{12}X_2+\cdots+a_{1p}X_p\\ a_{21}X_1+a_{22}X_2+\cdots+a_{2p}X_p\\ \vdots\\ a_{q1}X_1+a_{q2}X_2+\cdots+a_{qp}X_p\\ \end{bmatrix}\] are also normally distributed, as \(N_q(\underset{q\times p}{\mathbf A}\underset{p\times 1}{\boldsymbol \mu}, \underset{q\times p}{\mathbf A}\underset{p\times p}{\mathbf \Sigma}\underset{p\times q}{\mathbf A^T})\).
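Both linear-combination results can be verified by simulation (the \(\boldsymbol\mu\), \(\mathbf\Sigma\), and \(\mathbf A\) below are assumed example values): the sample mean and covariance of \(\mathbf A\mathbf X\) should approach \(\mathbf A\boldsymbol\mu\) and \(\mathbf A\mathbf\Sigma\mathbf A^T\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed example parameters.
mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
A = np.array([[1.0, -1.0, 0.0],
              [0.5, 0.5, 1.0]])     # q = 2 linear combinations

X = rng.multivariate_normal(mu, Sigma, size=100_000)   # n x p sample
Y = X @ A.T                                            # n x q: rows are A x_j

mean_err = np.abs(Y.mean(axis=0) - A @ mu).max()
cov_err = np.abs(np.cov(Y.T) - A @ Sigma @ A.T).max()
```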

  • The marginal distribution of any component \(X_i, i\in \{1,\cdots,p\}\), of \(\mathbf X\) is \(N(\mu_i, \sigma_{ii})\).

  • The conditional density of \(X_1\) given that \(X_2=x_2\) for the bivariate normal distribution is \[\begin{align} f(X_1|x_2)&=\frac{f(x_1, x_2)}{f(x_2)}\\ &=\frac{\frac{1}{2\pi\sqrt{\sigma_{11}\sigma_{22}}\sqrt{1-\rho_{12}^2}}exp\Biggl[-\frac{1}{2}\frac{1}{1-\rho_{12}^2}\Bigl[(\frac{x_1-\mu_1}{\sqrt{\sigma_{11}}})^2-2\rho_{12}(\frac{x_1-\mu_1}{\sqrt{\sigma_{11}}})(\frac{x_2-\mu_2}{\sqrt{\sigma_{22}}})+(\frac{x_2-\mu_2}{\sqrt{\sigma_{22}}})^2\Bigr]\Biggr]}{\frac{1}{\sqrt{2\pi}\sqrt{\sigma_{22}}}e^{-\frac{1}{2}\frac{(x_2-\mu_2)^2}{\sigma_{22}}}}\\ &=\frac{\frac{1}{2\pi\sqrt{\sigma_{11}\sigma_{22}}\sqrt{1-\rho_{12}^2}}exp\Biggl[-\frac{1}{2}\frac{1}{\sigma_{11}(1-\rho_{12}^2)}\Bigl[(x_1-\mu_1)^2-2\rho_{12}\frac{(x_1-\mu_1)(x_2-\mu_2)\sqrt{\sigma_{11}}}{\sqrt{\sigma_{22}}}+\frac{(x_2-\mu_2)^2\sigma_{11}}{\sigma_{22}}\Bigr]\Biggr]}{\frac{1}{\sqrt{2\pi}\sqrt{\sigma_{22}}}e^{-\frac{1}{2}\frac{(x_2-\mu_2)^2}{\sigma_{22}}}}\\ &=\frac{\frac{1}{2\pi\sqrt{\sigma_{11}\sigma_{22}}\sqrt{1-\rho_{12}^2}}exp\Biggl[-\frac{1}{2}\frac{1}{\sigma_{11}(1-\rho_{12}^2)}\Bigl[\Bigl((x_1-\mu_1)-\rho_{12}\frac{\sqrt{\sigma_{11}}}{\sqrt{\sigma_{22}}}(x_2-\mu_2)\Bigr)^2-\rho_{12}^2\frac{\sigma_{11}}{\sigma_{22}}(x_2-\mu_2)^2+\frac{(x_2-\mu_2)^2\sigma_{11}}{\sigma_{22}}\Bigr]\Biggr]}{\frac{1}{\sqrt{2\pi}\sqrt{\sigma_{22}}}e^{-\frac{1}{2}\frac{(x_2-\mu_2)^2}{\sigma_{22}}}}\\ &=\frac{\frac{1}{2\pi\sqrt{\sigma_{11}\sigma_{22}}\sqrt{1-\rho_{12}^2}}exp\Biggl[-\frac{1}{2}\frac{1}{\sigma_{11}(1-\rho_{12}^2)}\Bigl[\Bigl((x_1-\mu_1)-\rho_{12}\frac{\sqrt{\sigma_{11}}}{\sqrt{\sigma_{22}}}(x_2-\mu_2)\Bigr)^2+(1-\rho_{12}^2)\frac{\sigma_{11}}{\sigma_{22}}(x_2-\mu_2)^2\Bigr]\Biggr]}{\frac{1}{\sqrt{2\pi}\sqrt{\sigma_{22}}}e^{-\frac{1}{2}\frac{(x_2-\mu_2)^2}{\sigma_{22}}}}\\ &=\frac{\frac{1}{2\pi\sqrt{\sigma_{11}\sigma_{22}}\sqrt{1-\rho_{12}^2}}exp\Biggl[-\frac{1}{2}\frac{1}{\sigma_{11}(1-\rho_{12}^2)}\Bigl[\Bigl(x_1-\mu_1-\frac{\sigma_{12}}{\sigma_{22}}(x_2-\mu_2)\Bigr)^2+(1-\rho_{12}^2)\frac{\sigma_{11}}{\sigma_{22}}(x_2-\mu_2)^2\Bigr]\Biggr]}{\frac{1}{\sqrt{2\pi}\sqrt{\sigma_{22}}}e^{-\frac{1}{2}\frac{(x_2-\mu_2)^2}{\sigma_{22}}}}\\ &=\frac{\frac{1}{2\pi\sqrt{\sigma_{11}\sigma_{22}}\sqrt{1-\rho_{12}^2}}exp\Biggl[-\frac{1}{2}\frac{1}{\sigma_{11}(1-\rho_{12}^2)}\Bigl(x_1-\mu_1-\frac{\sigma_{12}}{\sigma_{22}}(x_2-\mu_2)\Bigr)^2-\frac{1}{2}\frac{1}{\sigma_{22}}(x_2-\mu_2)^2\Biggr]}{\frac{1}{\sqrt{2\pi}\sqrt{\sigma_{22}}}e^{-\frac{1}{2}\frac{(x_2-\mu_2)^2}{\sigma_{22}}}}\\ &=\frac{1}{\sqrt{2\pi}\sqrt{\sigma_{11}}\sqrt{1-\rho_{12}^2}}exp\Biggl[-\frac{1}{2}\frac{1}{\sigma_{11}(1-\rho_{12}^2)}\Bigl(x_1-\mu_1-\frac{\sigma_{12}}{\sigma_{22}}(x_2-\mu_2)\Bigr)^2\Biggr] \end{align}\] using \(\rho_{12}\frac{\sqrt{\sigma_{11}}}{\sqrt{\sigma_{22}}}=\frac{\sigma_{12}}{\sigma_{22}}\) in the fifth step. This is a normal distribution \(N(\mu_1+\frac{\sigma_{12}}{\sigma_{22}}(x_2-\mu_2),\sigma_{11}(1-\rho_{12}^2))\), with \(\mu_1+\frac{\sigma_{12}}{\sigma_{22}}(x_2-\mu_2)=\mu_1+\mathbf \Sigma_{12}\mathbf \Sigma_{22}^{-1}(x_2-\mu_2)\) and \(\sigma_{11}(1-\rho_{12}^2)=\sigma_{11}-\sigma_{11}\rho_{12}^2=\sigma_{11}-\sigma_{12}^2/\sigma_{22}=\mathbf \Sigma_{11}-\mathbf \Sigma_{12}\mathbf \Sigma_{22}^{-1}\mathbf \Sigma_{21}\)

  • The above result can be generalized: if \(\mathbf X=\begin{bmatrix} \mathbf X_1\\ \hdashline \mathbf X_2\\ \end{bmatrix}\) is distributed as \(N_p(\boldsymbol\mu, \mathbf\Sigma)\) with \(\boldsymbol\mu=\begin{bmatrix} \boldsymbol\mu_1\\ \hdashline \boldsymbol\mu_2\\ \end{bmatrix}\), \(\mathbf\Sigma=\left[\begin{array}{c:c} \mathbf\Sigma_{11}&\mathbf\Sigma_{12}\\ \hdashline \mathbf\Sigma_{21}&\mathbf\Sigma_{22}\\ \end{array} \right]\) and \(|\mathbf\Sigma_{22}|>0\), then the conditional distribution of \(\mathbf X_1\) given that \(\mathbf X_2=\mathbf x_2\) is a normal distribution with mean \(\boldsymbol\mu_1+\mathbf \Sigma_{12}\mathbf \Sigma_{22}^{-1}(\mathbf x_2-\boldsymbol\mu_2)\) and covariance \(\mathbf \Sigma_{11}-\mathbf \Sigma_{12}\mathbf \Sigma_{22}^{-1}\mathbf \Sigma_{21}\); the covariance does not depend on the value of the conditioning variables.
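In the bivariate case, the partitioned mean and covariance formulas reduce to the scalar expressions \(\mu_1+\frac{\sigma_{12}}{\sigma_{22}}(x_2-\mu_2)\) and \(\sigma_{11}(1-\rho_{12}^2)\); a minimal check with assumed numbers:

```python
import numpy as np

# Assumed example parameters.
mu = np.array([1.0, -1.0])
s11, s22, s12 = 2.0, 4.0, 1.5
Sigma = np.array([[s11, s12], [s12, s22]])
rho = s12 / np.sqrt(s11 * s22)
x2 = 0.5

# Partitioned formulas (here Sigma_12, Sigma_22 are scalars).
cond_mean = mu[0] + s12 / s22 * (x2 - mu[1])   # mu_1 + Sigma_12 Sigma_22^{-1} (x2 - mu_2)
cond_var = s11 - s12**2 / s22                  # Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21

# Scalar form of the conditional variance.
scalar_var = s11 * (1 - rho**2)
```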

  • Because \((\mathbf X-\boldsymbol \mu)^T\mathbf \Sigma^{-1}(\mathbf X-\boldsymbol \mu)\) is a scalar, \((\mathbf X-\boldsymbol \mu)^T\mathbf \Sigma^{-1}(\mathbf X-\boldsymbol \mu)=tr[(\mathbf X-\boldsymbol \mu)^T\mathbf \Sigma^{-1}(\mathbf X-\boldsymbol \mu)]=tr[\mathbf \Sigma^{-1}(\mathbf X-\boldsymbol \mu)(\mathbf X-\boldsymbol \mu)^T]\), so \[\begin{align} \displaystyle\sum_{j=1}^{n}(\mathbf X_j-\boldsymbol \mu)^T\mathbf \Sigma^{-1}(\mathbf X_j-\boldsymbol\mu)&=\displaystyle\sum_{j=1}^{n}tr\Bigl[\mathbf \Sigma^{-1}(\mathbf X_j-\boldsymbol\mu)(\mathbf X_j-\boldsymbol \mu)^T\Bigr]\\ &=tr\Bigl[\mathbf \Sigma^{-1}\sum_{j=1}^{n}(\mathbf X_j-\boldsymbol\mu)(\mathbf X_j-\boldsymbol \mu)^T\Bigr]\\ &=tr\Biggl[\mathbf \Sigma^{-1}\sum_{j=1}^{n}(\mathbf X_j-\overline{\mathbf X}+\overline{\mathbf X}-\boldsymbol\mu)(\mathbf X_j-\overline{\mathbf X}+\overline{\mathbf X}-\boldsymbol \mu)^T\Biggr]\\ &=tr\Biggl[\mathbf\Sigma^{-1}\Biggl(\sum_{j=1}^{n}(\mathbf X_j-\overline{\mathbf X})(\mathbf X_j-\overline{\mathbf X})^T+\sum_{j=1}^{n}(\overline{\mathbf X}-\boldsymbol\mu)(\overline{\mathbf X}-\boldsymbol \mu)^T\Biggr)\Biggr]\\ &=tr\Biggl[\mathbf\Sigma^{-1}\Biggl(\sum_{j=1}^{n}(\mathbf X_j-\overline{\mathbf X})(\mathbf X_j-\overline{\mathbf X})^T+n(\overline{\mathbf X}-\boldsymbol\mu)(\overline{\mathbf X}-\boldsymbol \mu)^T\Biggr)\Biggr]\\ &=tr\Biggl[\mathbf\Sigma^{-1}\sum_{j=1}^{n}(\mathbf X_j-\overline{\mathbf X})(\mathbf X_j-\overline{\mathbf X})^T\Biggr]+n\cdot tr\Biggl[\mathbf\Sigma^{-1}(\overline{\mathbf X}-\boldsymbol\mu)(\overline{\mathbf X}-\boldsymbol \mu)^T\Biggr]\\ &=tr\Biggl[\mathbf\Sigma^{-1}\sum_{j=1}^{n}(\mathbf X_j-\overline{\mathbf X})(\mathbf X_j-\overline{\mathbf X})^T\Biggr]+n\cdot tr\Biggl[(\overline{\mathbf X}-\boldsymbol \mu)^T\mathbf\Sigma^{-1}(\overline{\mathbf X}-\boldsymbol\mu)\Biggr]\\ &=tr\Biggl[\mathbf\Sigma^{-1}\sum_{j=1}^{n}(\mathbf X_j-\overline{\mathbf X})(\mathbf X_j-\overline{\mathbf X})^T\Biggr]+n(\overline{\mathbf X}-\boldsymbol \mu)^T\mathbf\Sigma^{-1}(\overline{\mathbf X}-\boldsymbol\mu)\\ \end{align}\] where the cross-product terms vanish in the fourth line because \(\sum_{j=1}^{n}(\mathbf X_j-\overline{\mathbf X})=\mathbf 0\). Because \(\mathbf\Sigma^{-1}\) is positive definite, \((\overline{\mathbf X}-\boldsymbol \mu)^T\mathbf\Sigma^{-1}(\overline{\mathbf X}-\boldsymbol\mu)\ge0\), with equality if and only if \(\boldsymbol\mu=\overline{\mathbf X}\); hence the sum is minimized over \(\boldsymbol\mu\) by \(\hat{\boldsymbol \mu}=\overline{\mathbf X}\)
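The trace decomposition can be verified on random data (the \(\boldsymbol\mu\) and \(\mathbf\Sigma\) below are assumed example values):

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed example parameters and a random sample.
p, n = 3, 50
mu = np.array([0.5, -0.5, 1.0])
Sigma = np.array([[2.0, 0.4, 0.1],
                  [0.4, 1.0, 0.3],
                  [0.1, 0.3, 1.5]])
Sigma_inv = np.linalg.inv(Sigma)
X = rng.multivariate_normal(mu, Sigma, size=n)
xbar = X.mean(axis=0)

# Left side: sum_j (x_j - mu)^T Sigma^{-1} (x_j - mu).
lhs = sum((x - mu) @ Sigma_inv @ (x - mu) for x in X)

# Right side: tr[Sigma^{-1} * scatter] + n (xbar - mu)^T Sigma^{-1} (xbar - mu).
scatter = sum(np.outer(x - xbar, x - xbar) for x in X)
rhs = np.trace(Sigma_inv @ scatter) + n * (xbar - mu) @ Sigma_inv @ (xbar - mu)
```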

  • For a \(p\times p\) symmetric positive definite matrix \(\mathbf A\), a scalar \(a>0\), and any positive definite \(\underset{(p\times p)}{\mathbf \Sigma}\), let \(\lambda_i\) be the eigenvalues of \(\mathbf A^{1/2}\mathbf\Sigma^{-1}\mathbf A^{1/2}\); then \[\begin{align} \frac{1}{|\mathbf \Sigma|^a}e^{-\frac{1}{2}tr(\mathbf\Sigma^{-1}\mathbf A)}&=\frac{1}{|\mathbf \Sigma|^a}e^{-\frac{1}{2}tr(\mathbf\Sigma^{-1}\mathbf A^{1/2}\mathbf A^{1/2})}\\ &=\frac{1}{|\mathbf \Sigma|^a}e^{-\frac{1}{2}tr(\mathbf A^{1/2}\mathbf\Sigma^{-1}\mathbf A^{1/2})}\\ &=\frac{1}{|\mathbf \Sigma|^a}e^{-\frac{1}{2}\displaystyle\sum_{i=1}^{p}\lambda_i}\\ \end{align}\] and \[\begin{align} \prod_{i=1}^{p}\lambda_i=|\mathbf A^{1/2}\mathbf\Sigma^{-1}\mathbf A^{1/2}|&=|\mathbf A^{1/2}||\mathbf\Sigma^{-1}||\mathbf A^{1/2}|\\ &=|\mathbf\Sigma^{-1}||\mathbf A^{1/2}||\mathbf A^{1/2}|\\ &=|\mathbf\Sigma^{-1}||\mathbf A|\\ &=\frac{1}{|\mathbf\Sigma|}|\mathbf A|\\ \end{align}\] so \(\frac{1}{|\mathbf\Sigma|}=\frac{\displaystyle\prod_{i=1}^{p}\lambda_i}{|\mathbf A|}\) and \[\begin{align} \frac{1}{|\mathbf \Sigma|^a}e^{-\frac{1}{2}tr(\mathbf\Sigma^{-1}\mathbf A)}&=\Biggl(\frac{\displaystyle\prod_{i=1}^{p}\lambda_i}{|\mathbf A|}\Biggr)^ae^{-\frac{1}{2}\sum_{i=1}^{p}\lambda_i}\\ &=\frac{1}{|\mathbf A|^a}\displaystyle\prod_{i=1}^{p}\lambda_i^ae^{-\frac{1}{2}\lambda_i} \end{align}\] Because \(\lambda_i^ae^{-\frac{1}{2}\lambda_i}\) has a maximum of \((2a)^ae^{-a}\), attained at \(\lambda_i=2a\), it follows that \(\frac{1}{|\mathbf \Sigma|^a}e^{-\frac{1}{2}tr(\mathbf\Sigma^{-1}\mathbf A)}\le\frac{1}{|\mathbf A|^a}(2a)^{ap}e^{-ap}\), with equality when every \(\lambda_i=2a\); in that case \(\frac{1}{|\mathbf\Sigma|}=\frac{\displaystyle\prod_{i=1}^{p}\lambda_i}{|\mathbf A|}=\frac{(2a)^p}{|\mathbf A|}\), \(tr(\mathbf\Sigma^{-1}\mathbf A)=2ap\), and \(\mathbf\Sigma=\frac{1}{2a}\mathbf A\)
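A numeric sketch of this maximization result, with an assumed \(\mathbf A\) and \(a\): the bound is attained at \(\mathbf\Sigma=\frac{1}{2a}\mathbf A\), and randomly drawn positive definite \(\mathbf\Sigma\) never exceed it.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed example values: a positive definite A and a scalar a > 0.
p, a = 3, 2.5
M = rng.standard_normal((p, p))
A = M @ M.T + p * np.eye(p)

def objective(S):
    # |Sigma|^{-a} * exp(-tr(Sigma^{-1} A) / 2)
    return np.linalg.det(S) ** (-a) * np.exp(-0.5 * np.trace(np.linalg.inv(S) @ A))

# Claimed upper bound |A|^{-a} (2a)^{ap} e^{-ap}.
bound = np.linalg.det(A) ** (-a) * (2 * a) ** (a * p) * np.exp(-a * p)

# The bound should be attained at Sigma = A / (2a).
at_max = objective(A / (2 * a))

# Random positive definite Sigmas should never exceed the bound.
vals = []
for _ in range(200):
    R = rng.standard_normal((p, p))
    vals.append(objective(R @ R.T + 0.1 * np.eye(p)))
max_rand = max(vals)
```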

  • Let \(\mathbf X_1, \mathbf X_2, \cdots, \mathbf X_n\) be a random sample from a normal population with mean \(\boldsymbol \mu\) and covariance \(\mathbf\Sigma\). The likelihood function is \[\begin{align} L(\boldsymbol \mu,\mathbf\Sigma)&=\prod_{j=1}^{n}\Biggl[\frac{1}{(2\pi)^{p/2}|\mathbf\Sigma|^{1/2}}e^{-\frac{1}{2}(\mathbf x_j-\boldsymbol \mu)^T\mathbf\Sigma^{-1}(\mathbf x_j-\boldsymbol \mu)}\Biggr]\\ &=\frac{1}{(2\pi)^{\frac{np}{2}}|\mathbf\Sigma|^{n/2}}e^{-\frac{1}{2}\sum_{j=1}^{n}(\mathbf x_j-\boldsymbol \mu)^T\mathbf\Sigma^{-1}(\mathbf x_j-\boldsymbol \mu)}\\ &=\frac{1}{(2\pi)^{\frac{np}{2}}|\mathbf\Sigma|^{n/2}}e^{-\frac{1}{2}tr\Biggl[\mathbf\Sigma^{-1}\sum_{j=1}^{n}(\mathbf x_j-\boldsymbol \mu)(\mathbf x_j-\boldsymbol\mu)^T\Biggr]}\\ &=\frac{1}{(2\pi)^{\frac{np}{2}}|\mathbf\Sigma|^{n/2}}e^{-\frac{1}{2}tr\Biggl[\mathbf\Sigma^{-1}\Biggl(\sum_{j=1}^{n}(\mathbf x_j-\overline{\mathbf x})(\mathbf x_j-\overline{\mathbf x})^T+n(\overline{\mathbf x}-\boldsymbol\mu)(\overline{\mathbf x}-\boldsymbol\mu)^T\Biggr)\Biggr]}\\ \end{align}\] The likelihood is maximized with respect to \(\boldsymbol\mu\) at \(\hat{\boldsymbol \mu}=\overline{\mathbf x}\), giving \(L(\hat{\boldsymbol\mu},\mathbf\Sigma)=\frac{1}{(2\pi)^{\frac{np}{2}}|\mathbf\Sigma|^{n/2}}e^{-\frac{1}{2}tr\Biggl[\mathbf\Sigma^{-1}\Biggl(\sum_{j=1}^{n}(\mathbf x_j-\overline{\mathbf x})(\mathbf x_j-\overline{\mathbf x})^T\Biggr)\Biggr]}\); applying the preceding maximization result with \(a=n/2\) and \(\mathbf A=\sum_{j=1}^{n}(\mathbf x_j-\overline{\mathbf x})(\mathbf x_j-\overline{\mathbf x})^T\), it is maximized with respect to \(\mathbf\Sigma\) at \(\hat{\mathbf\Sigma}=\frac{1}{2(n/2)}\displaystyle\sum_{j=1}^{n}(\mathbf x_j-\overline{\mathbf x})(\mathbf x_j-\overline{\mathbf x})^T=\frac{1}{n}\displaystyle\sum_{j=1}^{n}(\mathbf x_j-\overline{\mathbf x})(\mathbf x_j-\overline{\mathbf x})^T=\frac{n-1}{n}\mathbf S\), and the maximized likelihood is \(L(\hat{\boldsymbol\mu},\hat{\mathbf\Sigma})=\frac{1}{(2\pi)^{np/2}}\frac{1}{|\hat{\mathbf\Sigma}|^{n/2}}e^{-np/2}\)
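A simulation sketch (with assumed population parameters) confirming that the MLE of \(\mathbf\Sigma\) divides by \(n\), i.e. \(\hat{\mathbf\Sigma}=\frac{n-1}{n}\mathbf S\):

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed example population parameters.
mu = np.array([2.0, -1.0])
Sigma = np.array([[1.5, 0.6], [0.6, 1.0]])
n = 100_000
X = rng.multivariate_normal(mu, Sigma, size=n)

mu_hat = X.mean(axis=0)                        # MLE of mu: the sample mean
centered = X - mu_hat
Sigma_hat = centered.T @ centered / n          # MLE of Sigma: divides by n
S = np.cov(X.T)                                # sample covariance: divides by n-1

# Sigma_hat should equal ((n-1)/n) S exactly, up to floating-point error.
mle_vs_S = np.abs(Sigma_hat - (n - 1) / n * S).max()
```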

  • For the sample variance \(s^2\) of univariate normal random variables, \(s^2=\frac{1}{n-1}\sum_{j=1}^{n}(X_j-\overline X)^2\) and \(\frac{(n-1)s^2}{\sigma^2}=\sum_{j=1}^{n}(\frac{X_j-\overline X}{\sigma})^2\) is distributed as \(\chi_{n-1}^2\), so \(s^2\) is distributed as \(\frac{\sigma^2}{n-1}\chi_{n-1}^2\). Analogously, the sample covariance matrix \(\mathbf S\) of multivariate normal random variables \(N_p(\boldsymbol\mu,\mathbf\Sigma)\) satisfies \((n-1)\mathbf S=\sum_{j=1}^{n-1}\mathbf Z_j\mathbf Z_j^T\), where the \(\mathbf Z_j\) are independent \(N_p(\mathbf 0,\mathbf\Sigma)\) vectors, and the distribution of \(\sum_{j=1}^{n-1}\mathbf Z_j\mathbf Z_j^T\) is called the Wishart distribution with \(n-1\) degrees of freedom.

  • \(\overline {\mathbf X}\) is distributed as \(N_p(\boldsymbol\mu,\frac{1}{n}\mathbf\Sigma)\); then \(\sqrt{n}(\overline {\mathbf X}-\boldsymbol\mu)\) is distributed as \(N_p(\mathbf 0,\mathbf\Sigma)\) and \(n(\overline {\mathbf X}-\boldsymbol\mu)^T\mathbf\Sigma^{-1}(\overline {\mathbf X}-\boldsymbol\mu)\) is distributed as \(\chi_p^2\).
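This \(\chi_p^2\) fact can be checked by simulating many sample means (the parameters below are assumed example values); the average of \(n(\overline{\mathbf x}-\boldsymbol\mu)^T\mathbf\Sigma^{-1}(\overline{\mathbf x}-\boldsymbol\mu)\) should be close to \(p\), the mean of a \(\chi_p^2\) variable.

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed example parameters.
p, n, reps = 3, 20, 20_000
mu = np.zeros(p)
Sigma = np.array([[2.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.5]])
Sigma_inv = np.linalg.inv(Sigma)

# reps independent samples of size n, each of dimension p.
X = rng.multivariate_normal(mu, Sigma, size=(reps, n))
xbar = X.mean(axis=1)                       # reps x p matrix of sample means

# d2[i] = n (xbar_i - mu)^T Sigma^{-1} (xbar_i - mu)
d2 = n * np.einsum('ij,jk,ik->i', xbar - mu, Sigma_inv, xbar - mu)
mean_d2 = d2.mean()                         # should be near p = E[chi^2_p]
```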

  • Wishart distribution
    If the \(\mathbf X_j\) are drawn independently from a \(p\)-variate normal distribution with zero mean, \(N_p(\boldsymbol 0, \boldsymbol\Sigma)\), arrange them as the columns of the \(p\times n\) matrix \[\underset{(p\times n)}{\mathbf Z}=\begin{bmatrix} \mathbf X_1&\mathbf X_2&\cdots&\mathbf X_n \end{bmatrix}\] Then the Wishart distribution is the probability distribution of the \(p\times p\) random matrix \[\underset{(p\times p)}{\mathbf W}=\mathbf Z\mathbf Z^T=\sum_{j=1}^{n}\mathbf X_j\mathbf X_j^T\] which is known as the scatter matrix. The probability distribution of the scatter matrix is denoted \(\mathbf W_p(\boldsymbol\Sigma, n)\) or \(\mathbf W_p(\boldsymbol\Sigma, p, n)\), with \(n\) as the degrees of freedom, and the symmetric matrix \(\mathbf W\) is almost surely invertible when \(n \ge p\). If \(p = 1\) and \(\boldsymbol\Sigma = 1\), this distribution is the \(\chi^2\)-distribution with \(n\) degrees of freedom.
    Let \(\mathbf X\) be a \(p\times p\) symmetric positive definite matrix of random variables. The probability density function of the Wishart distribution \(\mathbf W_p(\boldsymbol\Sigma, n)\) with \(n\) degrees of freedom is: \(f_{\mathbf X}(\mathbf X)=\frac{1}{2^{\frac{np}{2}}|\boldsymbol\Sigma|^{n/2}\Gamma_p(\frac{n}{2})}|\mathbf X|^{(n-p-1)/2}e^{-\frac{1}{2}tr(\boldsymbol\Sigma^{-1}\mathbf X)}\), where \(|\mathbf X|\) is the determinant of \(\mathbf X\) and \(\Gamma_p(\frac{n}{2})\) is the multivariate gamma function defined as \[\Gamma_p(\frac{n}{2})=\pi^{p(p-1)/4}\prod_{j=1}^{p}\Gamma\Bigl(\frac{n}{2}-\frac{j-1}{2}\Bigr)\] This is the joint density of the \(p(p+1)/2\) distinct elements \(X_{ij}\), \(i\le j\).
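SciPy implements this density as scipy.stats.wishart; for \(p=1\) and scale \(1\) it matches the \(\chi_n^2\) density, and the sample mean of draws approaches \(n\boldsymbol\Sigma\) (the \(2\times 2\) scale matrix below is an assumed example):

```python
import numpy as np
from scipy import stats

# For p = 1 and scale 1, W_1(1, n) is the chi^2_n distribution.
n = 5
xs = np.linspace(0.5, 10.0, 25)
w_pdf = stats.wishart(df=n, scale=1.0).pdf(xs)
c_pdf = stats.chi2(df=n).pdf(xs)
max_diff = np.abs(w_pdf - c_pdf).max()

# E[W] = n * Sigma for general p: check by sampling with an assumed scale matrix.
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
W = stats.wishart(df=n, scale=Sigma).rvs(size=20_000, random_state=0)
mean_err = np.abs(W.mean(axis=0) - n * Sigma).max()
```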