
Linear Regression

If there are \(n\) points \((x_1,y_1),(x_2,y_2),\ldots,(x_n,y_n)\), the straight line \(y=a+bx\) that minimizes the sum of the squares of the vertical distances from the data points to the line, \(L=\sum_{i=1}^{n}(y_i-a-bx_i)^2\), is found by taking the partial derivatives of \(L\) with respect to \(a\) and \(b\) and setting them equal to \(0\), which gives the least squares coefficients \(a\) and \(b\): \[\frac{\partial L}{\partial b}=-2\sum_{i=1}^{n}(y_i-a-bx_i)x_i=0\], then \[\sum_{i=1}^{n}x_iy_i=a\sum_{i=1}^{n}x_i+b\sum_{i=1}^{n}x_i^2\]
And \[\frac{\partial L}{\partial a}=-2\sum_{i=1}^{n}(y_i-a-bx_i)=0\], then \[\sum_{i=1}^{n}y_i=na+b\sum_{i=1}^{n}x_i\] These two equations in matrix form are: \[ \begin{bmatrix} \displaystyle\sum_{i=1}^{n}x_i & \displaystyle\sum_{i=1}^{n}x_i^2\\ n & \displaystyle\sum_{i=1}^{n}x_i\\ \end{bmatrix} \begin{bmatrix} a\\ b \end{bmatrix} = \begin{bmatrix} \displaystyle\sum_{i=1}^{n}x_iy_i\\ \displaystyle\sum_{i=1}^{n}y_i \end{bmatrix} \] Then, using Cramer’s rule, \[\begin{align} b&=\frac{\begin{vmatrix} \displaystyle\sum_{i=1}^{n}x_i & \displaystyle\sum_{i=1}^{n}x_iy_i\\ n & \displaystyle\sum_{i=1}^{n}y_i\\ \end{vmatrix}}{\begin{vmatrix} \displaystyle\sum_{i=1}^{n}x_i & \displaystyle\sum_{i=1}^{n}x_i^2\\ n & \displaystyle\sum_{i=1}^{n}x_i\\ \end{vmatrix}}\\ &=\frac{(\displaystyle\sum_{i=1}^{n}x_i)(\displaystyle\sum_{i=1}^{n}y_i)-n(\displaystyle\sum_{i=1}^{n}x_iy_i)}{(\displaystyle\sum_{i=1}^{n}x_i)^2-n\displaystyle\sum_{i=1}^{n}x_i^2}\\ &=\frac{n(\displaystyle\sum_{i=1}^{n}x_iy_i)-(\displaystyle\sum_{i=1}^{n}x_i)(\displaystyle\sum_{i=1}^{n}y_i)}{n\displaystyle\sum_{i=1}^{n}x_i^2-(\displaystyle\sum_{i=1}^{n}x_i)^2}\\ &=\frac{(\displaystyle\sum_{i=1}^{n}x_iy_i)-\frac{1}{n}(\displaystyle\sum_{i=1}^{n}x_i)(\displaystyle\sum_{i=1}^{n}y_i)}{\displaystyle\sum_{i=1}^{n}x_i^2-\frac{1}{n}(\displaystyle\sum_{i=1}^{n}x_i)^2} \end{align}\] and \(a=\frac{\displaystyle\sum_{i=1}^{n}y_i-b\sum_{i=1}^{n}x_i}{n}=\bar y-b\bar x\), which shows that the point \((\bar x, \bar y)\) lies on the line.

We call \(\hat y_i=a+bx_i\) the predicted value of \(y_i\), and \(y_i-\hat y_i\) the \(i^{th}\) residual.
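As a quick numerical check of these closed-form expressions (a minimal sketch assuming NumPy is installed; the data points below are made up for illustration), \(b\) and \(a\) can be computed directly and compared with `np.polyfit`:

```python
import numpy as np

# Illustrative data points (made up for this sketch).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = len(x)

# Closed-form least squares slope and intercept from the derivation above.
b = (np.sum(x * y) - np.sum(x) * np.sum(y) / n) / (np.sum(x ** 2) - np.sum(x) ** 2 / n)
a = y.mean() - b * x.mean()

y_hat = a + b * x        # predicted values
residuals = y - y_hat    # i-th residuals

print(a, b)
# Cross-check against NumPy's degree-1 polynomial fit (returns [slope, intercept]).
print(np.allclose([b, a], np.polyfit(x, y, 1)))
```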

  • Univariate linear regression models:
    When there are \(n\) response variables \(Y_i, i=1,2,\cdots,n\) and \(r\) predictor variables \(Z_{ij},j=1,2,\cdots,r\) for each response variable: \[\begin{bmatrix} Y_1\\ Y_2\\ \vdots\\ Y_n\\ \end{bmatrix}=\begin{bmatrix} 1&Z_{11}&Z_{12}&\cdots&Z_{1r}\\ 1&Z_{21}&Z_{22}&\cdots&Z_{2r}\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 1&Z_{n1}&Z_{n2}&\cdots&Z_{nr}\\ \end{bmatrix}\begin{bmatrix} \beta_0\\ \beta_1\\ \vdots\\ \beta_r\\ \end{bmatrix}+\begin{bmatrix} \epsilon_1\\ \epsilon_2\\ \vdots\\ \epsilon_n\\ \end{bmatrix}\] or \(\mathbf Y=\mathbf Z\boldsymbol\beta+\boldsymbol\epsilon\), where \(E(\boldsymbol\epsilon)=\boldsymbol0\), \(Cov(\boldsymbol\epsilon)=E(\boldsymbol\epsilon\boldsymbol\epsilon^T)=\sigma^2\mathbf I\). We have to find the regression coefficients \(\boldsymbol\beta\) and the error variance \(\sigma^2\) that are consistent with the available data. Denote by \(\hat{\boldsymbol\beta}\) the least squares estimate of \(\boldsymbol\beta\); then \(\hat{\boldsymbol y}=\mathbf Z\hat{\boldsymbol\beta}\) is the projection of the vector \(\boldsymbol y\) onto the column space of \(\mathbf Z\), so \(\boldsymbol y-\hat{\boldsymbol y}=\boldsymbol y-\mathbf Z\hat{\boldsymbol\beta}\) is perpendicular to the column space of \(\mathbf Z\), and \(\mathbf Z^T(\boldsymbol y-\mathbf Z\hat{\boldsymbol\beta})=\mathbf Z^T\boldsymbol y-\mathbf Z^T\mathbf Z\hat{\boldsymbol\beta}=\boldsymbol0\Rightarrow\mathbf Z^T\boldsymbol y=\mathbf Z^T\mathbf Z\hat{\boldsymbol\beta}\). \(\mathbf Z^T\mathbf Z\) is an \((r+1)\times(r+1)\) symmetric matrix. Because the columns of \(\mathbf Z^T\mathbf Z\) are linear combinations of the columns of \(\mathbf Z^T\) and its rows are linear combinations of the rows of \(\mathbf Z\), \(rank(\mathbf Z^T\mathbf Z)\le rank(\mathbf Z)\le\min(n,r+1)\); so only when \(r+1\le n\) and \(\mathbf Z\) has full column rank can \(rank(\mathbf Z^T\mathbf Z)=r+1\), making \(\mathbf Z^T\mathbf Z\) invertible.
    Then \(\hat{\boldsymbol\beta}=(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T\boldsymbol y\) and \(\hat{\boldsymbol y}=\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T\boldsymbol y\), so \(\boldsymbol y-\hat{\boldsymbol y}=(\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T)\boldsymbol y\). Every column vector of \(\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T\) is perpendicular to the column space of \(\mathbf Z\), because \(\mathbf Z^T(\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T)=\mathbf0\). Moreover \((\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T)(\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T)=\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T\), so \(\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T\) is an idempotent matrix. Therefore \[\begin{align} tr[\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T]&=tr[\mathbf 1]-tr[\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T]\\ &=tr[\mathbf 1]-tr[\mathbf Z^T\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}]\\ &=tr[\underset{n\times n}{\mathbf 1}]-tr[\underset{(r+1)\times (r+1)}{\mathbf 1}]\\ &=n-r-1 \end{align}\]
    Because \(\hat{\boldsymbol\epsilon}=\boldsymbol y-\hat{\boldsymbol y}\) is perpendicular to the column space of \(\mathbf Z\), we have \(\mathbf Z^T\hat{\boldsymbol\epsilon}=\boldsymbol0\) and \(\boldsymbol1^T\hat{\boldsymbol\epsilon}=\displaystyle\sum_{i=1}^{n}\hat{\epsilon}_i=\displaystyle\sum_{i=1}^{n}y_i-\displaystyle\sum_{i=1}^{n}\hat y_i=n\bar y-n\bar{\hat y}=0\), so \(\bar y=\bar{\hat y}\). Also \(\mathbf y^T\mathbf y=(\hat{\mathbf y}+\mathbf y-\hat{\mathbf y})^T(\hat{\mathbf y}+\mathbf y-\hat{\mathbf y})=(\hat{\mathbf y}+\hat{\boldsymbol\epsilon})^T(\hat{\mathbf y}+\hat{\boldsymbol\epsilon})=\hat{\mathbf y}^T\hat{\mathbf y}+\hat{\boldsymbol\epsilon}^T\hat{\boldsymbol\epsilon}\), which can also be obtained from the Pythagorean theorem. The sum of squares decomposition is then \(\mathbf y^T\mathbf y-n(\bar y)^2=\hat{\mathbf y}^T\hat{\mathbf y}-n(\bar{\hat y})^2+\hat{\boldsymbol\epsilon}^T\hat{\boldsymbol\epsilon}\) or \(\underset{SS_{total}}{\displaystyle\sum_{i=1}^{n}(y_i-\bar y)^2}=\underset{SS_{regression}}{\displaystyle\sum_{i=1}^{n}(\hat y_i-\bar{\hat y})^2}+\underset{SS_{error}}{\displaystyle\sum_{i=1}^{n}\hat\epsilon_i^2}\). The quality of the model's fit can be measured by the coefficient of determination \(R^2=\frac{\displaystyle\sum_{i=1}^{n}(\hat y_i-\bar{\hat y})^2}{\displaystyle\sum_{i=1}^{n}(y_i-\bar y)^2}\). These quantities are computed in the numerical sketch below.
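A minimal numerical sketch of this fit (assuming NumPy; the design matrix and responses are simulated for illustration) solves the normal equations, verifies the orthogonality and trace properties of \(\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T\), and computes \(R^2\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: n observations, r predictors (values simulated for this sketch).
n, r = 50, 3
Z = np.column_stack([np.ones(n), rng.normal(size=(n, r))])   # design matrix with intercept column
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = Z @ beta_true + rng.normal(scale=0.3, size=n)

# Least squares estimate from the normal equations  Z^T Z beta_hat = Z^T y.
beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)

y_hat = Z @ beta_hat                  # projection of y onto the column space of Z
eps_hat = y - y_hat                   # residuals, perpendicular to the columns of Z
print(np.allclose(Z.T @ eps_hat, 0))  # True (up to rounding)

# Hat matrix H = Z (Z^T Z)^{-1} Z^T; I - H is idempotent with trace n - r - 1.
H = Z @ np.linalg.solve(Z.T @ Z, Z.T)
print(np.isclose(np.trace(np.eye(n) - H), n - r - 1))

# Coefficient of determination R^2.
R2 = np.sum((y_hat - y_hat.mean()) ** 2) / np.sum((y - y.mean()) ** 2)
print(R2)
```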

  • Inferences About the Regression Model:

  1. Because \(\hat{\boldsymbol\beta}=(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T\mathbf Y\) and \(\mathbf Y=\mathbf Z\boldsymbol\beta+\boldsymbol\epsilon\) with \(E(\boldsymbol\epsilon)=\boldsymbol0\),
    \(E(\hat{\boldsymbol\beta})=E((\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T(\mathbf Z\boldsymbol\beta+\boldsymbol\epsilon))=\boldsymbol\beta+(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^TE(\boldsymbol\epsilon)=\boldsymbol\beta\), so \(\hat{\boldsymbol\beta}\) is an unbiased estimator of \(\boldsymbol\beta\).
  2. If \(\boldsymbol\epsilon\) is distributed as \(N_n(\boldsymbol0,\sigma^2\mathbf I)\), then \(Cov(\mathbf Y)=Cov(\boldsymbol\epsilon)=\sigma^2\mathbf I\) and \(Cov(\hat{\boldsymbol\beta})=(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^TCov(\mathbf Y)\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}=\sigma^2(\mathbf Z^T\mathbf Z)^{-1}\), so \(\hat{\boldsymbol\beta}\) is distributed as \(N_{(r+1)}(\boldsymbol\beta,\sigma^2(\mathbf Z^T\mathbf Z)^{-1})\)
    Because \(E(\hat{\boldsymbol\epsilon})=\boldsymbol 0\) and \(Cov(\hat{\boldsymbol\epsilon})=Cov((\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T)\boldsymbol y)=(\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T)\sigma^2\), and because \(\hat{\boldsymbol\epsilon}=(\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T)\boldsymbol y=(\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T)\boldsymbol\epsilon\) (the matrix \(\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T\) annihilates \(\mathbf Z\boldsymbol\beta\)), \[\begin{align} E(\hat{\boldsymbol\epsilon}^T\hat{\boldsymbol\epsilon})&=E\bigl(\boldsymbol\epsilon^T(\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T)^T(\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T)\boldsymbol\epsilon\bigr)\\ &=E\bigl(\boldsymbol\epsilon^T(\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T)\boldsymbol\epsilon\bigr)\\ &=E\bigl(tr[(\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T)\boldsymbol\epsilon\boldsymbol\epsilon^T]\bigr)\\ &=tr[(\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T)E(\boldsymbol\epsilon\boldsymbol\epsilon^T)]\\ &=\sigma^2tr[\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T]\\ &=(n-r-1)\sigma^2\\ \end{align}\] so defining \[s^2=\frac{\hat{\boldsymbol\epsilon}^T\hat{\boldsymbol\epsilon}}{n-r-1}=\frac{\boldsymbol y^T(\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T)\boldsymbol y}{n-r-1}\] gives \(E(s^2)=\sigma^2\).
  3. If \(\hat\sigma^2\) is the maximum likelihood estimator of \(\sigma^2\), then \(n\hat\sigma^2=(n-r-1)s^2=\hat{\boldsymbol\epsilon}^T\hat{\boldsymbol\epsilon}\) is distributed as \(\sigma^2\chi_{n-r-1}^2\). The likelihood associated with the parameters \(\boldsymbol\beta\) and \(\sigma^2\) is \(L(\boldsymbol\beta,\sigma^2)=\frac{1}{(2\pi)^{n/2}\sigma^n}\exp\Bigl[-\frac{(\mathbf y-\mathbf Z\boldsymbol\beta)^T(\mathbf y-\mathbf Z\boldsymbol\beta)}{2\sigma^2}\Bigr]\), with the maximum occurring at \(\hat{\boldsymbol\beta}=(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T\boldsymbol y\) and \(\hat\sigma^2=(\mathbf y-\mathbf Z\hat{\boldsymbol\beta})^T(\mathbf y-\mathbf Z\hat{\boldsymbol\beta})/n\), so the maximum likelihood is \(L(\hat{\boldsymbol\beta},\hat\sigma^2)=\displaystyle\frac{1}{(2\pi)^{n/2}\hat\sigma^n}e^{-n/2}\)
  4. Let \(\mathbf V=(\mathbf Z^T\mathbf Z)^{1/2}(\hat{\boldsymbol\beta}-\boldsymbol\beta)\), then \(E(\mathbf V)=\mathbf0\) and \(Cov(\mathbf V)=(\mathbf Z^T\mathbf Z)^{1/2}Cov(\hat{\boldsymbol\beta}-\boldsymbol\beta)(\mathbf Z^T\mathbf Z)^{1/2}=(\mathbf Z^T\mathbf Z)^{1/2}Cov(\hat{\boldsymbol\beta})(\mathbf Z^T\mathbf Z)^{1/2}=(\mathbf Z^T\mathbf Z)^{1/2}\sigma^2(\mathbf Z^T\mathbf Z)^{-1}(\mathbf Z^T\mathbf Z)^{1/2}=\sigma^2\mathbf I\), so \(\mathbf V\) is distributed as \(N_{r+1}(\boldsymbol0,\sigma^2\mathbf I)\) and \(\mathbf V^T\mathbf V=(\hat{\boldsymbol\beta}-\boldsymbol\beta)^T(\mathbf Z^T\mathbf Z)^{1/2}(\mathbf Z^T\mathbf Z)^{1/2}(\hat{\boldsymbol\beta}-\boldsymbol\beta)=(\hat{\boldsymbol\beta}-\boldsymbol\beta)^T(\mathbf Z^T\mathbf Z)(\hat{\boldsymbol\beta}-\boldsymbol\beta)\) is distributed as \(\sigma^2\chi_{r+1}^2\). Since \(\hat{\boldsymbol\beta}\) and \(s^2\) are independent, \(\frac{\mathbf V^T\mathbf V/(r+1)}{s^2}=\frac{\chi_{r+1}^2/(r+1)}{\chi_{n-r-1}^2/(n-r-1)}\) has an \(F_{(r+1),(n-r-1)}\) distribution. Then a \(100(1-\alpha)\%\) confidence region for \(\boldsymbol\beta\) is given by: \((\boldsymbol\beta-\hat{\boldsymbol\beta})^T(\mathbf Z^T\mathbf Z)(\boldsymbol\beta-\hat{\boldsymbol\beta})\le(r+1)s^2F_{(r+1),(n-r-1)}(\alpha)\). Also, for each component of \(\boldsymbol\beta\), \(|\beta_i-\hat{\beta}_i|\le\sqrt{(r+1)F_{(r+1),(n-r-1)}(\alpha)}\sqrt{s^2(\mathbf Z^T\mathbf Z)_{ii}^{-1}}\), where \((\mathbf Z^T\mathbf Z)_{ii}^{-1}\) is the \(i^{th}\) diagonal element of \((\mathbf Z^T\mathbf Z)^{-1}\). Then the simultaneous \(100(1-\alpha)\%\) confidence intervals for the \(\beta_i\) are given by: \(\hat{\beta}_i\pm\sqrt{(r+1)F_{(r+1),(n-r-1)}(\alpha)}\sqrt{s^2(\mathbf Z^T\mathbf Z)_{ii}^{-1}}\)
  5. If some of the predictor variables in \(\mathbf Z=[z_0,z_1,z_2,\cdots,z_q,z_{q+1},z_{q+2},\cdots,z_{r}]\), say \(z_{q+1},z_{q+2},\cdots,z_{r}\), do not influence \(\mathbf y\), the hypothesis is \(H_0:\beta_{q+1}=\beta_{q+2}=\cdots=\beta_{r}=0\). We can express the general linear model as \[\mathbf y=\mathbf Z\boldsymbol\beta+\boldsymbol\epsilon=\left[\begin{array}{c:c} \underset{n\times(q+1)}{\mathbf Z_1}&\underset{n\times(r-q)}{\mathbf Z_2} \end{array} \right]\begin{bmatrix} \underset{(q+1)\times1}{\boldsymbol \beta_{(1)}}\\ \hdashline \underset{(r-q)\times1}{\boldsymbol \beta_{(2)}}\\ \end{bmatrix}+\boldsymbol\epsilon=\mathbf Z_1\boldsymbol\beta_{(1)}+\mathbf Z_2\boldsymbol\beta_{(2)}+\boldsymbol\epsilon\] Now the hypothesis is \(H_0:\boldsymbol\beta_{(2)}=\boldsymbol0,\mathbf y=\mathbf Z_1\boldsymbol\beta_{(1)}+\boldsymbol\epsilon\). Then \[\begin{align} \text{Extra sum of squares}&=SS_{\text{error}}(\mathbf Z_1)-SS_{\text{error}}(\mathbf Z)\\ &=(\mathbf y-\mathbf Z_1\hat{\boldsymbol\beta}_{(1)})^T(\mathbf y-\mathbf Z_1\hat{\boldsymbol\beta}_{(1)})-(\mathbf y-\mathbf Z\hat{\boldsymbol\beta})^T(\mathbf y-\mathbf Z\hat{\boldsymbol\beta}) \end{align}\] where \(\hat{\boldsymbol\beta}_{(1)}=(\mathbf Z_1^T\mathbf Z_1)^{-1}\mathbf Z_1^T\mathbf y\). Under the restriction of the null hypothesis, the maximum likelihood is \(\underset{\boldsymbol\beta_{(1)},\sigma^2}{\text{max }}L(\boldsymbol\beta_{(1)},\sigma^2)=\displaystyle\frac{1}{(2\pi)^{n/2}\hat\sigma_1^n}e^{-n/2}\), where the maximum occurs at \(\hat{\boldsymbol\beta}_{(1)}=(\mathbf Z_1^T\mathbf Z_1)^{-1}\mathbf Z_1^T\boldsymbol y\) and \(\hat\sigma_1^2=(\mathbf y-\mathbf Z_1\hat{\boldsymbol\beta}_{(1)})^T(\mathbf y-\mathbf Z_1\hat{\boldsymbol\beta}_{(1)})/n\). Reject \(H_0:\boldsymbol\beta_{(2)}=\boldsymbol0\) for small values of the likelihood ratio: \[\frac{\underset{\boldsymbol\beta_{(1)},\sigma^2}{\text{max }}L(\boldsymbol\beta_{(1)},\sigma^2)}{\underset{\boldsymbol\beta,\sigma^2}{\text{max }}L(\boldsymbol\beta,\sigma^2)}=\Bigl(\frac{\hat\sigma_1}{\hat\sigma}\Bigr)^{-n}=\Bigl(\frac{\hat\sigma^2_1}{\hat\sigma^2}\Bigr)^{-n/2}=\Bigl(1+\frac{\hat\sigma^2_1-\hat\sigma^2}{\hat\sigma^2}\Bigr)^{-n/2}\] which is equivalent to rejecting \(H_0\) when \(\frac{\hat\sigma^2_1-\hat\sigma^2}{\hat\sigma^2}\) is large, or when \[\frac{n(\hat\sigma^2_1-\hat\sigma^2)/(r-q)}{n\hat\sigma^2/(n-r-1)}=\frac{(SS_{error}(\mathbf Z_1)-SS_{error}(\mathbf Z))/(r-q)}{s^2}\] is large; under \(H_0\) this statistic has an \(F_{(r-q),(n-r-1)}\) distribution (this test is computed numerically in the sketch after this list).
  6. Let \(y_0\) denote the response value when the predictor variables have values \(\mathbf z^T_0=[1,z_{01},z_{02},\cdots,z_{0r}]\); then \(E(y_0|\mathbf z_0)=\beta_0+z_{01}\beta_1+\cdots+z_{0r}\beta_r=\mathbf z^T_0\boldsymbol\beta\), and its least squares estimate is \(\mathbf z^T_0\hat{\boldsymbol\beta}\). Because \(Cov(\hat{\boldsymbol\beta})=\sigma^2(\mathbf Z^T\mathbf Z)^{-1}\) and \(\hat{\boldsymbol\beta}\) is distributed as \(N_{(r+1)}(\boldsymbol\beta,\sigma^2(\mathbf Z^T\mathbf Z)^{-1})\), we have \(Var(\mathbf z^T_0\hat{\boldsymbol\beta})=\mathbf z^T_0Cov(\hat{\boldsymbol\beta})\mathbf z_0=\sigma^2\mathbf z^T_0(\mathbf Z^T\mathbf Z)^{-1}\mathbf z_0\). Because \((n-r-1)s^2=\hat{\boldsymbol\epsilon}^T\hat{\boldsymbol\epsilon}\) is distributed as \(\sigma^2\chi_{n-r-1}^2\), \(\frac{s^2}{\sigma^2}\) is distributed as \(\chi_{n-r-1}^2/(n-r-1)\). Consequently, the linear combination \(\mathbf z^T_0\hat{\boldsymbol\beta}\) is distributed as \(N(\mathbf z^T_0\boldsymbol\beta,\sigma^2\mathbf z^T_0(\mathbf Z^T\mathbf Z)^{-1}\mathbf z_0)\) and \(\displaystyle\frac{(\mathbf z^T_0\hat{\boldsymbol\beta}-\mathbf z^T_0\boldsymbol\beta)/\sqrt{\sigma^2\mathbf z^T_0(\mathbf Z^T\mathbf Z)^{-1}\mathbf z_0}}{\sqrt{\frac{s^2}{\sigma^2}}}\) is distributed as \(t_{n-r-1}\). Then a \(100(1-\alpha)\%\) confidence interval for \(E(y_0|\mathbf z_0)=\mathbf z^T_0\boldsymbol\beta\) is given by: \(\mathbf z^T_0\hat{\boldsymbol\beta}\pm t_{n-r-1}(\frac{\alpha}{2})\sqrt{\frac{s^2}{\sigma^2}}\sqrt{\sigma^2\mathbf z^T_0(\mathbf Z^T\mathbf Z)^{-1}\mathbf z_0}=\mathbf z^T_0\hat{\boldsymbol\beta}\pm t_{n-r-1}(\frac{\alpha}{2})\sqrt{s^2\mathbf z^T_0(\mathbf Z^T\mathbf Z)^{-1}\mathbf z_0}\)
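The pieces above (the \(s^2\) estimate, the simultaneous intervals for the \(\beta_i\), the extra-sum-of-squares \(F\) test, and the confidence interval for a mean response) fit into a short numerical sketch. It assumes NumPy and SciPy; the simulated data, the dropped predictor, and \(\mathbf z_0\) are all made up for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Illustrative data: n trials, r = 3 predictors; the last predictor has no real effect.
n, r = 60, 3
Z = np.column_stack([np.ones(n), rng.normal(size=(n, r))])
beta_true = np.array([1.0, 2.0, -1.0, 0.0])
y = Z @ beta_true + rng.normal(scale=0.5, size=n)

ZtZ_inv = np.linalg.inv(Z.T @ Z)
beta_hat = ZtZ_inv @ Z.T @ y
eps_hat = y - Z @ beta_hat
s2 = eps_hat @ eps_hat / (n - r - 1)       # unbiased estimate of sigma^2

# Simultaneous 100(1-alpha)% confidence intervals for the beta_i (F-based region).
alpha = 0.05
crit = np.sqrt((r + 1) * stats.f.ppf(1 - alpha, r + 1, n - r - 1))
half = crit * np.sqrt(s2 * np.diag(ZtZ_inv))
print(np.column_stack([beta_hat - half, beta_hat + half]))

# Extra-sum-of-squares F test of H0: beta_3 = 0 (drop the last column of Z).
q = 2                                      # keep beta_0, ..., beta_q
Z1 = Z[:, :q + 1]
beta1_hat = np.linalg.solve(Z1.T @ Z1, Z1.T @ y)
SSe_full = eps_hat @ eps_hat
SSe_reduced = np.sum((y - Z1 @ beta1_hat) ** 2)
F = ((SSe_reduced - SSe_full) / (r - q)) / s2
p_value = stats.f.sf(F, r - q, n - r - 1)
print(F, p_value)

# 95% confidence interval for the mean response E(y0 | z0) at a new z0 (z0 made up).
z0 = np.array([1.0, 0.5, -0.2, 1.0])
center = z0 @ beta_hat
half_mean = stats.t.ppf(1 - alpha / 2, n - r - 1) * np.sqrt(s2 * z0 @ ZtZ_inv @ z0)
print(center - half_mean, center + half_mean)
```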
  • Multivariate linear regression:
    When there are \(m\) regression coefficient vectors \(\boldsymbol\beta_1,\cdots,\boldsymbol\beta_m\), each with \(r+1\) components \([\beta_0,\beta_1,\cdots,\beta_r]\), the response variables are: \[\underset{(n\times m)}{\underbrace{\begin{bmatrix} Y_{11}&Y_{12}&\cdots&Y_{1m}\\ Y_{21}&Y_{22}&\cdots&Y_{2m}\\ \vdots&\vdots&\ddots&\vdots\\ Y_{n1}&Y_{n2}&\cdots&Y_{nm}\\ \end{bmatrix}}}=\underset{(n\times (r+1))}{\underbrace{\begin{bmatrix} 1&Z_{11}&Z_{12}&\cdots&Z_{1r}\\ 1&Z_{21}&Z_{22}&\cdots&Z_{2r}\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 1&Z_{n1}&Z_{n2}&\cdots&Z_{nr}\\ \end{bmatrix}}}\underset{((r+1)\times m)}{\underbrace{\begin{bmatrix} \beta_{01}&\beta_{02}&\cdots&\beta_{0m}\\ \beta_{11}&\beta_{12}&\cdots&\beta_{1m}\\ \vdots&\vdots&\ddots&\vdots\\ \beta_{r1}&\beta_{r2}&\cdots&\beta_{rm}\\ \end{bmatrix}}}+\underset{(n\times m)}{\underbrace{\begin{bmatrix} \epsilon_{11}&\epsilon_{12}&\cdots&\epsilon_{1m}\\ \epsilon_{21}&\epsilon_{22}&\cdots&\epsilon_{2m}\\ \vdots&\vdots&\ddots&\vdots\\ \epsilon_{n1}&\epsilon_{n2}&\cdots&\epsilon_{nm}\\ \end{bmatrix}}}\] or \(\mathbf Y=\mathbf Z\boldsymbol\beta+\boldsymbol\epsilon\). The \(j^{th}\) row of \(\mathbf Y\) contains the \(m\) observations from the \(j^{th}\) trial, one for each of the \(m\) regression coefficient vectors. The \(i^{th}\) column of \(\mathbf Y\) contains the \(n\) observations of the \(i^{th}\) response across all \(n\) trials, which are regressed on \(\mathbf Z\) using the \(i^{th}\) regression coefficient vector \(\boldsymbol\beta_i\); likewise the \(i^{th}\) column of \(\boldsymbol\epsilon\) contains the \(n\) errors of that regression, \(\mathbf Y_i=\mathbf Z\boldsymbol\beta_i+\boldsymbol\epsilon_i\)
    Any two columns of \(\boldsymbol\epsilon\), \(\boldsymbol\epsilon_i\) and \(\boldsymbol\epsilon_k\), satisfy \(E(\boldsymbol\epsilon_i)=\boldsymbol0\) and \(Cov(\boldsymbol\epsilon_i,\boldsymbol\epsilon_k)=\sigma_{ik}\mathbf I,\quad i,k=1,2,\cdots,m\); errors from different trials are uncorrelated, while the \(m\) errors within the \(j^{th}\) trial have covariance matrix \(\underset{m\times m}{\boldsymbol\Sigma}=\{\sigma_{ik}\}\), as illustrated below.
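A quick simulation (assuming NumPy; the \(\boldsymbol\Sigma\) below is made up) illustrates this error structure: entries from the same trial are correlated according to \(\boldsymbol\Sigma\), while entries from different trials are uncorrelated:

```python
import numpy as np

rng = np.random.default_rng(2)

# Within each trial the m errors have covariance Sigma; across trials they are uncorrelated.
n, m = 5000, 2
Sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])
eps = rng.multivariate_normal(np.zeros(m), Sigma, size=n)   # n x m error matrix

# Empirical Cov(eps_i, eps_k) for columns i, k is approximately sigma_ik * I:
# same-trial products average near sigma_ik, different-trial products average near 0.
print(np.mean(eps[:, 0] * eps[:, 1]), Sigma[0, 1])   # close to sigma_12 = 0.6
print(np.mean(eps[:-1, 0] * eps[1:, 1]))             # different trials: close to 0
```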

  • As in the univariate linear regression model, \(\hat{\boldsymbol\beta}_i=(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T\mathbf Y_i\) and \(\hat{\boldsymbol\beta}=(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T\mathbf Y\)
    Predicted values: \(\hat{\mathbf Y}=\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T\mathbf Y\)
    Residuals: \(\hat{\boldsymbol\epsilon}=\mathbf Y-\hat{\mathbf Y}=(\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T)\mathbf Y\). Because \(\mathbf Z^T\hat{\boldsymbol\epsilon}=\mathbf Z^T(\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T)\mathbf Y=\mathbf0\), the residual columns \(\hat{\boldsymbol\epsilon}_i\) are perpendicular to the columns of \(\mathbf Z\); and because \(\hat{\mathbf Y}^T\hat{\boldsymbol\epsilon}=(\mathbf Z\hat{\boldsymbol\beta})^T(\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T)\mathbf Y=\hat{\boldsymbol\beta}^T\mathbf Z^T(\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T)\mathbf Y=\mathbf0\), the predicted value columns \(\hat{\mathbf Y}_i\) are perpendicular to all residual vectors \(\hat{\boldsymbol\epsilon}_i\). So \(\mathbf Y^T\mathbf Y=(\hat{\mathbf Y}+\hat{\boldsymbol\epsilon})^T(\hat{\mathbf Y}+\hat{\boldsymbol\epsilon})=\hat{\mathbf Y}^T\hat{\mathbf Y}+\hat{\boldsymbol\epsilon}^T\hat{\boldsymbol\epsilon}+\mathbf0+\mathbf0\), or \(\mathbf Y^T\mathbf Y=\hat{\mathbf Y}^T\hat{\mathbf Y}+\hat{\boldsymbol\epsilon}^T\hat{\boldsymbol\epsilon}\) (see the sketch below).
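A minimal numerical sketch of the multivariate fit (assuming NumPy; \(\mathbf Z\), \(\boldsymbol\beta\), and \(\boldsymbol\Sigma\) are simulated for illustration) checks these orthogonality relations:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative multivariate fit: n trials, r predictors, m responses (data simulated).
n, r, m = 40, 2, 3
Z = np.column_stack([np.ones(n), rng.normal(size=(n, r))])
beta_true = rng.normal(size=(r + 1, m))
Sigma = np.array([[1.0, 0.3, 0.0],
                  [0.3, 0.5, 0.2],
                  [0.0, 0.2, 2.0]])
eps = rng.multivariate_normal(np.zeros(m), Sigma, size=n)
Y = Z @ beta_true + eps

# Least squares: each column of beta_hat is the univariate solution for that response.
beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)      # (r+1) x m
Y_hat = Z @ beta_hat
eps_hat = Y - Y_hat

print(np.allclose(Z.T @ eps_hat, 0))              # residual columns are orthogonal to col(Z)
print(np.allclose(Y_hat.T @ eps_hat, 0))          # fitted columns are orthogonal to residuals
print(np.allclose(Y.T @ Y, Y_hat.T @ Y_hat + eps_hat.T @ eps_hat))
```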

  1. Because \(\mathbf Y_i=\mathbf Z\boldsymbol\beta_i+\boldsymbol\epsilon_i\), we have \(\hat{\boldsymbol\beta}_i-\boldsymbol\beta_i=(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T\mathbf Y_i-\boldsymbol\beta_i=(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T(\mathbf Z\boldsymbol\beta_i+\boldsymbol\epsilon_i)-\boldsymbol\beta_i=(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T\boldsymbol\epsilon_i\), so \(E(\hat{\boldsymbol\beta}_i-\boldsymbol\beta_i)=(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^TE(\boldsymbol\epsilon_i)=\mathbf0\), or \(E(\hat{\boldsymbol\beta}_i)=\boldsymbol\beta_i\). And because \(\hat{\boldsymbol\epsilon}_i=\mathbf Y_i-\hat{\mathbf Y}_i=(\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T)\mathbf Y_i=(\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T)(\mathbf Z\boldsymbol\beta_i+\boldsymbol\epsilon_i)=(\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T)\boldsymbol\epsilon_i\), we have \(E(\hat{\boldsymbol\epsilon}_i)=\mathbf0\). Moreover \[\begin{align} Cov(\hat{\boldsymbol\beta}_i,\hat{\boldsymbol\beta}_k)&=E(\hat{\boldsymbol\beta}_i-\boldsymbol\beta_i)(\hat{\boldsymbol\beta}_k-\boldsymbol\beta_k)^T\\ &=E((\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T\boldsymbol\epsilon_i)((\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T\boldsymbol\epsilon_k)^T\\ &=(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^TE(\boldsymbol\epsilon_i\boldsymbol\epsilon_k^T)\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\\ &=\sigma_{ik}(\mathbf Z^T\mathbf Z)^{-1} \end{align}\] Then \[\begin{align} E(\hat{\boldsymbol\epsilon}_i^T\hat{\boldsymbol\epsilon}_k)&=E\Biggl(((\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T)\boldsymbol\epsilon_i)^T((\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T)\boldsymbol\epsilon_k)\Biggr)\\ &=E(\boldsymbol\epsilon_i^T(\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T)\boldsymbol\epsilon_k)\\ &=E(tr(\boldsymbol\epsilon_i^T(\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T)\boldsymbol\epsilon_k))\\ &=E(tr((\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T)\boldsymbol\epsilon_k\boldsymbol\epsilon_i^T))\\ &=tr[(\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T)E(\boldsymbol\epsilon_k\boldsymbol\epsilon_i^T)]\\ &=tr[(\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T)\sigma_{ik}\mathbf I]\\ &=\sigma_{ik}tr[\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T]\\ &=\sigma_{ik}(n-r-1) \end{align}\] and, since \(E(\hat{\boldsymbol\epsilon}_k)=\mathbf0\), \[\begin{align} Cov(\hat{\boldsymbol\beta}_i,\hat{\boldsymbol\epsilon}_k)&=E(\hat{\boldsymbol\beta}_i-E(\hat{\boldsymbol\beta}_i))(\hat{\boldsymbol\epsilon}_k-E(\hat{\boldsymbol\epsilon}_k))^T\\ &=E(\hat{\boldsymbol\beta}_i-\boldsymbol\beta_i)\hat{\boldsymbol\epsilon}_k^T\\ &=E((\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T\boldsymbol\epsilon_i)((\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T)\boldsymbol\epsilon_k)^T\\ &=E((\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T\boldsymbol\epsilon_i\boldsymbol\epsilon_k^T(\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T))\\ &=(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^TE(\boldsymbol\epsilon_i\boldsymbol\epsilon_k^T)(\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T)\\ &=(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T\sigma_{ik}\mathbf I(\mathbf 1-\mathbf Z(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T)\\ &=\sigma_{ik}((\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T-(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T)\\ &=\mathbf0 \end{align}\] so each element of \(\hat{\boldsymbol\beta}\) is uncorrelated with each element of \(\hat{\boldsymbol\epsilon}\). A Monte Carlo check of these moment results is given in the sketch below.
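As a Monte Carlo sketch of these moment results (assuming NumPy; the true \(\boldsymbol\beta\) and \(\boldsymbol\Sigma\) are made up), averaging \(\hat{\boldsymbol\beta}\) and \(\hat{\boldsymbol\epsilon}^T\hat{\boldsymbol\epsilon}/(n-r-1)\) over many simulated data sets should recover \(\boldsymbol\beta\) and \(\boldsymbol\Sigma\):

```python
import numpy as np

rng = np.random.default_rng(4)

# Average beta_hat and eps_hat^T eps_hat / (n-r-1) over many simulated data sets.
n, r, m, reps = 30, 2, 2, 2000
Z = np.column_stack([np.ones(n), rng.normal(size=(n, r))])
beta = np.array([[1.0, -1.0], [0.5, 2.0], [0.0, 1.0]])       # (r+1) x m true coefficients
Sigma = np.array([[1.0, 0.4], [0.4, 0.8]])

sum_beta_hat = np.zeros_like(beta)
sum_ee = np.zeros((m, m))
for _ in range(reps):
    Y = Z @ beta + rng.multivariate_normal(np.zeros(m), Sigma, size=n)
    beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)
    eps_hat = Y - Z @ beta_hat
    sum_beta_hat += beta_hat
    sum_ee += eps_hat.T @ eps_hat

print(sum_beta_hat / reps)            # close to beta
print(sum_ee / reps / (n - r - 1))    # close to Sigma
```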

  2. \(\hat{\boldsymbol\beta}=(\mathbf Z^T\mathbf Z)^{-1}\mathbf Z^T\mathbf Y\) is the maximum likelihood estimator of \(\boldsymbol\beta\), and \(\hat{\boldsymbol\beta}\) is normally distributed with \(E(\hat{\boldsymbol\beta})=\boldsymbol\beta\) and \(Cov(\hat{\boldsymbol\beta}_i,\hat{\boldsymbol\beta}_k)=\sigma_{ik}(\mathbf Z^T\mathbf Z)^{-1}\). And \(\hat{\boldsymbol\Sigma}=\frac{1}{n}\hat{\boldsymbol\epsilon}^T\hat{\boldsymbol\epsilon}=\frac{1}{n}(\mathbf Y-\mathbf Z\hat{\boldsymbol\beta})^T(\mathbf Y-\mathbf Z\hat{\boldsymbol\beta})\) is the maximum likelihood estimator of \(\boldsymbol\Sigma\), where \(n\hat{\boldsymbol\Sigma}\) is distributed as \(W_{n-r-1}(\boldsymbol\Sigma)\), a Wishart distribution with \(n-r-1\) degrees of freedom. The likelihood associated with the parameters \(\boldsymbol\beta\) and \(\boldsymbol\Sigma\) attains its maximum at \(\hat{\boldsymbol\beta}\) and \(\hat{\boldsymbol\Sigma}\); writing \(\mathbf y_j\) and \(\mathbf z_j\) for the \(j^{th}\) rows of \(\mathbf Y\) and \(\mathbf Z\) (as column vectors), \[\begin{align} L(\hat{\boldsymbol\beta},\hat{\boldsymbol\Sigma})&=\prod_{j=1}^{n}\Biggl[\frac{1}{(2\pi)^{m/2}|\hat{\boldsymbol\Sigma}|^{1/2}}e^{-\frac{1}{2}(\mathbf y_j-\hat{\boldsymbol\beta}^T\mathbf z_j)^T\hat{\boldsymbol\Sigma}^{-1}(\mathbf y_j-\hat{\boldsymbol\beta}^T\mathbf z_j)}\Biggr]\\ &=\frac{1}{(2\pi)^{\frac{mn}{2}}|\hat{\boldsymbol\Sigma}|^{n/2}}e^{-\frac{1}{2}\sum_{j=1}^{n}(\mathbf y_j-\hat{\boldsymbol\beta}^T\mathbf z_j)^T\hat{\boldsymbol\Sigma}^{-1}(\mathbf y_j-\hat{\boldsymbol\beta}^T\mathbf z_j)}\\ &=\frac{1}{(2\pi)^{\frac{mn}{2}}|\hat{\boldsymbol\Sigma}|^{n/2}}e^{-\frac{1}{2}tr\Biggl[\hat{\boldsymbol\Sigma}^{-1}\sum_{j=1}^{n}(\mathbf y_j-\hat{\boldsymbol\beta}^T\mathbf z_j)(\mathbf y_j-\hat{\boldsymbol\beta}^T\mathbf z_j)^T\Biggr]}\\ &=\frac{1}{(2\pi)^{(mn/2)}|\hat{\boldsymbol\Sigma}|^{n/2}}e^{-\frac{1}{2}mn} \end{align}\] since \(\sum_{j=1}^{n}(\mathbf y_j-\hat{\boldsymbol\beta}^T\mathbf z_j)(\mathbf y_j-\hat{\boldsymbol\beta}^T\mathbf z_j)^T=n\hat{\boldsymbol\Sigma}\) and \(tr[\hat{\boldsymbol\Sigma}^{-1}n\hat{\boldsymbol\Sigma}]=mn\).
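A small numerical check of this closed form (assuming NumPy and SciPy; the data are simulated for illustration) evaluates the Gaussian log-likelihood at \((\hat{\boldsymbol\beta},\hat{\boldsymbol\Sigma})\) and compares it with \(-\frac{mn}{2}\log(2\pi)-\frac{n}{2}\log|\hat{\boldsymbol\Sigma}|-\frac{mn}{2}\):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(5)

# Simulated data for the check (dimensions and coefficients made up).
n, r, m = 40, 2, 3
Z = np.column_stack([np.ones(n), rng.normal(size=(n, r))])
Y = Z @ rng.normal(size=(r + 1, m)) + rng.multivariate_normal(np.zeros(m), np.eye(m), size=n)

beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)
resid = Y - Z @ beta_hat
Sigma_hat = resid.T @ resid / n                              # MLE of Sigma

# Log-likelihood summed over trials, evaluated at the maximum likelihood estimates.
loglik = sum(multivariate_normal.logpdf(Y[j], mean=Z[j] @ beta_hat, cov=Sigma_hat)
             for j in range(n))
closed_form = (-(m * n / 2) * np.log(2 * np.pi)
               - (n / 2) * np.log(np.linalg.det(Sigma_hat)) - m * n / 2)
print(np.isclose(loglik, closed_form))                       # True
```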

  3. If some of the predictor variables in \(\mathbf Z=[z_0,z_1,z_2,\cdots,z_q,z_{q+1},z_{q+2},\cdots,z_{r}]\), say \(z_{q+1},z_{q+2},\cdots,z_{r}\), do not influence \(\mathbf Y\), the hypothesis is \(H_0:\boldsymbol\beta_{q+1}=\boldsymbol\beta_{q+2}=\cdots=\boldsymbol\beta_{r}=\boldsymbol0\). We can express the general linear model as \[\mathbf Y=\mathbf Z\boldsymbol\beta+\boldsymbol\epsilon=\left[\begin{array}{c:c} \underset{n\times(q+1)}{\mathbf Z_1}&\underset{n\times(r-q)}{\mathbf Z_2} \end{array} \right]\begin{bmatrix} \underset{(q+1)\times m}{\boldsymbol\beta_{(1)}}\\ \hdashline \underset{(r-q)\times m}{\boldsymbol\beta_{(2)}}\\ \end{bmatrix}+\boldsymbol\epsilon=\mathbf Z_1\boldsymbol\beta_{(1)}+\mathbf Z_2\boldsymbol\beta_{(2)}+\boldsymbol\epsilon\] Now the hypothesis is \(H_0:\boldsymbol\beta_{(2)}=\boldsymbol0,\mathbf Y=\mathbf Z_1\boldsymbol\beta_{(1)}+\boldsymbol\epsilon\). Then \[\begin{align} \text{Extra sum of squares}&=SS_{\text{error}}(\mathbf Z_1)-SS_{\text{error}}(\mathbf Z)\\ &=(\mathbf Y-\mathbf Z_1\hat{\boldsymbol\beta}_{(1)})^T(\mathbf Y-\mathbf Z_1\hat{\boldsymbol\beta}_{(1)})-(\mathbf Y-\mathbf Z\hat{\boldsymbol\beta})^T(\mathbf Y-\mathbf Z\hat{\boldsymbol\beta}) \end{align}\] where \(\hat{\boldsymbol\beta}_{(1)}=(\mathbf Z_1^T\mathbf Z_1)^{-1}\mathbf Z_1^T\mathbf Y\). And \[\hat{\boldsymbol\Sigma}_1=\frac{1}{n}SS_{\text{error}}(\mathbf Z_1)=\frac{1}{n}(\mathbf Y-\mathbf Z_1\hat{\boldsymbol\beta}_{(1)})^T(\mathbf Y-\mathbf Z_1\hat{\boldsymbol\beta}_{(1)})\] Under the restriction of the null hypothesis, the maximum likelihood is \(\underset{\boldsymbol\beta_{(1)},\boldsymbol\Sigma_1}{\text{max }}L(\boldsymbol\beta_{(1)},\boldsymbol\Sigma_1)=\displaystyle\frac{1}{(2\pi)^{mn/2}|\hat{\boldsymbol\Sigma}_1|^{n/2}}e^{-mn/2}\), where the maximum occurs at \(\hat{\boldsymbol\beta}_{(1)}=(\mathbf Z_1^T\mathbf Z_1)^{-1}\mathbf Z_1^T\mathbf Y\) and \(\hat{\boldsymbol\Sigma}_1=\frac{1}{n}(\mathbf Y-\mathbf Z_1\hat{\boldsymbol\beta}_{(1)})^T(\mathbf Y-\mathbf Z_1\hat{\boldsymbol\beta}_{(1)})\). Reject \(H_0:\boldsymbol\beta_{(2)}=\boldsymbol0\) for small values of the likelihood ratio: \[\frac{\underset{\boldsymbol\beta_{(1)},\boldsymbol\Sigma_1}{\text{max }}L(\boldsymbol\beta_{(1)},\boldsymbol\Sigma_1)}{\underset{\boldsymbol\beta,\boldsymbol\Sigma}{\text{max }}L(\boldsymbol\beta,\boldsymbol\Sigma)}=\Bigl(\frac{|\hat{\boldsymbol\Sigma}_1|}{|\hat{\boldsymbol\Sigma}|}\Bigr)^{-n/2}=\Bigl(\frac{|\hat{\boldsymbol\Sigma}|}{|\hat{\boldsymbol\Sigma}_1|}\Bigr)^{n/2}=\Lambda\], where \(\Lambda^{2/n}=\frac{|\hat{\boldsymbol\Sigma}|}{|\hat{\boldsymbol\Sigma}_1|}\) is Wilks’ lambda statistic (computed numerically in the sketch below).
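A minimal sketch of this comparison (assuming NumPy; the dimensions, the true coefficients, and \(\boldsymbol\Sigma=\mathbf I\) are made up, with \(H_0\) true by construction) computes \(\Lambda^{2/n}=|\hat{\boldsymbol\Sigma}|/|\hat{\boldsymbol\Sigma}_1|\):

```python
import numpy as np

rng = np.random.default_rng(6)

# Full model uses all r predictors; the reduced model drops the last r - q columns of Z.
n, r, q, m = 60, 3, 1, 2
Z = np.column_stack([np.ones(n), rng.normal(size=(n, r))])
beta = np.zeros((r + 1, m))
beta[:q + 1] = rng.normal(size=(q + 1, m))      # predictors beyond z_q have no effect (H0 true)
Y = Z @ beta + rng.multivariate_normal(np.zeros(m), np.eye(m), size=n)

def sigma_hat(Zmat, Y):
    """MLE of Sigma for the model Y = Zmat beta + eps."""
    B = np.linalg.solve(Zmat.T @ Zmat, Zmat.T @ Y)
    E = Y - Zmat @ B
    return E.T @ E / len(Y)

Sigma_full = sigma_hat(Z, Y)
Sigma_reduced = sigma_hat(Z[:, :q + 1], Y)
wilks = np.linalg.det(Sigma_full) / np.linalg.det(Sigma_reduced)
print(wilks)   # Wilks' lambda Lambda^(2/n); values well below 1 favour rejecting H0
```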

  4. The response \(\mathbf y_0\) corresponding to fixed values \(\mathbf z^T_0=[1,z_{01},z_{02},\cdots,z_{0r}]\) of the predictor variables and the regression coefficient matrix \(\underset{(r+1)\times m}{\boldsymbol\beta}\) is \(\mathbf y_0=\boldsymbol\beta^T\mathbf z_0+\boldsymbol\epsilon_0\). Then \(E(\hat{\boldsymbol\beta}^T\mathbf z_0)=\boldsymbol\beta^T\mathbf z_0\) and \(Cov(\hat{\boldsymbol\beta}^T\mathbf z_0)=\mathbf z_0^T(\mathbf Z^T\mathbf Z)^{-1}\mathbf z_0\boldsymbol\Sigma\), so \(\underset{m\times 1}{(\hat{\boldsymbol\beta}^T\mathbf z_0)}\) is distributed as \(N_m(\boldsymbol\beta^T\mathbf z_0,\mathbf z_0^T(\mathbf Z^T\mathbf Z)^{-1}\mathbf z_0\boldsymbol\Sigma)\) and \(n\hat{\boldsymbol\Sigma}\) is independently distributed as \(W_{n-r-1}(\boldsymbol\Sigma)\). So the \(T^2\)-statistic is \[T^2=\Biggl(\frac{\hat{\boldsymbol\beta}^T\mathbf z_0-\boldsymbol\beta^T\mathbf z_0}{\sqrt{\mathbf z_0^T(\mathbf Z^T\mathbf Z)^{-1}\mathbf z_0}}\Biggr)^T\Biggl(\frac{n\hat{\boldsymbol\Sigma}}{n-r-1}\Biggr)^{-1}\Biggl(\frac{\hat{\boldsymbol\beta}^T\mathbf z_0-\boldsymbol\beta^T\mathbf z_0}{\sqrt{\mathbf z_0^T(\mathbf Z^T\mathbf Z)^{-1}\mathbf z_0}}\Biggr)\] and a \(100(1-\alpha)\%\) confidence ellipsoid for \(\boldsymbol\beta^T\mathbf z_0\) is provided by the inequality \[T^2\le\frac{m(n-r-1)}{n-r-m}F_{m,n-r-m}(\alpha)\] The \(100(1-\alpha)\%\) simultaneous confidence intervals for the components \(\mathbf z_0^T\boldsymbol\beta_i\) of \(\mathbf z_0^T\boldsymbol\beta\) are given by: \[\mathbf z_0^T\hat{\boldsymbol\beta_i}\pm\sqrt{\frac{m(n-r-1)}{n-r-m}F_{m,n-r-m}(\alpha)}\sqrt{\mathbf z_0^T(\mathbf Z^T\mathbf Z)^{-1}\mathbf z_0}\sqrt{\frac{n\hat{\sigma_{ii}}}{n-r-1}} \] \(i=1,2,\cdots,m\)
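A numerical sketch of these simultaneous intervals (assuming NumPy and SciPy; the simulated data and \(\mathbf z_0\) are made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Simultaneous 100(1-alpha)% intervals for the components of beta^T z0 at a new z0.
n, r, m = 50, 2, 3
Z = np.column_stack([np.ones(n), rng.normal(size=(n, r))])
beta = rng.normal(size=(r + 1, m))
Y = Z @ beta + rng.multivariate_normal(np.zeros(m), 0.2 * np.eye(m), size=n)

ZtZ_inv = np.linalg.inv(Z.T @ Z)
beta_hat = ZtZ_inv @ Z.T @ Y
Sigma_hat = (Y - Z @ beta_hat).T @ (Y - Z @ beta_hat) / n     # MLE of Sigma

alpha = 0.05
z0 = np.array([1.0, 0.3, -1.2])
center = beta_hat.T @ z0                                      # estimate of beta^T z0 (m-vector)
crit = np.sqrt(m * (n - r - 1) / (n - r - m) * stats.f.ppf(1 - alpha, m, n - r - m))
half = crit * np.sqrt(z0 @ ZtZ_inv @ z0) * np.sqrt(n * np.diag(Sigma_hat) / (n - r - 1))
print(np.column_stack([center - half, center + half]))
```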