
Inferences about the mean

Hypothesis testing about the mean is a test of the competing hypotheses \(H_0:\mu=\mu_0\) and \(H_1:\mu\ne\mu_0\). If \(X_1,X_2,\cdots,X_n\) denote a random sample from a normal population, the appropriate test statistic is \(t=\frac{(\overline X-\mu_0)}{s/\sqrt{n}}\) with \(s^2=\frac{1}{(n-1)}\displaystyle\sum_{i=1}^{n}(X_i-\overline X)^2\). Rejecting \(H_0\) when \(|t|\) is large is equivalent to rejecting \(H_0\) when \(t^2=\frac{(\overline X-\mu_0)^2}{s^2/n}=n(\overline X-\mu_0)(s^2)^{-1}(\overline X-\mu_0)\) is large, so the test rejects \(H_0\) in favor of \(H_1\) at significance level \(\alpha\) if \(n(\overline X-\mu_0)(s^2)^{-1}(\overline X-\mu_0)>t_{n-1}^2(\alpha/2)\). Its multivariate analog is Hotelling's \(T^2=(\overline {\mathbf X}-\boldsymbol\mu_0)^T(\frac{1}{n}\mathbf S)^{-1}(\overline {\mathbf X}-\boldsymbol\mu_0)=n(\overline {\mathbf X}-\boldsymbol\mu_0)^T\mathbf S^{-1}(\overline {\mathbf X}-\boldsymbol\mu_0)\), where \(\overline {\mathbf X}=\frac{1}{n}\displaystyle\sum_{j=1}^{n}\mathbf X_j\) and \(\underset{(p\times p)}{\mathbf S}=\frac{1}{n-1}\displaystyle\sum_{j=1}^{n}(\underset{(p\times 1)}{\mathbf X_j}-\underset{(p\times 1)}{\overline {\mathbf X}})(\underset{(p\times 1)}{\mathbf X_j}-\underset{(p\times 1)}{\overline {\mathbf X}})^T\).
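As a concrete illustration of the statistic (not part of the original derivation), here is a minimal Python sketch that computes \(T^2\) for a small made-up two-variable sample; it assumes NumPy is available, and the data and \(\boldsymbol\mu_0\) are chosen purely for demonstration.

```python
# Minimal sketch: Hotelling's T^2 for a made-up sample (n = 3, p = 2).
import numpy as np

X = np.array([[6.0, 9.0],
              [10.0, 6.0],
              [8.0, 3.0]])        # illustrative observations, one row per observation
mu0 = np.array([9.0, 5.0])        # hypothesized mean vector under H0

n, p = X.shape
xbar = X.mean(axis=0)             # sample mean vector X-bar
S = np.cov(X, rowvar=False)       # sample covariance S (divisor n - 1)

diff = xbar - mu0
T2 = n * diff @ np.linalg.solve(S, diff)   # n (X-bar - mu0)^T S^{-1} (X-bar - mu0)
print(T2)                                  # approximately 0.78 for this toy sample
```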

When \(\mathbf\Sigma\) is known, \(n(\overline {\mathbf X}-\boldsymbol\mu)^T\mathbf\Sigma^{-1}(\overline {\mathbf X}-\boldsymbol\mu)\) is distributed as \(\chi_p^2\); when \(\mathbf\Sigma\) is replaced by \(\mathbf S\), the statistic \(T^2=n(\overline {\mathbf X}-\boldsymbol\mu_0)^T\mathbf S^{-1}(\overline {\mathbf X}-\boldsymbol\mu_0)\) is distributed as \(\frac{(n-1)p}{n-p}F_{p,n-p}\) when \(H_0\) is true:

\[\begin{align} T^2&=\sqrt{n}(\overline {\mathbf X}-\boldsymbol\mu_0)^T\Bigl(\frac{1}{n-1}\displaystyle\sum_{j=1}^{n}(\mathbf X_j-\overline {\mathbf X})(\mathbf X_j-\overline {\mathbf X})^T\Bigr)^{-1}\sqrt{n}(\overline {\mathbf X}-\boldsymbol\mu_0)\\ &=N_p(\mathbf 0,\mathbf\Sigma)^T\Bigl(\frac{1}{n-1}\mathbf W_{p,n-1}(\mathbf\Sigma)\Bigr)^{-1}N_p(\mathbf 0,\mathbf\Sigma) \end{align}\] where \(\mathbf W_{p,n-1}(\mathbf\Sigma)\) is a Wishart random matrix with \(n-1\) degrees of freedom; the second line is shorthand for the distributions of \(\sqrt{n}(\overline {\mathbf X}-\boldsymbol\mu_0)\) and \((n-1)\mathbf S\) under \(H_0\).
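To connect the sampling distribution above with something tangible, the following sketch (an illustration assuming NumPy and SciPy are available, not part of the original text) computes the critical value \(\frac{(n-1)p}{n-p}F_{p,n-p}(\alpha)\) and runs a small Monte Carlo check that, when \(H_0\) is true, \(T^2\) exceeds this value about \(100\alpha\%\) of the time.

```python
# Sketch: critical value ((n-1)p/(n-p)) F_{p,n-p}(alpha) and a Monte Carlo check of
# the T^2 sampling distribution under H0. Sample size and repetition count are arbitrary.
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(0)
n, p, alpha = 20, 3, 0.05
mu0 = np.zeros(p)
crit = (n - 1) * p / (n - p) * f.ppf(1 - alpha, p, n - p)

reps, rejections = 10000, 0
for _ in range(reps):
    X = rng.multivariate_normal(mu0, np.eye(p), size=n)       # sample drawn under H0
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    T2 = n * (xbar - mu0) @ np.linalg.solve(S, xbar - mu0)
    rejections += T2 > crit

print(crit, rejections / reps)    # the empirical rejection rate should be close to alpha
```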

  • Under the hypothesis \(H_0:\boldsymbol\mu=\boldsymbol\mu_0\) the normal likelihood specializes to:\[L(\boldsymbol\mu_0,\mathbf\Sigma)=\frac{1}{(2\pi)^{\frac{np}{2}}|\mathbf\Sigma|^{n/2}}e^{-\frac{1}{2}\sum_{j=1}^{n}(\mathbf x_j-\boldsymbol\mu_0)^T\mathbf\Sigma^{-1}(\mathbf x_j-\boldsymbol\mu_0)}\] The mean \(\boldsymbol\mu_0\) is now fixed, but \(\mathbf\Sigma\) can be varied to find the value that is most likely to have led to the observed sample. This value is obtained by maximizing \(L(\boldsymbol\mu_0,\mathbf\Sigma)\) with respect to \(\mathbf\Sigma\): \(\underset{\mathbf\Sigma}{\text{max }}L(\boldsymbol\mu_0,\mathbf\Sigma)=\frac{1}{(2\pi)^{np/2}}\frac{1}{|\hat{\mathbf\Sigma_0}|^{n/2}}e^{-np/2}\) with \(\hat{\mathbf\Sigma_0}=\frac{1}{n}\displaystyle\sum_{j=1}^{n}(\mathbf x_j-\boldsymbol\mu_0)(\mathbf x_j-\boldsymbol\mu_0)^T\). The maximum of \(L(\boldsymbol\mu_0,\mathbf\Sigma)\) is then compared with the unrestricted maximum of \(L(\boldsymbol\mu,\mathbf\Sigma)\) to determine whether \(\boldsymbol\mu_0\) is a plausible value of \(\boldsymbol\mu\). The resulting ratio is called the likelihood ratio statistic, and this method is called the likelihood ratio test (LRT). The likelihood ratio \(\Lambda\) is: \(\Lambda=\frac{\underset{\mathbf\Sigma}{\text{max }}L(\boldsymbol\mu_0,\mathbf\Sigma)}{\underset{\boldsymbol\mu,\mathbf\Sigma}{\text{max }}L(\boldsymbol\mu,\mathbf\Sigma)}=\frac{|\hat{\mathbf\Sigma}|^{n/2}}{|\hat{\mathbf\Sigma_0}|^{n/2}}=\Biggl(\frac{\Biggl|\displaystyle\sum_{j=1}^{n}(\mathbf x_j-\overline{\mathbf x})(\mathbf x_j-\overline{\mathbf x})^T\Biggr|}{\Biggl|\displaystyle\sum_{j=1}^{n}(\mathbf x_j-\boldsymbol\mu_0)(\mathbf x_j-\boldsymbol\mu_0)^T\Biggr|}\Biggr)^{n/2}\) The quantity \(\Lambda^{\frac{2}{n}}=\frac{|\hat{\mathbf\Sigma}|}{|\hat{\mathbf\Sigma_0}|}\) is called Wilks’ lambda.

  • Because \[\begin{align} \Biggl|\displaystyle\sum_{j=1}^{n}(\mathbf x_j-\boldsymbol\mu_0)(\mathbf x_j-\boldsymbol\mu_0)^T\Biggr|&=\Biggl|\displaystyle\sum_{j=1}^{n}(\mathbf x_j-\overline{\mathbf x})(\mathbf x_j-\overline{\mathbf x})^T+n(\overline{\mathbf x}-\boldsymbol\mu_0)(\overline{\mathbf x}-\boldsymbol\mu_0)^T\Biggr|\\ &=\Biggl|\displaystyle\sum_{j=1}^{n}(\mathbf x_j-\overline{\mathbf x})(\mathbf x_j-\overline{\mathbf x})^T\Biggr|\Biggl|1+n(\overline{\mathbf x}-\boldsymbol\mu_0)^T\Bigl(\displaystyle\sum_{j=1}^{n}(\mathbf x_j-\overline{\mathbf x})(\mathbf x_j-\overline{\mathbf x})^T\Bigr)^{-1}(\overline{\mathbf x}-\boldsymbol\mu_0)\Biggr|\\ &=\Biggl|\displaystyle\sum_{j=1}^{n}(\mathbf x_j-\overline{\mathbf x})(\mathbf x_j-\overline{\mathbf x})^T\Biggr|\Bigl(1+\frac{T^2}{(n-1)}\Bigr) \end{align}\] it follows that \(\Bigl|n\hat{\mathbf\Sigma_0}\Bigr|=\Bigl|n\hat{\mathbf\Sigma}\Bigr|\Bigl(1+\frac{T^2}{(n-1)}\Bigr)\) and hence \(\Lambda^{\frac{2}{n}}=\frac{|\hat{\mathbf\Sigma}|}{|\hat{\mathbf\Sigma_0}|}=\Bigl(1+\frac{T^2}{(n-1)}\Bigr)^{-1}\). Here \(H_0\) is rejected for small values of \(\Lambda^{\frac{2}{n}}\) or, equivalently, large values of \(T^2\) (the first sketch after this list checks this identity numerically).

  • Because \(T^2=n(\overline {\mathbf X}-\boldsymbol\mu)^T\mathbf S^{-1}(\overline {\mathbf X}-\boldsymbol\mu)\) is distributed as \(\frac{(n-1)p}{n-p}F_{p,n-p}\), a \(100(1-\alpha)\%\) confidence region for the mean of a \(p\)-dimensional normal distribution is the ellipsoid of all \(\boldsymbol\mu\) satisfying \(n(\overline {\mathbf X}-\boldsymbol\mu)^T\mathbf S^{-1}(\overline {\mathbf X}-\boldsymbol\mu)\le\frac{(n-1)p}{n-p}F_{p,n-p}(\alpha)\). To test \(H_0\), we can compare the generalized squared distance of \(\boldsymbol\mu_0\) with \(\frac{(n-1)p}{n-p}F_{p,n-p}(\alpha)\): if \(n(\overline {\mathbf X}-\boldsymbol\mu_0)^T\mathbf S^{-1}(\overline {\mathbf X}-\boldsymbol\mu_0)>\frac{(n-1)p}{n-p}F_{p,n-p}(\alpha)\), then \(\boldsymbol\mu_0\) is not in the confidence region. If \(\lambda_i\) and \(\mathbf e_i\) are the eigenvalues and eigenvectors of \(\mathbf S\), the boundary of the region satisfies \(\sqrt{(\overline {\mathbf X}-\boldsymbol\mu)^T\mathbf S^{-1}(\overline {\mathbf X}-\boldsymbol\mu)}=\sqrt{\frac{(n-1)p}{n(n-p)}F_{p,n-p}(\alpha)}\), so the axes of the \(100(1-\alpha)\%\) confidence ellipsoid, beginning at the center \(\overline {\mathbf X}\), are \(\pm\sqrt{\lambda_i}\sqrt{\frac{(n-1)p}{n(n-p)}F_{p,n-p}(\alpha)}\,\mathbf e_i\). The \(100(1-\alpha)\%\) simultaneous confidence intervals for the individual component means of the mean vector are \(\Biggl(\overline X_i-\sqrt{\frac{(n-1)p}{(n-p)}F_{p,n-p}(\alpha)}\sqrt{\frac{s_{ii}}{n}},\quad \overline X_i+\sqrt{\frac{(n-1)p}{(n-p)}F_{p,n-p}(\alpha)}\sqrt{\frac{s_{ii}}{n}}\Biggr)\) (see the ellipsoid-and-intervals sketch after this list).

  • A linear combination of the components of \(\mathbf X_j\), which has a \(N_p(\boldsymbol\mu, \mathbf\Sigma)\) distribution, is \(Z_j=a_1X_{j1}+a_2X_{j2}+\cdots+a_pX_{jp}=\mathbf a^T\mathbf X_j, \quad j=1,2,\cdots,n\), with sample mean \(\overline z=\mathbf a^T\overline {\mathbf X}\) and sample variance \(s_z^2=\mathbf a^T\mathbf S\mathbf a\); then \(\displaystyle\frac{\overline z-\mu_z}{s_z/\sqrt{n}}\), where \(\mu_z=\mathbf a^T\boldsymbol\mu\), is distributed as \(t_{n-1}\). So a \(100(1-\alpha)\%\) confidence interval for the mean of \(z\), \(\mu_z=\mathbf a^T\boldsymbol\mu\), is based on the Student’s \(t\)-ratio \(t=\displaystyle\frac{\overline z-\mu_z}{s_z/\sqrt{n}}=\displaystyle\frac{\sqrt{n}(\mathbf a^T\overline {\mathbf X}-\mathbf a^T\boldsymbol\mu)}{\sqrt{\mathbf a^T\mathbf S\mathbf a}}\) and leads to the statement \(\mathbf a^T\overline {\mathbf X}-t_{n-1}(\alpha/2)\displaystyle\frac{\sqrt{\mathbf a^T\mathbf S\mathbf a}}{\sqrt{n}}\le\mathbf a^T\boldsymbol\mu\le\mathbf a^T\overline {\mathbf X}+t_{n-1}(\alpha/2)\displaystyle\frac{\sqrt{\mathbf a^T\mathbf S\mathbf a}}{\sqrt{n}}\)

  • Based on the Cauchy–Schwarz inequality, \(t^2=\Bigl(\displaystyle\frac{\sqrt{n}(\mathbf a^T\overline {\mathbf X}-\mathbf a^T\boldsymbol\mu)}{\sqrt{\mathbf a^T\mathbf S\mathbf a}}\Bigr)^2=n\frac{(\mathbf a^T(\overline {\mathbf X}-\boldsymbol\mu))^2}{\mathbf a^T\mathbf S\mathbf a}\le n\frac{(\mathbf a^T\mathbf S\mathbf a)(\overline {\mathbf X}-\boldsymbol\mu)^T\mathbf S^{-1}(\overline {\mathbf X}-\boldsymbol\mu)}{\mathbf a^T\mathbf S\mathbf a}=n(\overline {\mathbf X}-\boldsymbol\mu)^T\mathbf S^{-1}(\overline {\mathbf X}-\boldsymbol\mu)=T^2\), so \(|\mathbf a^T\overline {\mathbf X}-\mathbf a^T\boldsymbol\mu|\le\sqrt{(\overline {\mathbf X}-\boldsymbol\mu)^T\mathbf S^{-1}(\overline {\mathbf X}-\boldsymbol\mu)\,\mathbf a^T\mathbf S\mathbf a}\le\sqrt{\frac{(n-1)p}{n(n-p)}F_{p,n-p}(\alpha)\,\mathbf a^T\mathbf S\mathbf a}\), where the last inequality holds with probability \(1-\alpha\). Therefore \(\mathbf a^T\boldsymbol\mu\) is contained in the interval \(\mathbf a^T\overline {\mathbf X}-\sqrt{\frac{(n-1)p}{n(n-p)}F_{p,n-p}(\alpha)\mathbf a^T\mathbf S\mathbf a}\le\mathbf a^T\boldsymbol\mu\le\mathbf a^T\overline {\mathbf X}+\sqrt{\frac{(n-1)p}{n(n-p)}F_{p,n-p}(\alpha)\mathbf a^T\mathbf S\mathbf a}\) simultaneously for every choice of \(\mathbf a\), with probability \(1-\alpha\) (both the one-at-a-time \(t\)-interval and this wider simultaneous interval are computed in a sketch after this list).

  • If separate confidence statements are made about \(m\) linear combinations of the mean components \(\mathbf a_i^T\boldsymbol\mu, \quad i=1,2,\cdots,m\), each with \(P[\text{statement }i\text{ is true}]=1-\alpha_i\), then \(P[\text{all }m\text{ statements are true}]\ge1-\displaystyle\sum_{i=1}^{m}\alpha_i\) (the Bonferroni inequality). If the simultaneous confidence level for the components \(\mu_i\) of \(\boldsymbol\mu\) is to be \(1-\alpha\) and we take \(\alpha_i=\alpha/m\), the individual \(t\)-intervals become \(\overline X_i\pm t_{n-1}(\frac{\alpha}{2m})\sqrt{\frac{s_{ii}}{n}}\), and \(P[\overline X_i\pm t_{n-1}(\frac{\alpha}{2m})\sqrt{\frac{s_{ii}}{n}}\text{ contains }\mu_i\text{ for all }i]\ge1-\displaystyle\sum_{i=1}^{m}\alpha/m=1-\alpha\). This is the Bonferroni method of multiple comparisons (a sketch after this list computes these intervals).

  • Let \(\mathbf X_1, \mathbf X_2, \cdots, \mathbf X_n\) be independently distributed as \(N_p(\boldsymbol \mu,\mathbf\Sigma)\), and let \(\mathbf X\) be a future observation from the same distribution. Then \(E(\mathbf X-\overline{\mathbf X})=\mathbf 0\) and \(Cov(\mathbf X-\overline{\mathbf X})=Cov(\mathbf X)+Cov(\overline{\mathbf X})=\mathbf\Sigma+\frac{1}{n}\mathbf\Sigma=\frac{n+1}{n}\mathbf\Sigma\), so \(\sqrt{\frac{n}{n+1}}(\mathbf X-\overline{\mathbf X})\) is distributed as \(N_p(\mathbf 0,\mathbf\Sigma)\). Consequently \(\sqrt{\frac{n}{n+1}}(\mathbf X-\overline{\mathbf X})^T\mathbf S^{-1}\sqrt{\frac{n}{n+1}}(\mathbf X-\overline{\mathbf X})=\frac{n}{n+1}(\mathbf X-\overline{\mathbf X})^T\mathbf S^{-1}(\mathbf X-\overline{\mathbf X})\) is distributed as \(\frac{(n-1)p}{(n-p)}F_{p,n-p}\), and \(P\Bigl[(\mathbf X-\overline{\mathbf X})^T\mathbf S^{-1}(\mathbf X-\overline{\mathbf X})\le\frac{(n^2-1)p}{n(n-p)}F_{p,n-p}(\alpha)\Bigr]=1-\alpha\), which defines the \(100(1-\alpha)\%\) control (prediction) ellipsoid for a future observation (checked numerically in the last sketch after this list).
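The following sketches are illustrative Python checks of the results above; they are not from the original text, they assume NumPy and SciPy are available, and all data in them are made up. First, a numerical check of the likelihood-ratio identity \(\Lambda^{2/n}=|\hat{\mathbf\Sigma}|/|\hat{\mathbf\Sigma}_0|=(1+T^2/(n-1))^{-1}\):

```python
# Check that Wilks' lambda equals (1 + T^2/(n-1))^{-1} on a made-up sample.
import numpy as np

X = np.array([[6.0, 9.0], [10.0, 6.0], [8.0, 3.0]])    # illustrative data (n = 3, p = 2)
mu0 = np.array([9.0, 5.0])
n, p = X.shape

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)
Sigma_hat = (X - xbar).T @ (X - xbar) / n              # unrestricted MLE of Sigma (divisor n)
Sigma_hat0 = (X - mu0).T @ (X - mu0) / n               # MLE of Sigma under H0: mu = mu0

wilks = np.linalg.det(Sigma_hat) / np.linalg.det(Sigma_hat0)
T2 = n * (xbar - mu0) @ np.linalg.solve(S, xbar - mu0)
print(wilks, 1.0 / (1.0 + T2 / (n - 1)))               # the two numbers should coincide
```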
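Next, a sketch of the \(100(1-\alpha)\%\) confidence ellipsoid axes and the \(T^2\) simultaneous intervals for the component means, again on made-up data:

```python
# Confidence ellipsoid axes and T^2 simultaneous intervals (illustrative data, n = 5, p = 2).
import numpy as np
from scipy.stats import f

X = np.array([[6.0, 9.0], [10.0, 6.0], [8.0, 3.0], [9.0, 7.0], [7.0, 5.0]])
n, p = X.shape
alpha = 0.05

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)

c2 = (n - 1) * p / (n * (n - p)) * f.ppf(1 - alpha, p, n - p)   # (n-1)p/(n(n-p)) F_{p,n-p}(alpha)
eigval, eigvec = np.linalg.eigh(S)                              # lambda_i and e_i (columns)
print(np.sqrt(eigval * c2))                                     # half-lengths of the axes
print(eigvec)                                                   # directions of the axes

# T^2 simultaneous confidence intervals for each component mean mu_i
margin = np.sqrt((n - 1) * p / (n - p) * f.ppf(1 - alpha, p, n - p)) * np.sqrt(np.diag(S) / n)
print(np.column_stack([xbar - margin, xbar + margin]))
```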
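For a chosen vector \(\mathbf a\) (here \(\mathbf a^T=(1,-1)\), an arbitrary choice), the ordinary \(t\)-interval for \(\mathbf a^T\boldsymbol\mu\) can be compared with the wider \(T^2\)-based simultaneous interval:

```python
# One-at-a-time t-interval vs. T^2 simultaneous interval for a^T mu (illustrative data).
import numpy as np
from scipy.stats import t, f

X = np.array([[6.0, 9.0], [10.0, 6.0], [8.0, 3.0], [9.0, 7.0], [7.0, 5.0]])
a = np.array([1.0, -1.0])                  # arbitrary linear combination: mu_1 - mu_2
n, p = X.shape
alpha = 0.05

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)
center = a @ xbar
se = np.sqrt(a @ S @ a / n)                # sqrt(a^T S a / n)

t_half = t.ppf(1 - alpha / 2, n - 1) * se                                   # t-interval half-width
T2_half = np.sqrt((n - 1) * p / (n - p) * f.ppf(1 - alpha, p, n - p)) * se  # simultaneous half-width
print(center - t_half, center + t_half)
print(center - T2_half, center + T2_half)  # always at least as wide as the t-interval
```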
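The Bonferroni intervals for the \(m=p\) component means replace \(t_{n-1}(\alpha/2)\) with \(t_{n-1}(\alpha/2m)\):

```python
# Bonferroni simultaneous intervals for the p component means (illustrative data).
import numpy as np
from scipy.stats import t

X = np.array([[6.0, 9.0], [10.0, 6.0], [8.0, 3.0], [9.0, 7.0], [7.0, 5.0]])
n, p = X.shape
alpha = 0.05
m = p                                       # one statement per component mean

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)
margin = t.ppf(1 - alpha / (2 * m), n - 1) * np.sqrt(np.diag(S) / n)
print(np.column_stack([xbar - margin, xbar + margin]))   # joint coverage at least 1 - alpha
```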
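Finally, a sketch that checks whether a new observation \(\mathbf x_{\text{new}}\) (made up, like the rest of the data) falls inside the \(100(1-\alpha)\%\) control ellipsoid for a future observation:

```python
# Control (prediction) ellipsoid check for a future observation (illustrative data).
import numpy as np
from scipy.stats import f

X = np.array([[6.0, 9.0], [10.0, 6.0], [8.0, 3.0], [9.0, 7.0], [7.0, 5.0]])
x_new = np.array([8.5, 6.0])                # hypothetical future observation
n, p = X.shape
alpha = 0.05

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)
d2 = (x_new - xbar) @ np.linalg.solve(S, x_new - xbar)                 # (x - xbar)^T S^{-1} (x - xbar)
bound = (n**2 - 1) * p / (n * (n - p)) * f.ppf(1 - alpha, p, n - p)    # (n^2-1)p/(n(n-p)) F_{p,n-p}(alpha)
print(d2, bound, d2 <= bound)               # True means x_new lies inside the ellipsoid
```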