
Randomized one-factor design and the analysis of variance (ANOVA)

If we want to compare the average effects elicited by \(k\) different levels of some given factor, we draw \(k\) independent random samples of sizes \(n_j\ (j=1,2,\ldots,k)\), so the total sample size is \(n=\displaystyle\sum_{j=1}^{k}n_j\). Let \(Y_{ij}\) denote the \(i^{th}\) observation recorded for the \(j^{th}\) level. \[ \begin{array}{|c|cccc|} \hline \text{Treatment levels:} & 1 & 2 & \cdots & k \\ \hline & Y_{11} & Y_{12} & \cdots & Y_{1k} \\ & Y_{21} & Y_{22} & \cdots & Y_{2k} \\ &\vdots &\vdots & &\vdots \\ & Y_{n_1 1} & Y_{n_2 2} &\cdots& Y_{n_k k} \\ \text{Sample sizes:}&n_1&n_2&\cdots&n_k\\ \text{Sample totals:}&T_{.1}&T_{.2}&\cdots&T_{.k}\\ \text{Sample means:}&\overline Y_{.1}&\overline Y_{.2}&\cdots&\overline Y_{.k}\\ \text{True means:}&\mu_1&\mu_2&\cdots&\mu_k\\ \hline \end{array} \]

where \(T_{.j}=\displaystyle\sum_{i=1}^{n_j}Y_{ij}\) is the \(j^{th}\) sample total, \(\overline Y_{.j}=\frac{1}{n_j}\displaystyle\sum_{i=1}^{n_j}Y_{ij}=\frac{T_{.j}}{n_j}\) is the \(j^{th}\) sample mean, the overall total is \(T_{..}=\displaystyle\sum_{j=1}^{k}\sum_{i=1}^{n_j}Y_{ij}=\sum_{j=1}^{k}T_{.j}\), and the overall mean is \(\overline Y_{..}=\frac{1}{n}T_{..}\).
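To make these definitions concrete, here is a minimal Python sketch (using NumPy and made-up data for \(k=3\) levels with unequal sample sizes) that computes the sample totals, sample means, overall total, and overall mean:

```python
import numpy as np

# Made-up data: k = 3 treatment levels with unequal sample sizes n_j
samples = [
    np.array([4.1, 5.0, 4.6, 5.2]),       # level 1, n_1 = 4
    np.array([6.3, 5.8, 6.1]),            # level 2, n_2 = 3
    np.array([5.5, 5.9, 6.0, 5.4, 5.7]),  # level 3, n_3 = 5
]

n_j = np.array([len(s) for s in samples])   # sample sizes n_j
T_j = np.array([s.sum() for s in samples])  # sample totals T_.j
Ybar_j = T_j / n_j                          # sample means Ybar_.j
n = n_j.sum()                               # total sample size n
T = T_j.sum()                               # overall total T_..
Ybar = T / n                                # overall mean Ybar_..

print(n_j, Ybar_j, n, Ybar)
```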

Now we presume that for each \(j\), \(Y_{1j}, Y_{2j}, \ldots, Y_{n_j j}\) is a random sample from a normal distribution with mean \(\mu_j,\ j = 1, 2, \ldots, k,\) and variance \(\sigma^2\) (the same for all \(j\)), and that the \(k\) samples are independent of one another. The maximum likelihood estimator of \(\mu_j\) is \(\overline Y_{.j}\). Writing \(\mu=\frac{1}{n}\sum_{j=1}^{k}n_j\mu_j\) for the (weighted) overall mean, so that \(E(\overline Y_{..})=\mu\), the maximum likelihood estimator of \(\mu\) is \(\overline Y_{..}\).

The treatment sum of squares (\(SSTR\)) estimates the variation among the \(\mu_j\)'s and is defined by \[\begin{align} SSTR=\sum_{j=1}^{k}\sum_{i=1}^{n_j}(\overline Y_{.j}-\overline Y_{..})^2 &=\sum_{j=1}^{k}n_j(\overline Y_{.j}-\overline Y_{..})^2\\ &=\sum_{j=1}^{k}n_j\Bigl[(\overline Y_{.j}-\mu)-(\overline Y_{..}-\mu)\Bigr]^2\\ &=\sum_{j=1}^{k}n_j\Bigl[(\overline Y_{.j}-\mu)^2-2(\overline Y_{.j}-\mu)(\overline Y_{..}-\mu)+(\overline Y_{..}-\mu)^2\Bigr]\\ &=\sum_{j=1}^{k}n_j(\overline Y_{.j}-\mu)^2+\sum_{j=1}^{k}n_j(\overline Y_{..}-\mu)^2-2\sum_{j=1}^{k}n_j(\overline Y_{.j}-\mu)(\overline Y_{..}-\mu)\\ &=\sum_{j=1}^{k}n_j(\overline Y_{.j}-\mu)^2+n(\overline Y_{..}-\mu)^2-2(\overline Y_{..}-\mu)\,n(\overline Y_{..}-\mu)\\ &=\sum_{j=1}^{k}n_j(\overline Y_{.j}-\mu)^2-n(\overline Y_{..}-\mu)^2\\ &=\sum_{j=1}^{k}n_j\overline Y_{.j}^2-2\sum_{j=1}^{k}n_j\overline Y_{.j}\mu+\sum_{j=1}^{k}n_j\mu^2-n\overline Y_{..}^2+2n\overline Y_{..}\mu-n\mu^2\\ &=\sum_{j=1}^{k}n_j\overline Y_{.j}^2-n\overline Y_{..}^2\\ &=\sum_{j=1}^{k}\frac{T_{.j}^2}{n_j}-\frac{T_{..}^2}{n} \end{align}\] where the cross term collapses because \(\sum_{j=1}^{k}n_j(\overline Y_{.j}-\mu)=n(\overline Y_{..}-\mu)\). Taking expectations, and using \(E(\overline Y_{..})=\mu\) so that \(E(\overline Y_{..}-\mu)^2=Var(\overline Y_{..})=\frac{\sigma^2}{n}\), \[\begin{align} E(SSTR)&=E\Bigl(\sum_{j=1}^{k}n_j(\overline Y_{.j}-\mu)^2-n(\overline Y_{..}-\mu)^2\Bigr)\\ &=\sum_{j=1}^{k}n_jE(\overline Y_{.j}-\mu)^2-nE(\overline Y_{..}-\mu)^2\\ &=\sum_{j=1}^{k}n_j\Bigl(Var(\overline Y_{.j})+\bigl(E(\overline Y_{.j}-\mu)\bigr)^2\Bigr)-n\frac{\sigma^2}{n}\\ &=\sum_{j=1}^{k}n_j\frac{\sigma^2}{n_j}+\sum_{j=1}^{k}n_j\bigl(E(\overline Y_{.j}-\mu)\bigr)^2-\sigma^2\\ &=(k-1)\sigma^2+\sum_{j=1}^{k}n_j(\mu_j-\mu)^2 \end{align}\]
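The computational shortcut in the last line of the \(SSTR\) derivation is easy to verify numerically. A small sketch, reusing the hypothetical data from above:

```python
import numpy as np

samples = [np.array([4.1, 5.0, 4.6, 5.2]),
           np.array([6.3, 5.8, 6.1]),
           np.array([5.5, 5.9, 6.0, 5.4, 5.7])]

n_j = np.array([len(s) for s in samples])
T_j = np.array([s.sum() for s in samples])
n = n_j.sum()
ybar_j = T_j / n_j                      # group means Ybar_.j
ybar = np.concatenate(samples).mean()   # overall mean Ybar_..

# Definition: n_j-weighted squared deviations of the group means
sstr_def = np.sum(n_j * (ybar_j - ybar) ** 2)

# Shortcut: sum of T_.j^2 / n_j minus T_..^2 / n
sstr_short = np.sum(T_j ** 2 / n_j) - T_j.sum() ** 2 / n

assert np.isclose(sstr_def, sstr_short)
print(sstr_def, sstr_short)
```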

If \(\sigma^2\) were known, we could test the null hypothesis that the treatment-level means are all equal, \(H_0:\mu_1 = \mu_2 = \cdots = \mu_k=\mu\), directly: when \(H_0\) is true, \(E(SSTR)=(k-1)\sigma^2\), and \(\frac{SSTR}{\sigma^2}\) has a \(\chi^2\) distribution with \(k-1\) degrees of freedom.
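A quick Monte Carlo check of this distributional claim, simulating data under \(H_0\) (the group sizes, common mean, and \(\sigma\) below are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_j = np.array([4, 3, 5])           # arbitrary group sizes
k, mu, sigma = len(n_j), 5.0, 1.0   # H0: every level has mean mu

def sstr_over_sigma2():
    samples = [rng.normal(mu, sigma, nj) for nj in n_j]
    ybar_j = np.array([s.mean() for s in samples])
    ybar = np.concatenate(samples).mean()
    return np.sum(n_j * (ybar_j - ybar) ** 2) / sigma ** 2

draws = np.array([sstr_over_sigma2() for _ in range(20_000)])

# chi^2_{k-1} has mean k - 1 = 2; the simulated mean and upper
# quantile should agree with the theoretical values
print(draws.mean())                                           # ~2
print(np.quantile(draws, 0.95), stats.chi2.ppf(0.95, k - 1))  # ~equal
```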

When \(\sigma^2\) is unknown, it must be estimated from the within-sample variation. The \(j^{th}\) sample variance is \[S_j^2=\frac{1}{n_j-1}\displaystyle\sum_{i=1}^{n_j}(Y_{ij}-\overline Y_{.j})^2,\] and the error sum of squares (\(SSE\)) pools these across levels: \[SSE=\sum_{j=1}^{k}(n_j-1)S_j^2=\sum_{j=1}^{k}\sum_{i=1}^{n_j}(Y_{ij}-\overline Y_{.j})^2.\]
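In code, the two forms of \(SSE\) agree, as a short sketch with the same hypothetical data confirms:

```python
import numpy as np

samples = [np.array([4.1, 5.0, 4.6, 5.2]),
           np.array([6.3, 5.8, 6.1]),
           np.array([5.5, 5.9, 6.0, 5.4, 5.7])]

# SSE as pooled within-sample variation: sum of (n_j - 1) * S_j^2
sse_pooled = sum((len(s) - 1) * np.var(s, ddof=1) for s in samples)

# Equivalent direct form: squared deviations from each group mean
sse_direct = sum(np.sum((s - s.mean()) ** 2) for s in samples)

assert np.isclose(sse_pooled, sse_direct)
print(sse_pooled)
```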

Whether or not \(\mu_1 = \mu_2 = \cdots = \mu_k\) is true, \((n_j-1)S_j^2/\sigma^2\) has a \(\chi^2\) distribution with \(n_j-1\) degrees of freedom; since the \(k\) samples are independent, summing these gives that \(SSE/\sigma^2\) has a \(\chi^2\) distribution with \(\sum_{j=1}^{k}(n_j-1)=n-k\) degrees of freedom.
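This is worth seeing numerically: even when the \(\mu_j\) deliberately differ, simulated values of \(SSE/\sigma^2\) still match the \(\chi^2_{n-k}\) distribution. A sketch with arbitrary unequal means:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_j = np.array([4, 3, 5])
mus = np.array([4.0, 6.0, 5.5])   # deliberately unequal means: H0 false
sigma = 1.0
n, k = n_j.sum(), len(n_j)

def sse_over_sigma2():
    samples = [rng.normal(m, sigma, nj) for m, nj in zip(mus, n_j)]
    return sum(np.sum((s - s.mean()) ** 2) for s in samples) / sigma ** 2

draws = np.array([sse_over_sigma2() for _ in range(20_000)])

# chi^2_{n-k} has mean n - k = 9, even though the mu_j differ
print(draws.mean())                                           # ~9
print(np.quantile(draws, 0.95), stats.chi2.ppf(0.95, n - k))  # ~equal
```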

The total sum of squares (\(SSTOT\)) measures the variation of the data about the overall mean \(\overline Y_{..}\), and it partitions into the two pieces above: \[\begin{align} SSTOT&=\sum_{j=1}^{k}\sum_{i=1}^{n_j}(Y_{ij}-\overline Y_{..})^2\\ &=\sum_{j=1}^{k}\sum_{i=1}^{n_j}\Bigl[(Y_{ij}-\overline Y_{.j})+(\overline Y_{.j}-\overline Y_{..})\Bigr]^2\\ &=\sum_{j=1}^{k}\sum_{i=1}^{n_j}(Y_{ij}-\overline Y_{.j})^2+\sum_{j=1}^{k}\sum_{i=1}^{n_j}(\overline Y_{.j}-\overline Y_{..})^2+2\sum_{j=1}^{k}\sum_{i=1}^{n_j}(Y_{ij}-\overline Y_{.j})(\overline Y_{.j}-\overline Y_{..})\\ &=\sum_{j=1}^{k}\sum_{i=1}^{n_j}(Y_{ij}-\overline Y_{.j})^2+\sum_{j=1}^{k}\sum_{i=1}^{n_j}(\overline Y_{.j}-\overline Y_{..})^2+2\sum_{j=1}^{k}(\overline Y_{.j}-\overline Y_{..})\sum_{i=1}^{n_j}(Y_{ij}-\overline Y_{.j})\\ &=\sum_{j=1}^{k}\sum_{i=1}^{n_j}(Y_{ij}-\overline Y_{.j})^2+\sum_{j=1}^{k}\sum_{i=1}^{n_j}(\overline Y_{.j}-\overline Y_{..})^2\\ &=SSE+SSTR \end{align}\] where the cross term vanishes because \(\sum_{i=1}^{n_j}(Y_{ij}-\overline Y_{.j})=0\) for each \(j\).
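The partition \(SSTOT = SSTR + SSE\) can be checked directly on data; a minimal sketch with the same hypothetical samples:

```python
import numpy as np

samples = [np.array([4.1, 5.0, 4.6, 5.2]),
           np.array([6.3, 5.8, 6.1]),
           np.array([5.5, 5.9, 6.0, 5.4, 5.7])]

all_y = np.concatenate(samples)
ybar = all_y.mean()

sstot = np.sum((all_y - ybar) ** 2)
sstr = sum(len(s) * (s.mean() - ybar) ** 2 for s in samples)
sse = sum(np.sum((s - s.mean()) ** 2) for s in samples)

assert np.isclose(sstot, sstr + sse)   # the partition SSTOT = SSTR + SSE
print(sstot, sstr + sse)
```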

Because \(SSTR/\sigma^2\) and \(SSE/\sigma^2\) are independent \(\chi^2\) random variables with \(k-1\) and \(n-k\) degrees of freedom, respectively, it follows that if \(H_0:\mu_1 = \mu_2 = \cdots = \mu_k\) is true, \[F=\frac{SSTR/(k-1)}{SSE/(n-k)}\] has an \(F\) distribution with \(k-1\) and \(n-k\) degrees of freedom.

ANOVA table for testing \(H_0:\mu_1 = \mu_2 = \cdots = \mu_k\):

\[ \begin{array}{lccccc} \hline \text{Source} & \text{df} & SS & MS & F & P \\ \hline \text{Treatment} & k-1 & SSTR & MSTR & \frac{MSTR}{MSE} & P(F_{k-1,\,n-k} \ge F_{\text{obs}}) \\ \text{Error} & n-k & SSE & MSE & & \\ \text{Total} & n-1 & SSTOT & & & \\ \hline \end{array} \] where \(MSTR=SSTR/(k-1)\), \(MSE=SSE/(n-k)\), and \(F_{\text{obs}}\) is the observed value of the statistic. The \(F\)-test rejects \(H_0:\mu_1 = \mu_2 = \cdots = \mu_k\) at significance level \(\alpha\) if \(F=\frac{MSTR}{MSE}>F_{k-1,\,n-k}(\alpha)\), the upper-\(\alpha\) critical value of the \(F_{k-1,\,n-k}\) distribution.
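Putting it all together, the following sketch computes the ANOVA \(F\) statistic and \(p\)-value by hand for the hypothetical data used throughout, and cross-checks them against SciPy's `scipy.stats.f_oneway`:

```python
import numpy as np
from scipy import stats

samples = [np.array([4.1, 5.0, 4.6, 5.2]),
           np.array([6.3, 5.8, 6.1]),
           np.array([5.5, 5.9, 6.0, 5.4, 5.7])]
k = len(samples)
n = sum(len(s) for s in samples)

ybar = np.concatenate(samples).mean()
sstr = sum(len(s) * (s.mean() - ybar) ** 2 for s in samples)
sse = sum(np.sum((s - s.mean()) ** 2) for s in samples)

mstr, mse = sstr / (k - 1), sse / (n - k)
F = mstr / mse
p = stats.f.sf(F, k - 1, n - k)     # P(F_{k-1, n-k} >= observed F)
print(F, p)

# Cross-check against SciPy's built-in one-way ANOVA
print(stats.f_oneway(*samples))     # same F statistic and p-value

# Reject H0 at level alpha = 0.05 if F exceeds the critical value
print(F > stats.f.ppf(0.95, k - 1, n - k))
```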