
Classification

Two classes \(\pi_1\) and \(\pi_2\) have prior probabilities \(p_1\) and \(p_2\), respectively, with \(p_1+p_2=1\). An observation \(\mathbf x\) from \(\pi_1\) has density \(f_1(\mathbf x)\) and one from \(\pi_2\) has density \(f_2(\mathbf x)\) over the region \(R_1\cup R_2\), with \(\underset{R_1}{\int} f_1(\mathbf x)d\mathbf x=P(1|1)\), \(\underset{R_2}{\int} f_1(\mathbf x)d\mathbf x=P(2|1)\), \(\underset{R_2}{\int} f_2(\mathbf x)d\mathbf x=P(2|2)\), \(\underset{R_1}{\int} f_2(\mathbf x)d\mathbf x=P(1|2)\). Then the probability that an observation \(\mathbf x\) comes from class \(\pi_1\) and is correctly classified as \(\pi_1\) is \[P(\mathbf x\in R_1|\pi_1)P(\pi_1)=P(1|1)p_1\] and the probability that it comes from \(\pi_2\) but is misclassified as \(\pi_1\) is \[P(\mathbf x\in R_1|\pi_2)P(\pi_2)=P(1|2)p_2\] Similarly, the probability that an observation comes from \(\pi_2\) and is correctly classified as \(\pi_2\) is \[P(\mathbf x\in R_2|\pi_2)P(\pi_2)=P(2|2)p_2\] and the probability that it comes from \(\pi_1\) but is misclassified as \(\pi_2\) is \[P(\mathbf x\in R_2|\pi_1)P(\pi_1)=P(2|1)p_1\] The costs of misclassification can be defined by a cost matrix \[\begin{array}{cc|cc} &&\text{Classify as:}\\ &&\pi_1&\pi_2\\ \hline\\ \text{True populations:}&\pi_1&0&c(2|1)\\ &\pi_2&c(1|2)&0\\ \end{array}\] Then the Expected Cost of Misclassification (ECM) is given by \[\begin{bmatrix} P(2|1)&P(1|2)\\ \end{bmatrix}\begin{bmatrix} 0&c(2|1)\\ c(1|2)&0\\ \end{bmatrix}\begin{bmatrix} p_2\\ p_1\\ \end{bmatrix}=P(1|2)c(1|2)p_2+P(2|1)c(2|1)p_1\] A reasonable classification rule should make the ECM as small as possible. The boundary between the regions \(R_1\) and \(R_2\) that minimizes the ECM is defined by the values of \(\mathbf x\) for which \[\frac{f_1(\mathbf x)}{f_2(\mathbf x)}=\frac{c(1|2)p_2}{c(2|1)p_1}\] We classify a new observation \(\mathbf x_0\) into \(\pi_1\) if \[\frac{f_1(\mathbf x_0)}{f_2(\mathbf x_0)}\ge\frac{c(1|2)p_2}{c(2|1)p_1}\] and into \(\pi_2\) if \[\frac{f_1(\mathbf x_0)}{f_2(\mathbf x_0)}<\frac{c(1|2)p_2}{c(2|1)p_1}\]
The Total Probability of Misclassification (TPM) is given by \(\text{TPM}=P(2|1)p_1+P(1|2)p_2\).
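
As a quick illustration of the minimum ECM rule with general densities, here is a minimal Python (NumPy) sketch using two made-up univariate normal class densities; all priors, costs, and test points are hypothetical values chosen only for illustration.

```python
import numpy as np

# Hypothetical one-dimensional example with normal class densities (all values made up).
def f1(x): return np.exp(-0.5 * (x - 0.0) ** 2) / np.sqrt(2 * np.pi)   # density of pi_1
def f2(x): return np.exp(-0.5 * (x - 2.0) ** 2) / np.sqrt(2 * np.pi)   # density of pi_2

p1, p2 = 0.6, 0.4      # prior probabilities, p1 + p2 = 1
c21, c12 = 1.0, 2.0    # c(2|1): cost of calling a pi_1 item pi_2; c(1|2): the reverse

def classify(x0):
    """Minimum-ECM rule: allocate to pi_1 iff f1(x0)/f2(x0) >= c(1|2) p2 / (c(2|1) p1)."""
    return 1 if f1(x0) / f2(x0) >= (c12 * p2) / (c21 * p1) else 2

print(classify(0.5), classify(1.8))   # expect 1 near the pi_1 mean, 2 near the pi_2 mean
```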

  1. Assume that \(f_1(\mathbf x)\) and \(f_2(\mathbf x)\) are multivariate normal densities that share the same covariance matrix \(\boldsymbol\Sigma\), so the populations are distributed as \(N(\boldsymbol\mu_1, \boldsymbol\Sigma)\) and \(N(\boldsymbol\mu_2, \boldsymbol\Sigma)\). The joint densities of a \(p\)-dimensional normal random vector \(\mathbf X^T=[X_1,X_2,\cdots,X_p]\) for populations \(\pi_1\) and \(\pi_2\) have the form \[ f_i(\mathbf x)=\frac{1}{(2\pi)^{p/2}|\boldsymbol\Sigma|^{1/2}}e^{-\frac{1}{2}(\mathbf x-\boldsymbol \mu_i)^T\boldsymbol\Sigma^{-1}(\mathbf x-\boldsymbol \mu_i)}, \quad i=1,2\] so the minimum ECM regions become \[R_1: \quad\frac{f_1(\mathbf x)}{f_2(\mathbf x)}=\exp\Bigl[-\frac{1}{2}(\mathbf x-\boldsymbol \mu_1)^T\boldsymbol\Sigma^{-1}(\mathbf x-\boldsymbol \mu_1)+\frac{1}{2}(\mathbf x-\boldsymbol \mu_2)^T\boldsymbol\Sigma^{-1}(\mathbf x-\boldsymbol \mu_2)\Bigr]\ge\frac{c(1|2)p_2}{c(2|1)p_1}\] \[R_2: \quad\frac{f_1(\mathbf x)}{f_2(\mathbf x)}=\exp\Bigl[-\frac{1}{2}(\mathbf x-\boldsymbol \mu_1)^T\boldsymbol\Sigma^{-1}(\mathbf x-\boldsymbol \mu_1)+\frac{1}{2}(\mathbf x-\boldsymbol \mu_2)^T\boldsymbol\Sigma^{-1}(\mathbf x-\boldsymbol \mu_2)\Bigr]<\frac{c(1|2)p_2}{c(2|1)p_1}\] Because \[-\frac{1}{2}(\mathbf x-\boldsymbol \mu_1)^T\boldsymbol\Sigma^{-1}(\mathbf x-\boldsymbol \mu_1)+\frac{1}{2}(\mathbf x-\boldsymbol \mu_2)^T\boldsymbol\Sigma^{-1}(\mathbf x-\boldsymbol \mu_2)=(\boldsymbol \mu_1-\boldsymbol \mu_2)^T\boldsymbol\Sigma^{-1}\mathbf x-\frac{1}{2}(\boldsymbol \mu_1-\boldsymbol \mu_2)^T\boldsymbol\Sigma^{-1}(\boldsymbol \mu_1+\boldsymbol \mu_2)\] taking logarithms gives \[R_1: \quad (\boldsymbol \mu_1-\boldsymbol \mu_2)^T\boldsymbol\Sigma^{-1}\mathbf x-\frac{1}{2}(\boldsymbol \mu_1-\boldsymbol \mu_2)^T\boldsymbol\Sigma^{-1}(\boldsymbol \mu_1+\boldsymbol \mu_2)\ge \ln\Bigl(\frac{c(1|2)p_2}{c(2|1)p_1}\Bigr)\] \[R_2: \quad (\boldsymbol \mu_1-\boldsymbol \mu_2)^T\boldsymbol\Sigma^{-1}\mathbf x-\frac{1}{2}(\boldsymbol \mu_1-\boldsymbol \mu_2)^T\boldsymbol\Sigma^{-1}(\boldsymbol \mu_1+\boldsymbol \mu_2)< \ln\Bigl(\frac{c(1|2)p_2}{c(2|1)p_1}\Bigr)\]
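
     In the normal case the log form of the rule is linear in \(\mathbf x\). A minimal NumPy sketch with assumed (made-up) population parameters, priors, and costs:

     ```python
     import numpy as np

     # Assumed (made-up) population parameters, priors, and costs.
     mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 1.0])
     Sigma = np.array([[1.0, 0.3],
                       [0.3, 1.0]])
     p1, p2, c21, c12 = 0.5, 0.5, 1.0, 1.0

     w = np.linalg.solve(Sigma, mu1 - mu2)        # Sigma^{-1} (mu1 - mu2)
     m = 0.5 * w @ (mu1 + mu2)                    # (1/2)(mu1 - mu2)^T Sigma^{-1} (mu1 + mu2)
     threshold = np.log((c12 * p2) / (c21 * p1))  # ln( c(1|2) p2 / (c(2|1) p1) )

     def classify(x):
         """Allocate x to pi_1 iff the linear score w^T x - m reaches the log threshold."""
         return 1 if w @ x - m >= threshold else 2

     print(classify(np.array([0.8, -0.2])))       # close to mu1, so expect class 1
     ```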

  2. Suppose we have \(n_1\) observations of the multivariate random variable from \(\pi_1\) and \(n_2\) observations from \(\pi_2\), \[\underset{n_1\times p}{\mathbf X_1}=\begin{bmatrix} \mathbf x_{11}^T\\ \mathbf x_{12}^T\\ \vdots\\ \mathbf x_{1n_1}^T \end{bmatrix}\] and \[\underset{n_2\times p}{\mathbf X_2}=\begin{bmatrix} \mathbf x_{21}^T\\ \mathbf x_{22}^T\\ \vdots\\ \mathbf x_{2n_2}^T \end{bmatrix}\] Then \[\underset{(p\times 1)}{\bar{\mathbf x}_1}=\frac{1}{n_1}\sum_{j=1}^{n_1}\mathbf x_{1j}\] is the unbiased estimate of \(\boldsymbol \mu_1\) with sample covariance matrix \[\underset{(p\times p)}{\mathbf S_1}=\frac{1}{n_1-1}\sum_{j=1}^{n_1}(\mathbf x_{1j}-\bar{\mathbf x}_1)(\mathbf x_{1j}-\bar{\mathbf x}_1)^T\] and \[\underset{(p\times 1)}{\bar{\mathbf x}_2}=\frac{1}{n_2}\sum_{j=1}^{n_2}\mathbf x_{2j}\] is the unbiased estimate of \(\boldsymbol \mu_2\) with sample covariance matrix \[\underset{(p\times p)}{\mathbf S_2}=\frac{1}{n_2-1}\sum_{j=1}^{n_2}(\mathbf x_{2j}-\bar{\mathbf x}_2)(\mathbf x_{2j}-\bar{\mathbf x}_2)^T\] Let \[\mathbf S=\frac{n_1-1}{n_1-1+n_2-1}\mathbf S_1+\frac{n_2-1}{n_1-1+n_2-1}\mathbf S_2,\] which is the pooled unbiased estimate of \(\boldsymbol\Sigma\). Then the estimated minimum Expected Cost of Misclassification (ECM) rule for two normal populations is \[R_1: \quad (\bar{\mathbf x}_1-\bar{\mathbf x}_2)^T\mathbf S^{-1}\mathbf x_0-\frac{1}{2}(\bar{\mathbf x}_1-\bar{\mathbf x}_2)^T\mathbf S^{-1}(\bar{\mathbf x}_1+\bar{\mathbf x}_2)\ge \ln\Bigl(\frac{c(1|2)p_2}{c(2|1)p_1}\Bigr)\] \[R_2: \quad (\bar{\mathbf x}_1-\bar{\mathbf x}_2)^T\mathbf S^{-1}\mathbf x_0-\frac{1}{2}(\bar{\mathbf x}_1-\bar{\mathbf x}_2)^T\mathbf S^{-1}(\bar{\mathbf x}_1+\bar{\mathbf x}_2)< \ln\Bigl(\frac{c(1|2)p_2}{c(2|1)p_1}\Bigr)\]
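
     A minimal NumPy sketch of this estimated rule: it computes \(\bar{\mathbf x}_1\), \(\bar{\mathbf x}_2\), the pooled \(\mathbf S\), and classifies a new \(\mathbf x_0\). The training samples are simulated, made-up data and the priors and costs are assumed values.

     ```python
     import numpy as np

     rng = np.random.default_rng(0)
     # Hypothetical training samples from the two populations (illustration only).
     X1 = rng.multivariate_normal([1.0, 0.0], np.eye(2), size=50)   # n1 observations from pi_1
     X2 = rng.multivariate_normal([-1.0, 1.0], np.eye(2), size=60)  # n2 observations from pi_2
     p1, p2, c21, c12 = 0.5, 0.5, 1.0, 1.0

     xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
     S1 = np.cov(X1, rowvar=False)   # unbiased sample covariance (divides by n1 - 1)
     S2 = np.cov(X2, rowvar=False)
     n1, n2 = len(X1), len(X2)
     S_pooled = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)

     w = np.linalg.solve(S_pooled, xbar1 - xbar2)
     m = 0.5 * w @ (xbar1 + xbar2)

     def classify(x0):
         """Estimated minimum-ECM rule for two normal populations with pooled S."""
         return 1 if w @ x0 - m >= np.log((c12 * p2) / (c21 * p1)) else 2

     print(classify(np.array([0.5, 0.2])))   # lies near the pi_1 sample, so expect class 1
     ```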

  3. Fisher’s approach to classification with two populations having the same covariance matrix: a fixed linear combination \(y=\mathbf a^T\mathbf x\) of the variables, applied to the \(n_1\) observations from \(\pi_1\) and the \(n_2\) observations from \(\pi_2\), gives \[\underset{(n_1\times p)}{\mathbf X_1}\underset{(p\times 1)}{\mathbf a}=\begin{bmatrix} \mathbf x_{11}^T\\ \mathbf x_{12}^T\\ \vdots\\ \mathbf x_{1n_1}^T \end{bmatrix}\mathbf a=\begin{bmatrix} y_{11}\\ y_{12}\\ \vdots\\ y_{1n_1}\\ \end{bmatrix}\] \[\underset{(n_2\times p)}{\mathbf X_2}\underset{(p\times 1)}{\mathbf a}=\begin{bmatrix} \mathbf x_{21}^T\\ \mathbf x_{22}^T\\ \vdots\\ \mathbf x_{2n_2}^T \end{bmatrix}\mathbf a=\begin{bmatrix} y_{21}\\ y_{22}\\ \vdots\\ y_{2n_2}\\ \end{bmatrix}\] The objective is to select the linear combination of the \(\mathbf x\) that achieves maximum separation of the sample means \(\bar{y}_1\) and \(\bar{y}_2\), that is, to make \[\text{separation}=\frac{|\bar{y}_1-\bar{y}_2|}{s_y}\] as large as possible, where \[s_y^2=\frac{\displaystyle\sum_{j=1}^{n_1}(y_{1j}-\bar{y}_1)^2+\displaystyle\sum_{j=1}^{n_2}(y_{2j}-\bar{y}_2)^2}{n_1+n_2-2}\] is the pooled estimate of the variance. Because \(\bar{y}_1=\mathbf a^T\bar{\mathbf x}_1\), \(\bar{y}_2=\mathbf a^T\bar{\mathbf x}_2\), and \(s_y^2=\mathbf a^T\mathbf S\mathbf a\), the extended Cauchy-Schwarz inequality gives \[\text{separation}^2=\frac{(\bar{y}_1-\bar{y}_2)^2}{s_y^2}=\frac{(\mathbf a^T\bar{\mathbf x}_1-\mathbf a^T\bar{\mathbf x}_2)^2}{\mathbf a^T\mathbf S\mathbf a}=\frac{(\mathbf a^T(\bar{\mathbf x}_1-\bar{\mathbf x}_2))^2}{\mathbf a^T\mathbf S\mathbf a}\le (\bar{\mathbf x}_1-\bar{\mathbf x}_2)^T\mathbf S^{-1}(\bar{\mathbf x}_1-\bar{\mathbf x}_2)\] The maximum is achieved when \(\mathbf a=\mathbf S^{-1}(\bar{\mathbf x}_1-\bar{\mathbf x}_2)\) (or any scalar multiple of it), and we allocate a new observation \(\mathbf x_0\) to \[R_1: \quad (\bar{\mathbf x}_1-\bar{\mathbf x}_2)^T\mathbf S^{-1}\mathbf x_0 \ge\frac{1}{2}(\bar{\mathbf x}_1-\bar{\mathbf x}_2)^T\mathbf S^{-1}(\bar{\mathbf x}_1+\bar{\mathbf x}_2)\] \[R_2: \quad (\bar{\mathbf x}_1-\bar{\mathbf x}_2)^T\mathbf S^{-1}\mathbf x_0<\frac{1}{2}(\bar{\mathbf x}_1-\bar{\mathbf x}_2)^T\mathbf S^{-1}(\bar{\mathbf x}_1+\bar{\mathbf x}_2)\] Fisher’s classification rule is therefore equivalent to the minimum ECM rule with equal prior probabilities and equal costs of misclassification.
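
     The sketch below (made-up simulated data) computes Fisher’s coefficient vector \(\mathbf a=\mathbf S^{-1}(\bar{\mathbf x}_1-\bar{\mathbf x}_2)\), checks numerically that the squared separation attains the Cauchy-Schwarz bound, and classifies with the midpoint rule.

     ```python
     import numpy as np

     rng = np.random.default_rng(1)
     # Hypothetical samples from two populations with a common covariance (illustration only).
     X1 = rng.multivariate_normal([1.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=40)
     X2 = rng.multivariate_normal([-1.0, 1.0], [[1.0, 0.3], [0.3, 1.0]], size=50)

     xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
     n1, n2 = len(X1), len(X2)
     S = ((n1 - 1) * np.cov(X1, rowvar=False) + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)

     a = np.linalg.solve(S, xbar1 - xbar2)     # Fisher's coefficient vector S^{-1}(xbar1 - xbar2)
     y1, y2 = X1 @ a, X2 @ a                   # projections y = a^T x
     sep2 = (y1.mean() - y2.mean()) ** 2 / (a @ S @ a)
     bound = (xbar1 - xbar2) @ np.linalg.solve(S, xbar1 - xbar2)
     print(np.isclose(sep2, bound))            # the Cauchy-Schwarz upper bound is attained

     midpoint = 0.5 * a @ (xbar1 + xbar2)
     def classify(x0):
         """Allocate x0 to pi_1 if its projection lies on the pi_1 side of the midpoint."""
         return 1 if a @ x0 >= midpoint else 2
     ```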

  4. Classification with several populations \(\pi_i, i=1,2,\cdots,g\): population \(\pi_i\) has prior probability \(p_i\), the cost of allocating an item to \(\pi_k\) when it belongs to \(\pi_i\) is \(c(k|i), k,i=1,2,\cdots,g\), and the probability of classifying an item as \(\pi_k\) when it belongs to \(\pi_i\) is \(P(k|i)=\int_{R_k}f_i(\mathbf x)d\mathbf x\). Then the conditional Expected Cost of Misclassification (ECM) given that the item comes from \(\pi_k\) is \[\text{ECM}(k)=P(1|k)c(1|k)+P(2|k)c(2|k)+\cdots+P(k-1|k)c(k-1|k)+P(k+1|k)c(k+1|k)+\cdots+P(g|k)c(g|k)\\ =\sum_{j=1, j\ne k}^{g}P(j|k)c(j|k)\] Multiplying each conditional ECM by its prior probability and summing gives the overall ECM \[\text{ECM}=\sum_{k=1}^{g}p_k\text{ECM}(k)=\sum_{k=1}^{g}p_k\Biggl(\sum_{j=1, j\ne k}^{g}P(j|k)c(j|k)\Biggr)\] We choose the classification regions \(R_k\) to minimize the total ECM: assign a new observation \(\mathbf x\) to the population \(\pi_k\) with the smallest \[\sum_{j=1, j\ne k}^{g}p_jf_j(\mathbf x)c(k|j)\] or, when all the misclassification costs are equal, the largest \[p_kf_k(\mathbf x)\] When the \(f_i(\mathbf x)\) are normal densities \[ f_i(\mathbf x)=\frac{1}{(2\pi)^{p/2}|\mathbf\Sigma_i|^{1/2}}e^{-\frac{1}{2}(\mathbf x-\boldsymbol \mu_i)^T\mathbf\Sigma_i^{-1}(\mathbf x-\boldsymbol \mu_i)}\] we allocate \(\mathbf x\) to the \(\pi_k\) for which \[\ln(p_kf_k(\mathbf x))=\ln(p_k)-\frac{p}{2}\ln(2\pi)-\frac{1}{2}\ln|\mathbf\Sigma_k|-\frac{1}{2}(\mathbf x-\boldsymbol \mu_k)^T\mathbf\Sigma_k^{-1}(\mathbf x-\boldsymbol \mu_k)\] is the largest. The quadratic discrimination score for the \(k^{th}\) population is defined as \[d_k^Q(\mathbf x)=\ln(p_k)-\frac{1}{2}\ln|\mathbf\Sigma_k|-\frac{1}{2}(\mathbf x-\boldsymbol \mu_k)^T\mathbf\Sigma_k^{-1}(\mathbf x-\boldsymbol \mu_k)\] so we allocate \(\mathbf x\) to the \(\pi_k\) with the largest quadratic score \(d_k^Q(\mathbf x)\). The estimated quadratic discrimination score is \[\hat{d}_k^Q(\mathbf x)=\ln(p_k)-\frac{1}{2}\ln|\mathbf S_k|-\frac{1}{2}(\mathbf x-\bar{\mathbf x}_k)^T\mathbf S_k^{-1}(\mathbf x-\bar{\mathbf x}_k)\] and we allocate \(\mathbf x\) to the \(\pi_k\) with the largest estimated quadratic score \(\hat{d}_k^Q(\mathbf x)\), as sketched in the code at the end of this item. If the population covariance matrices \(\mathbf\Sigma_k\) are all equal, the common covariance can be estimated by the pooled \[\mathbf S=\frac{1}{n_1+n_2+\cdots+n_g-g}\Biggl((n_1-1)\mathbf S_1+(n_2-1)\mathbf S_2+\cdots+(n_g-1)\mathbf S_g\Biggr)\] and the estimated discrimination score becomes \[\hat{d}_k^Q(\mathbf x)=\ln(p_k)-\frac{1}{2}\ln|\mathbf S|-\frac{1}{2}(\mathbf x-\bar{\mathbf x}_k)^T\mathbf S^{-1}(\mathbf x-\bar{\mathbf x}_k)\\ =\ln(p_k)-\frac{1}{2}\ln|\mathbf S|-\frac{1}{2}D_k^2(\mathbf x)\] with \(D_k^2(\mathbf x)=(\mathbf x-\bar{\mathbf x}_k)^T\mathbf S^{-1}(\mathbf x-\bar{\mathbf x}_k)\), the squared statistical distance between the new observation \(\mathbf x\) and the sample mean \(\bar{\mathbf x}_k\). We then allocate \(\mathbf x\) to the \(\pi_k\) whose score \(\hat{d}_k^Q(\mathbf x)\) is the largest or, when the prior probabilities are equal, whose squared distance \(D_k^2(\mathbf x)\) is the smallest.
Or, equivalently, we allocate \(\mathbf x\) to \(\pi_k\) when \[\hat{d}_k^Q(\mathbf x)-\hat{d}_i^Q(\mathbf x)=\ln\Bigl(\frac{p_k}{p_i}\Bigr)-\frac{1}{2}\Biggl[(\mathbf x-\bar{\mathbf x}_k)^T\mathbf S^{-1}(\mathbf x-\bar{\mathbf x}_k)-(\mathbf x-\bar{\mathbf x}_i)^T\mathbf S^{-1}(\mathbf x-\bar{\mathbf x}_i)\Biggr]\\ =\ln\Bigl(\frac{p_k}{p_i}\Bigr)-\Biggl[-\bar{\mathbf x}_k^T\mathbf S^{-1}\mathbf x+\frac{1}{2}\bar{\mathbf x}_k^T\mathbf S^{-1}\bar{\mathbf x}_k+\bar{\mathbf x}_i^T\mathbf S^{-1}\mathbf x-\frac{1}{2}\bar{\mathbf x}_i^T\mathbf S^{-1}\bar{\mathbf x}_i\Biggr]\\ =\ln\Bigl(\frac{p_k}{p_i}\Bigr)+\Biggl[(\bar{\mathbf x}_k-\bar{\mathbf x}_i)^T\mathbf S^{-1}\mathbf x-\frac{1}{2}(\bar{\mathbf x}_k-\bar{\mathbf x}_i)^T\mathbf S^{-1}(\bar{\mathbf x}_k+\bar{\mathbf x}_i)\Biggr]\ge 0\] for all \(i=1,2,\cdots,g\) (the common quadratic term \(\mathbf x^T\mathbf S^{-1}\mathbf x\) cancels in the difference).
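
     A minimal NumPy sketch of the estimated quadratic scores \(\hat{d}_k^Q(\mathbf x)\) for \(g=3\) hypothetical populations; all means, covariance matrices, priors, and the test point are made up for illustration.

     ```python
     import numpy as np

     def quadratic_scores(x, xbars, S_list, priors):
         """Estimated quadratic discrimination scores d_k^Q(x) for each population."""
         scores = []
         for xbar, S, p in zip(xbars, S_list, priors):
             diff = x - xbar
             d = (np.log(p)
                  - 0.5 * np.log(np.linalg.det(S))
                  - 0.5 * diff @ np.linalg.solve(S, diff))
             scores.append(d)
         return np.array(scores)

     # Hypothetical three-group example (all values made up).
     xbars  = [np.array([0.0, 0.0]), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
     S_list = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]]), np.eye(2) * 0.5]
     priors = [0.5, 0.3, 0.2]

     x_new = np.array([2.5, 0.4])
     scores = quadratic_scores(x_new, xbars, S_list, priors)
     print(np.argmax(scores) + 1)   # allocate x_new to the population with the largest score
     ```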

  5. Fisher’s approach to classification with \(g\) populations sharing the same full-rank covariance matrix \(\mathbf\Sigma\): a fixed linear combination of the \(n_i\) observations of the multivariate random variable from \(\pi_i, i=1,2,\cdots,g\) is \[\underset{(n_i\times p)}{\mathbf X_i}\underset{(p\times 1)}{\mathbf a}=\begin{bmatrix} \mathbf x_{i1}^T\\ \mathbf x_{i2}^T\\ \vdots\\ \mathbf x_{in_i}^T \end{bmatrix}\mathbf a=\begin{bmatrix} y_{i1}\\ y_{i2}\\ \vdots\\ y_{in_i}\\ \end{bmatrix}=Y_i\] With \(E(\mathbf X_i)=\boldsymbol\mu_i\) and \(Cov(\mathbf X_i)=\mathbf\Sigma\), we have \(E(Y_i)=\mu_{iY}=\mathbf a^T\boldsymbol\mu_i\) and \(Cov(Y_i)=\mathbf a^T\mathbf\Sigma\mathbf a\), which is the same for all populations. The overall mean of all populations is \[\bar{\boldsymbol\mu}=\frac{1}{g}\sum_{i=1}^{g}\boldsymbol\mu_i\] and the overall mean of all \(Y_i\) is \[\bar{\mu}_Y=\mathbf a^T\bar{\boldsymbol\mu}\] The squared separation is then \[\text{separation}^2=\frac{\displaystyle\sum_{i=1}^{g}(\mu_{iY}-\bar{\mu}_Y)^2}{\sigma_Y^2}=\frac{\displaystyle\sum_{i=1}^{g}(\mathbf a^T\boldsymbol\mu_i-\mathbf a^T\bar{\boldsymbol\mu})^2}{\mathbf a^T\mathbf \Sigma\mathbf a}=\frac{\mathbf a^T\Biggl(\displaystyle\sum_{i=1}^{g}(\boldsymbol\mu_i-\bar{\boldsymbol\mu})(\boldsymbol\mu_i-\bar{\boldsymbol\mu})^T\Biggr)\mathbf a}{\mathbf a^T\mathbf \Sigma\mathbf a}\] The squared separation measures the variability between the groups of \(Y\)-values relative to the common variability within groups, and we select \(\mathbf a\) to maximize this ratio. For the sample mean vectors \[\bar{\mathbf x}_i=\frac{1}{n_i}\sum_{j=1}^{n_i}\mathbf x_{ij}\] the overall mean vector is \[\bar{\mathbf x}=\frac{1}{g}\sum_{i=1}^{g}\bar{\mathbf x}_i\] and \[\mathbf S=\frac{\displaystyle\sum_{i=1}^{g}\sum_{j=1}^{n_i}(\mathbf x_{ij}-\bar{\mathbf x}_i)(\mathbf x_{ij}-\bar{\mathbf x}_i)^T}{n_1+n_2+\cdots+n_g-g}\] is the pooled estimate of \(\mathbf\Sigma\). The estimated squared separation is \[\text{separation}^2=\frac{\mathbf a^T\Biggl(\displaystyle\sum_{i=1}^{g}(\bar{\mathbf x}_i-\bar{\mathbf x})(\bar{\mathbf x}_i-\bar{\mathbf x})^T\Biggr)\mathbf a}{\mathbf a^T\mathbf S\mathbf a}\] or, equivalently, \[\frac{\text{separation}^2}{n_1+n_2+\cdots+n_g-g}=\frac{\mathbf a^T\Biggl(\displaystyle\sum_{i=1}^{g}(\bar{\mathbf x}_i-\bar{\mathbf x})(\bar{\mathbf x}_i-\bar{\mathbf x})^T\Biggr)\mathbf a}{\mathbf a^T\Biggl(\displaystyle\sum_{i=1}^{g}\sum_{j=1}^{n_i}(\mathbf x_{ij}-\bar{\mathbf x}_i)(\mathbf x_{ij}-\bar{\mathbf x}_i)^T\Biggr)\mathbf a}\] Let \((\lambda_1, \mathbf e_1),(\lambda_2, \mathbf e_2),\cdots,(\lambda_s, \mathbf e_s), s\le \text{min}(g-1,p)\), be the eigenvalue-eigenvector pairs of the matrix \[\Biggl(\displaystyle\sum_{i=1}^{g}\sum_{j=1}^{n_i}(\mathbf x_{ij}-\bar{\mathbf x}_i)(\mathbf x_{ij}-\bar{\mathbf x}_i)^T\Biggr)^{-1}\Biggl(\displaystyle\sum_{i=1}^{g}(\bar{\mathbf x}_i-\bar{\mathbf x})(\bar{\mathbf x}_i-\bar{\mathbf x})^T\Biggr)\] Then the vector of coefficients \(\mathbf a\) that maximizes the ratio \[\frac{\mathbf a^T\Biggl(\displaystyle\sum_{i=1}^{g}(\bar{\mathbf x}_i-\bar{\mathbf x})(\bar{\mathbf x}_i-\bar{\mathbf x})^T\Biggr)\mathbf a}{\mathbf a^T\Biggl(\displaystyle\sum_{i=1}^{g}\sum_{j=1}^{n_i}(\mathbf x_{ij}-\bar{\mathbf x}_i)(\mathbf x_{ij}-\bar{\mathbf x}_i)^T\Biggr)\mathbf a}\] is given by \(\mathbf e_1\); the linear combination \(\mathbf e_1^T\mathbf x\) is called the first sample discriminant, and in general the linear combination \(\mathbf e_k^T\mathbf x\) is called the \(k^{th}\) sample discriminant, \(k\le s\).
\(\mathbf W=\displaystyle\sum_{i=1}^{g}\sum_{j=1}^{n_i}(\mathbf x_{ij}-\bar{\mathbf x}_i)(\mathbf x_{ij}-\bar{\mathbf x}_i)^T\) is the sample within-groups matrix and \(\mathbf B=\displaystyle\sum_{i=1}^{g}(\bar{\mathbf x}_i-\bar{\mathbf x})(\bar{\mathbf x}_i-\bar{\mathbf x})^T\) is the sample between-groups matrix. Let \[\mathbf Y=\begin{bmatrix} \mathbf e_1^T\mathbf x\\ \mathbf e_2^T\mathbf x\\ \vdots\\ \mathbf e_s^T\mathbf x\\ \end{bmatrix}\] \((s\le \text{min}(g-1,p))\) contain all the sample discriminants. The population \(\mathbf X_i\) with \(n_i\) observations then has sample discriminants \[\mathbf Y_i=\underset{(s\times p)}{\begin{bmatrix} \mathbf e_1^T\\ \mathbf e_2^T\\ \vdots\\ \mathbf e_s^T\\ \end{bmatrix}}\underset{(p\times n_i)}{\mathbf X_i^T}\] which is an \((s\times n_i)\) matrix with mean vector \[\boldsymbol\mu_{iY}=\begin{bmatrix} \mu_{iY_1}\\ \mu_{iY_2}\\ \vdots\\ \mu_{iY_s}\\ \end{bmatrix}=\begin{bmatrix} \mathbf e_1^T\\ \mathbf e_2^T\\ \vdots\\ \mathbf e_s^T\\ \end{bmatrix}\boldsymbol\mu_i=\begin{bmatrix} \mathbf e_1^T\boldsymbol\mu_i\\ \mathbf e_2^T\boldsymbol\mu_i\\ \vdots\\ \mathbf e_s^T\boldsymbol\mu_i\\ \end{bmatrix}\] The squared distance from a column \(\mathbf y\) of \(\mathbf Y_i\) to its column mean \(\boldsymbol\mu_{iY}\) is \[(\mathbf y-\boldsymbol\mu_{iY})^T(\mathbf y-\boldsymbol\mu_{iY})=\sum_{j=1}^{s}[\mathbf a_j^T(\mathbf x-\boldsymbol\mu_i)]^2\] where \(\mathbf a_j=\mathbf e_j\) denotes the coefficient vector of the \(j^{th}\) discriminant, so we can assign \(\mathbf y\) to population \(\pi_k\) if the squared distance from \(\mathbf y\) to \(\boldsymbol\mu_{kY}\) is smaller than the squared distance from \(\mathbf y\) to \(\boldsymbol\mu_{iY}\) for \(i\ne k\): \[(\mathbf y-\boldsymbol\mu_{kY})^T(\mathbf y-\boldsymbol\mu_{kY})=\sum_{j=1}^{s}(y_j-\mu_{kY_j})^2=\sum_{j=1}^{s}[\mathbf a_j^T(\mathbf x-\boldsymbol\mu_k)]^2\le \sum_{j=1}^{s}[\mathbf a_j^T(\mathbf x-\boldsymbol\mu_i)]^2\] If we use only the first \(r, r\le s\), discriminants, Fisher’s classification procedure based on sample discriminants allocates \(\mathbf x\) to \(\pi_k\) if \[\sum_{j=1}^{r}(\hat{y}_j-\bar{y}_{kj})^2=\sum_{j=1}^{r}[\hat{\mathbf a}_j^T(\mathbf x-\bar{\mathbf x}_k)]^2\le \sum_{j=1}^{r}[\hat{\mathbf a}_j^T(\mathbf x-\bar{\mathbf x}_i)]^2 \quad\text{for all } i\ne k\] where the \(\hat{\mathbf a}_j\) are the eigenvectors of the \((p\times p)\) matrix \[\mathbf W^{-1}\mathbf B=\Biggl(\displaystyle\sum_{i=1}^{g}\sum_{j=1}^{n_i}(\mathbf x_{ij}-\bar{\mathbf x}_i)(\mathbf x_{ij}-\bar{\mathbf x}_i)^T\Biggr)^{-1}\Biggl(\displaystyle\sum_{i=1}^{g}(\bar{\mathbf x}_i-\bar{\mathbf x})(\bar{\mathbf x}_i-\bar{\mathbf x})^T\Biggr)\] conventionally scaled so that \(\hat{\mathbf a}_j^T\mathbf S\hat{\mathbf a}_j=1\), as illustrated in the sketch below.
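
     A minimal NumPy sketch of Fisher’s procedure for \(g=3\) simulated groups: it forms \(\mathbf W\) and \(\mathbf B\), takes the leading eigenvectors of \(\mathbf W^{-1}\mathbf B\) (scaled so that \(\hat{\mathbf a}_j^T\mathbf S\hat{\mathbf a}_j=1\)), and allocates by the smallest squared distance in the discriminant coordinates. All data are made up for illustration.

     ```python
     import numpy as np

     rng = np.random.default_rng(2)
     # Hypothetical samples from g = 3 groups sharing a common covariance (made-up data).
     groups = [rng.multivariate_normal(m, np.eye(3), size=30)
               for m in ([0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 3.0, 0.0])]
     g, p = len(groups), groups[0].shape[1]

     xbars = [X.mean(axis=0) for X in groups]
     xbar = np.mean(xbars, axis=0)                                 # unweighted mean of the group means

     W = sum((X - m).T @ (X - m) for X, m in zip(groups, xbars))   # within-groups matrix
     B = sum(np.outer(m - xbar, m - xbar) for m in xbars)          # between-groups matrix
     S = W / (sum(len(X) for X in groups) - g)                     # pooled covariance estimate

     # Eigenvectors of W^{-1} B give the discriminant directions a_1, ..., a_s.
     eigvals, eigvecs = np.linalg.eig(np.linalg.solve(W, B))
     order = np.argsort(eigvals.real)[::-1]
     s = min(g - 1, p)
     A = eigvecs.real[:, order[:s]]                                # columns a_1, ..., a_s
     A /= np.sqrt(np.sum(A * (S @ A), axis=0))                     # conventional scaling a^T S a = 1

     def classify(x, r=s):
         """Allocate x to the group whose mean is closest in the first r discriminant coordinates."""
         y = A[:, :r].T @ x
         dists = [np.sum((y - A[:, :r].T @ m) ** 2) for m in xbars]
         return int(np.argmin(dists)) + 1

     print(classify(np.array([2.8, 0.2, -0.1])))                   # near the second group mean, so expect 2
     ```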

  6. \(\displaystyle\sum_{i=1}^{g}(\boldsymbol\mu_i-\bar{\boldsymbol\mu})^T\boldsymbol\Sigma^{-1}(\boldsymbol\mu_i-\bar{\boldsymbol\mu})\) is the sum of the squared statistical distances from the population means \(\boldsymbol\mu_i\) to the centroid \(\bar{\boldsymbol\mu}\). Let \(\mathbf B_{\boldsymbol\mu}=\displaystyle\sum_{i=1}^{g}(\boldsymbol\mu_i-\bar{\boldsymbol\mu})(\boldsymbol\mu_i-\bar{\boldsymbol\mu})^T\) and let \((\lambda_1, \mathbf e_1),(\lambda_2, \mathbf e_2),\cdots,(\lambda_p, \mathbf e_p)\) be the eigenvalue-eigenvector pairs of the symmetric matrix \(\boldsymbol\Sigma^{-\frac{1}{2}}\mathbf B_{\boldsymbol\mu}\boldsymbol\Sigma^{-\frac{1}{2}}\) (its eigenvalues are the same as those of \(\boldsymbol\Sigma^{-1}\mathbf B_{\boldsymbol\mu}\)). Then the squared separation \[\text{separation}^2=\frac{\mathbf a^T\mathbf B_{\boldsymbol\mu}\mathbf a}{\mathbf a^T\boldsymbol\Sigma\mathbf a}=\frac{(\boldsymbol\Sigma^{\frac{1}{2}}\mathbf a)^T\boldsymbol\Sigma^{-\frac{1}{2}}\mathbf B_{\boldsymbol\mu}\boldsymbol\Sigma^{-\frac{1}{2}}(\boldsymbol\Sigma^{\frac{1}{2}}\mathbf a)}{(\boldsymbol\Sigma^{\frac{1}{2}}\mathbf a)^T(\boldsymbol\Sigma^{\frac{1}{2}}\mathbf a)}\] attains its successive (uncorrelated) maxima \(\lambda_1,\cdots,\lambda_p\) when \(\boldsymbol\Sigma^{\frac{1}{2}}\mathbf a=\mathbf e_1,\cdots,\mathbf e_p\), that is, when \(\mathbf a=\boldsymbol\Sigma^{-\frac{1}{2}}\mathbf e_1,\cdots,\boldsymbol\Sigma^{-\frac{1}{2}}\mathbf e_p\). The squared statistical distance of a new observation \(\mathbf x\) to the mean vector \(\boldsymbol\mu_i\) of population \(\pi_i\) is \[(\mathbf x-\boldsymbol\mu_i)^T\boldsymbol\Sigma^{-1}(\mathbf x-\boldsymbol\mu_i)=(\mathbf x-\boldsymbol\mu_i)^T\boldsymbol\Sigma^{-\frac{1}{2}}\boldsymbol\Sigma^{-\frac{1}{2}}(\mathbf x-\boldsymbol\mu_i)\\ =(\mathbf x-\boldsymbol\mu_i)^T\boldsymbol\Sigma^{-\frac{1}{2}}\mathbf E\mathbf E^T\boldsymbol\Sigma^{-\frac{1}{2}}(\mathbf x-\boldsymbol\mu_i)\] where \(\mathbf E=[\mathbf e_1,\mathbf e_2,\cdots,\mathbf e_p]\) is the orthogonal matrix whose columns are the eigenvectors of \(\boldsymbol\Sigma^{-\frac{1}{2}}\mathbf B_{\boldsymbol\mu}\boldsymbol\Sigma^{-\frac{1}{2}}\). With \(\mathbf a_j=\boldsymbol\Sigma^{-\frac{1}{2}}\mathbf e_j\), or \(\mathbf a_j^T=\mathbf e_j^T\boldsymbol\Sigma^{-\frac{1}{2}}\), \[\mathbf E^T\boldsymbol\Sigma^{-\frac{1}{2}}(\mathbf x-\boldsymbol\mu_i)=\begin{bmatrix} \mathbf a_1^T(\mathbf x-\boldsymbol\mu_i)\\ \mathbf a_2^T(\mathbf x-\boldsymbol\mu_i)\\ \vdots\\ \mathbf a_p^T(\mathbf x-\boldsymbol\mu_i)\\ \end{bmatrix}\] and \[(\mathbf x-\boldsymbol\mu_i)^T\boldsymbol\Sigma^{-\frac{1}{2}}\mathbf E\mathbf E^T\boldsymbol\Sigma^{-\frac{1}{2}}(\mathbf x-\boldsymbol\mu_i)=\begin{bmatrix} \mathbf a_1^T(\mathbf x-\boldsymbol\mu_i)\\ \mathbf a_2^T(\mathbf x-\boldsymbol\mu_i)\\ \vdots\\ \mathbf a_p^T(\mathbf x-\boldsymbol\mu_i)\\ \end{bmatrix}^T\begin{bmatrix} \mathbf a_1^T(\mathbf x-\boldsymbol\mu_i)\\ \mathbf a_2^T(\mathbf x-\boldsymbol\mu_i)\\ \vdots\\ \mathbf a_p^T(\mathbf x-\boldsymbol\mu_i)\\ \end{bmatrix}\\ =\sum_{j=1}^{p}[\mathbf a_j^T(\mathbf x-\boldsymbol\mu_i)]^2\] The first discriminant \(Y_1=\mathbf e_1^T\boldsymbol\Sigma^{-\frac{1}{2}}\mathbf X\) has mean \(\mu_{iY_1}=\mathbf e_1^T\boldsymbol\Sigma^{-\frac{1}{2}}\boldsymbol\mu_i\), so the squared spread of the group means about the central value \(\bar{\mu}_{Y_1}=\mathbf e_1^T\boldsymbol\Sigma^{-\frac{1}{2}}\bar{\boldsymbol\mu}\) is \[\sum_{i=1}^{g}(\mu_{iY_1}-\bar{\mu}_{Y_1})^2=\lambda_1\] and, summing over all discriminants, \[\sum_{j=1}^{p}\sum_{i=1}^{g}(\mu_{iY_j}-\bar{\mu}_{Y_j})^2=\sum_{j=1}^{p}\lambda_j\]
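
     A small numeric check of the last identity, using made-up population means and a made-up common covariance: the total squared spread of the discriminant means about their centre equals \(\sum_{j}\lambda_j\).

     ```python
     import numpy as np

     # Made-up population means and common covariance (for illustration only).
     mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0]), np.array([1.0, 3.0])]
     Sigma = np.array([[1.0, 0.4],
                       [0.4, 2.0]])
     mu_bar = np.mean(mus, axis=0)

     B_mu = sum(np.outer(m - mu_bar, m - mu_bar) for m in mus)
     w, V = np.linalg.eigh(Sigma)
     Sigma_inv_half = V @ np.diag(1.0 / np.sqrt(w)) @ V.T          # Sigma^{-1/2}
     M = Sigma_inv_half @ B_mu @ Sigma_inv_half                    # Sigma^{-1/2} B_mu Sigma^{-1/2}

     lams, E = np.linalg.eigh(M)                                   # orthogonal eigenvectors e_1, ..., e_p
     A = Sigma_inv_half @ E                                        # coefficient vectors a_j = Sigma^{-1/2} e_j

     # Squared spread of the discriminant group means about their centre, summed over discriminants.
     spread = sum(np.sum((A.T @ (m - mu_bar)) ** 2) for m in mus)
     print(np.isclose(spread, lams.sum()))                         # equals the sum of the eigenvalues
     ```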