
Factor analysis

Let \(\mathbf X\) be drawn from a \(p\)-variate normal distribution \(N_p(\boldsymbol\mu, \boldsymbol\Sigma)\). The matrix of factor loadings is \[\mathbf L=\begin{bmatrix} \ell_{11}&\ell_{12}&\cdots&\ell_{1m}\\ \ell_{21}&\ell_{22}&\cdots&\ell_{2m}\\ \vdots&\vdots&\ddots&\vdots\\ \ell_{p1}&\ell_{p2}&\cdots&\ell_{pm}\\ \end{bmatrix}\] where \(\ell_{ij}\) is the loading of the \(i^{th}\) variable on the \(j^{th}\) factor.

The vector of common factors is \[\mathbf F=\begin{bmatrix} F_1\\ F_2\\ \vdots\\ F_m\\ \end{bmatrix}\] with \(E(\mathbf F)=\underset{(m\times 1)}{\mathbf0}\), \(Var(F_j)=1,\quad (j=1,2,\cdots,m)\) and \(Cov(\mathbf F)=E(\mathbf F\mathbf F^T)=\underset{(m\times m)}{\mathbf I}\). The orthogonal factor model is then \[\underset{(p\times1)}{\mathbf X-\boldsymbol\mu}=\underset{(p\times m)}{\mathbf L}\underset{(m\times1)}{\mathbf F}+\underset{(p\times1)}{\boldsymbol\epsilon}\] with \(E(\boldsymbol\epsilon)=\underset{(p\times 1)}{\mathbf0}\) and \[Cov(\boldsymbol\epsilon)=E(\boldsymbol\epsilon\boldsymbol\epsilon^T)=\boldsymbol\Psi=\begin{bmatrix} \psi_1&0&\cdots&0\\ 0&\psi_2&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\cdots&\psi_p\\ \end{bmatrix}\] where \(Var(\epsilon_i)=\psi_i\), and \(\mathbf F\) and \(\boldsymbol\epsilon\) are independent, so \(Cov(\boldsymbol\epsilon,\mathbf F)=E(\boldsymbol\epsilon\mathbf F^T)=\underset{(p\times m)}{\mathbf0}\). Because \(\mathbf L\) is a matrix of constants, \[\begin{align} \boldsymbol\Sigma=Cov(\mathbf X)&=E(\mathbf X-\boldsymbol\mu)(\mathbf X-\boldsymbol\mu)^T\\ &=E(\mathbf L\mathbf F+\boldsymbol\epsilon)(\mathbf L\mathbf F+\boldsymbol\epsilon)^T\\ &=E(\mathbf L\mathbf F+\boldsymbol\epsilon)((\mathbf L\mathbf F)^T+\boldsymbol\epsilon^T)\\ &=E\Bigl(\mathbf L\mathbf F(\mathbf L\mathbf F)^T+\boldsymbol\epsilon(\mathbf L\mathbf F)^T+\mathbf L\mathbf F\boldsymbol\epsilon^T+\boldsymbol\epsilon\boldsymbol\epsilon^T\Bigr)\\ &=\mathbf LE(\mathbf F\mathbf F^T)\mathbf L^T+\mathbf0+\mathbf0+E(\boldsymbol\epsilon\boldsymbol\epsilon^T)\\ &=\mathbf L\mathbf L^T+\boldsymbol\Psi \end{align}\] or, element-wise, \[Var(X_i)=\underset{Var(X_i)}{\underbrace{\sigma_{ii}}}=\mathbf L_i\mathbf L_i^T+\psi_i=\underset{\text{communality}}{\underbrace{\ell_{i1}^2+\ell_{i2}^2+\cdots+\ell_{im}^2}}+\underset{\text{specific variance}}{\underbrace{\psi_i}}\] where \(\mathbf L_i\) is the \(i^{th}\) row of \(\mathbf L\). We can denote the \(i^{th}\) communality as \(h_i^2=\ell_{i1}^2+\ell_{i2}^2+\cdots+\ell_{im}^2,\quad (i=1,2,\cdots,p)\), which is the sum of squares of the loadings of the \(i^{th}\) variable on the \(m\) common factors, and the total variance of the \(i^{th}\) variable is the sum of the communality and the specific variance, \(\sigma_{ii}=h_i^2+\psi_i\). Similarly, \[Cov(X_i,X_k)=E(\mathbf L_i^T\mathbf F+\epsilon_i)(\mathbf L_k^T\mathbf F+\epsilon_k)^T=\mathbf L_i^T\mathbf L_k=\ell_{i1}\ell_{k1}+\ell_{i2}\ell_{k2}+\cdots+\ell_{im}\ell_{km}\] and \[Cov(\mathbf X,\mathbf F)=E(\mathbf X-\boldsymbol\mu)\mathbf F^T=E(\mathbf L\mathbf F+\boldsymbol\epsilon)\mathbf F^T=\mathbf LE(\mathbf F\mathbf F^T)+E(\boldsymbol\epsilon\mathbf F^T)=\mathbf L\] or, element-wise, \[Cov(X_i,F_j)=E(X_i-\mu_i)F_j=E(\mathbf L_i^T\mathbf F+\epsilon_i)F_j=\ell_{ij}\]
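A quick numerical check of the identity \(\boldsymbol\Sigma=\mathbf L\mathbf L^T+\boldsymbol\Psi\) can be run in R. This is a minimal sketch under made-up values: the loadings in `L` and the specific variances in `Psi` are purely illustrative and chosen so that each variable has unit variance.

set.seed(1)
L    <- matrix(c(0.9, 0.8, 0.3,
                 0.1, 0.3, 0.8), nrow = 3, ncol = 2)  # illustrative loadings (p = 3, m = 2)
Psi  <- diag(c(0.18, 0.27, 0.27))                     # specific variances psi_i
n    <- 1e5
Fmat <- matrix(rnorm(n * 2), n, 2)                    # common factors, Cov(F) = I
eps  <- matrix(rnorm(n * 3), n, 3) %*% sqrt(Psi)      # specific errors, Cov(eps) = Psi
X    <- Fmat %*% t(L) + eps                           # X - mu = L F + eps (mu = 0 here)
round(cov(X), 2)                                      # close to L %*% t(L) + Psi
round(L %*% t(L) + Psi, 2)
rowSums(L^2)                                          # communalities h_i^2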

  1. The Principal Component (and Principal Factor) Method:
    Let \(\boldsymbol\Sigma\) have eigenvalue–eigenvector pairs \((\lambda_i,\mathbf e_i)\) with \(\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_p\ge0\). Then \[\boldsymbol\Sigma=\sum_{i=1}^{p}\lambda_i\mathbf e_i\mathbf e_i^T=\begin{bmatrix} \sqrt{\lambda_1}\mathbf e_1&\sqrt{\lambda_2}\mathbf e_2&\cdots&\sqrt{\lambda_p}\mathbf e_p\\ \end{bmatrix}\begin{bmatrix} \sqrt{\lambda_1}\mathbf e_1^T\\ \sqrt{\lambda_2}\mathbf e_2^T\\ \vdots\\ \sqrt{\lambda_p}\mathbf e_p^T\\ \end{bmatrix}\] This factor analysis model has as many factors as variables \((m=p)\) and specific variances \(\psi_i=0\) for all \(i\); the loading matrix \(\mathbf L\) has \(j^{th}\) column \(\sqrt{\lambda_j}\mathbf e_j\) and \(\boldsymbol\Sigma=\mathbf L\mathbf L^T\). When the last \(p-m\) eigenvalues are small, we can neglect them, so that \[\boldsymbol\Sigma\approx\sum_{i=1}^{m}\lambda_i\mathbf e_i\mathbf e_i^T+\boldsymbol\Psi=\begin{bmatrix} \sqrt{\lambda_1}\mathbf e_1&\sqrt{\lambda_2}\mathbf e_2&\cdots&\sqrt{\lambda_m}\mathbf e_m\\ \end{bmatrix}\begin{bmatrix} \sqrt{\lambda_1}\mathbf e_1^T\\ \sqrt{\lambda_2}\mathbf e_2^T\\ \vdots\\ \sqrt{\lambda_m}\mathbf e_m^T\\ \end{bmatrix}+\boldsymbol\Psi\\ =\underset{(p\times m)}{\mathbf L}\underset{(m\times p)}{\mathbf L^T}+\begin{bmatrix} \psi_1&0&\cdots&0\\ 0&\psi_2&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\cdots&\psi_p\\ \end{bmatrix}\] where \[\psi_i=\sigma_{ii}-\sum_{j=1}^{m}\ell_{ij}^2,\quad (i=1,2,\cdots,p)\] When the sample covariance matrix \(\mathbf S\) has eigenvalue–eigenvector pairs \((\hat{\lambda}_i,\hat{\mathbf e}_i)\) with \(\hat{\lambda}_1\ge\hat{\lambda}_2\ge\cdots\ge\hat{\lambda}_p\ge0\), \[\mathbf S\approx\begin{bmatrix} \sqrt{\hat{\lambda}_1}\hat{\mathbf e}_1&\sqrt{\hat{\lambda}_2}\hat{\mathbf e}_2&\cdots&\sqrt{\hat{\lambda}_m}\hat{\mathbf e}_m\\ \end{bmatrix}\begin{bmatrix} \sqrt{\hat{\lambda}_1}\hat{\mathbf e}_1^T\\ \sqrt{\hat{\lambda}_2}\hat{\mathbf e}_2^T\\ \vdots\\ \sqrt{\hat{\lambda}_m}\hat{\mathbf e}_m^T\\ \end{bmatrix}+\widetilde{\boldsymbol\Psi}\\ =\underset{(p\times m)}{\widetilde{\mathbf L}}\underset{(m\times p)}{\widetilde{\mathbf L}^T}+\begin{bmatrix} \widetilde{\psi}_1&0&\cdots&0\\ 0&\widetilde{\psi}_2&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\cdots&\widetilde{\psi}_p\\ \end{bmatrix}\] \(\widetilde{\mathbf L}\) is the matrix of estimated factor loadings and \[\widetilde{\psi}_i=s_{ii}-\sum_{j=1}^{m}\widetilde{\ell}_{ij}^2\] \[\widetilde{h}_i^2=\widetilde{\ell}_{i1}^2+\widetilde{\ell}_{i2}^2+\cdots+\widetilde{\ell}_{im}^2\] \[Var(X_i)=s_{ii}=\widetilde{h}_i^2+\widetilde{\psi}_i\] The residual matrix is \(\mathbf S-\widetilde{\mathbf L}\widetilde{\mathbf L}^T-\widetilde{\boldsymbol\Psi}\). The contribution to the total sample variance \(s_{11}+s_{22}+\cdots+s_{pp}=tr(\mathbf S)\) from the first common factor is \[\widetilde{\ell}_{11}^2+\widetilde{\ell}_{21}^2+\cdots+\widetilde{\ell}_{p1}^2=(\sqrt{\hat{\lambda}_1}\hat{\mathbf e}_1)^T(\sqrt{\hat{\lambda}_1}\hat{\mathbf e}_1)=\hat{\lambda}_1\] Then the proportion of total sample variance due to the \(j^{th}\) factor is \[\frac{\hat{\lambda}_j}{tr(\mathbf S)}\] The number of common factors in the model is increased until a suitable proportion of the total sample variance has been explained.
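As an illustration, the principal component solution can be computed directly from the eigen decomposition of the sample covariance matrix. This is a sketch only; the iris measurements and the choice \(m=2\) are assumptions for the example, not part of the derivation above.

S   <- cov(iris[, 1:4])
eig <- eigen(S)
m   <- 2
Ltilde   <- eig$vectors[, 1:m] %*% diag(sqrt(eig$values[1:m]))  # column j is sqrt(lambda_j) e_j
Psitilde <- diag(S) - rowSums(Ltilde^2)                         # specific variances psi_i
h2       <- rowSums(Ltilde^2)                                   # communalities h_i^2
eig$values[1:m] / sum(diag(S))                      # proportion of total sample variance per factor
round(S - Ltilde %*% t(Ltilde) - diag(Psitilde), 3) # residual matrix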

  2. If the factor model for the correlation matrix is \(\mathbf R=\mathbf L\mathbf L^T+\boldsymbol\Psi\), then the \(m\) common factors should account for the communality portions \(h_i^2\) of the diagonal elements \(r_{ii}=1=h_i^2+\psi_i\) as well as the off-diagonal elements of \(\mathbf R\). Then the reduced sample correlation matrix \(\mathbf R_r\) is: \[\mathbf R_r=\mathbf R-\boldsymbol\Psi=\begin{bmatrix} h_1^{*2}&r_{12}&\cdots&r_{1p}\\ r_{12}&h_2^{*2}&\cdots&r_{2p}\\ \vdots&\vdots&\ddots&\vdots\\ r_{1p}&r_{2p}&\cdots&h_p^{*2}\\ \end{bmatrix}\] where \(\psi_i^*\) are the specific variances and the \(i^{th}\) diagonal element of \(\mathbf R_r\) is \(h_i^{*2}=1-\psi^{*}_i\). The reduced sample correlation matrix \(\mathbf R_r\) should be accounted for by the \(m\) common factors, \(\mathbf R_r\doteq\mathbf L_r^*\mathbf L_r^{*T}\), with \[\mathbf L_r^*=\begin{bmatrix} \sqrt{\hat{\lambda}_1^*}\hat{\mathbf e}_1^*&\sqrt{\hat{\lambda}_2^*}\hat{\mathbf e}_2^*&\cdots&\sqrt{\hat{\lambda}_m^*}\hat{\mathbf e}_m^*\\ \end{bmatrix}\] and \[\psi_i^*=1-\sum_{j=1}^{m}\ell_{ij}^{*2}=1-\widetilde{h}_i^{*2}\] where \(\widetilde{h}_i^{*2}\) are the estimated communalities.
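A sketch of the corresponding computation on the reduced correlation matrix. The initial communality estimates below use squared multiple correlations, which is one common choice but not the only one; the iris data are again used only for illustration.

R    <- cor(iris[, 1:4])
h2_0 <- 1 - 1 / diag(solve(R))       # initial communalities (squared multiple correlations)
Rr   <- R
diag(Rr) <- h2_0                     # reduced correlation matrix R_r
eig  <- eigen(Rr)
m    <- 2
Lstar   <- eig$vectors[, 1:m] %*% diag(sqrt(pmax(eig$values[1:m], 0)))  # loadings L_r^*
psistar <- 1 - rowSums(Lstar^2)      # psi_i^* = 1 - h_i^{*2}
cbind(communality = rowSums(Lstar^2), psistar)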

  3. The maximum likelihood method: When the \(\mathbf X_j\) are a random sample from \(N_p(\boldsymbol\mu, \boldsymbol\Sigma)\), and \(\mathbf F_j\) and \(\boldsymbol\epsilon_j\) are jointly normal, then the observations \(\underset{(p\times1)}{\mathbf X_j-\boldsymbol\mu}=\underset{(p\times m)}{\mathbf L}\underset{(m\times1)}{\mathbf F_j}+\underset{(p\times1)}{\boldsymbol\epsilon_j}\) are normally distributed, where \(Cov(\mathbf X)=\boldsymbol\Sigma=\mathbf L\mathbf L^T+\boldsymbol\Psi\) is the covariance matrix for the \(m\) common factor model. Then the likelihood is \[\begin{align} L(\boldsymbol\mu, \boldsymbol\Sigma)&=\prod_{j=1}^{n}\Biggl[\frac{1}{(2\pi)^{p/2}|\mathbf\Sigma|^{1/2}}e^{-\frac{1}{2}(\mathbf x_j-\boldsymbol \mu)^T\mathbf\Sigma^{-1}(\mathbf x_j-\boldsymbol \mu)}\Biggr]\\ &=\frac{1}{(2\pi)^{\frac{np}{2}}|\mathbf\Sigma|^{n/2}}e^{-\frac{1}{2}\sum_{j=1}^{n}(\mathbf x_j-\boldsymbol \mu)^T\mathbf\Sigma^{-1}(\mathbf x_j-\boldsymbol \mu)}\\ &=\frac{1}{(2\pi)^{\frac{np}{2}}|\mathbf\Sigma|^{n/2}}e^{-\frac{1}{2}tr\Biggl[\mathbf\Sigma^{-1}\sum_{j=1}^{n}(\mathbf x_j-\boldsymbol \mu)(\mathbf x_j-\boldsymbol\mu)^T\Biggr]}\\ &=\frac{1}{(2\pi)^{\frac{np}{2}}|\mathbf\Sigma|^{n/2}}e^{-\frac{1}{2}tr\Biggl[\mathbf\Sigma^{-1}\Biggl(\sum_{j=1}^{n}(\mathbf x_j-\overline{\mathbf x})(\mathbf x_j-\overline{\mathbf x})^T+n(\overline{\mathbf x}-\boldsymbol\mu)(\overline{\mathbf x}-\boldsymbol\mu)^T\Biggr)\Biggr]}\\ \end{align}\] with \(\boldsymbol\Sigma=\mathbf L\mathbf L^T+\boldsymbol\Psi\). We make \(\mathbf L\) well defined by imposing the uniqueness condition \(\mathbf L^T\boldsymbol\Psi^{-1}\mathbf L=\boldsymbol\Delta\) with \(\boldsymbol\Delta\) a diagonal matrix. Maximizing the likelihood yields the estimators \(\hat{\mathbf L}\), \(\hat{\boldsymbol\Psi}\) and \(\hat{\boldsymbol\mu}=\bar{\mathbf x}\), subject to \(\hat{\mathbf L}^T\hat{\boldsymbol\Psi}^{-1}\hat{\mathbf L}=\boldsymbol\Delta\).
The maximum likelihood estimates of the communalities are \(\hat{h_i^2}=\hat{\ell}_{i1}^2+\hat{\ell}_{i2}^2+\cdots+\hat{\ell}_{im}^2 \quad(i=1,2,\cdots,p)\) and the proportion of total sample variance due to the \(j^{th}\) factor is \[\frac{\hat{\ell}_{1j}^2+\hat{\ell}_{2j}^2+\cdots+\hat{\ell}_{pj}^2}{tr(\mathbf S)}\] If the random variables are standardized as \[Z_i=\frac{X_i-\mu_i}{\sqrt{\sigma_{ii}}},\] or in matrix notation \(\mathbf Z=(\mathbf D^{\frac{1}{2}})^{-1}(\mathbf X-\boldsymbol\mu)\) where \[\mathbf D=\begin{bmatrix} \sigma_{11}&0&0&\cdots&0\\ 0&\sigma_{22}&0&\cdots&0\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 0&0&0&\cdots&\sigma_{pp}\\ \end{bmatrix}\] then \(E(\mathbf Z)=\mathbf0\) and \[Cov(\mathbf Z)=(\mathbf D^{-\frac{1}{2}})\boldsymbol\Sigma(\mathbf D^{-\frac{1}{2}})=\begin{bmatrix} \frac{\sigma_{11}}{\sqrt{\sigma_{11}}\sqrt{\sigma_{11}}}&\frac{\sigma_{12}}{\sqrt{\sigma_{11}}\sqrt{\sigma_{22}}}&\frac{\sigma_{13}}{\sqrt{\sigma_{11}}\sqrt{\sigma_{33}}}&\cdots&\frac{\sigma_{1p}}{\sqrt{\sigma_{11}}\sqrt{\sigma_{pp}}}\\ \frac{\sigma_{12}}{\sqrt{\sigma_{11}}\sqrt{\sigma_{22}}}&\frac{\sigma_{22}}{\sqrt{\sigma_{22}}\sqrt{\sigma_{22}}}&\frac{\sigma_{23}}{\sqrt{\sigma_{22}}\sqrt{\sigma_{33}}}&\cdots&\frac{\sigma_{2p}}{\sqrt{\sigma_{22}}\sqrt{\sigma_{pp}}}\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ \frac{\sigma_{1p}}{\sqrt{\sigma_{11}}\sqrt{\sigma_{pp}}}&\frac{\sigma_{2p}}{\sqrt{\sigma_{22}}\sqrt{\sigma_{pp}}}&\frac{\sigma_{3p}}{\sqrt{\sigma_{33}}\sqrt{\sigma_{pp}}}&\cdots&\frac{\sigma_{pp}}{\sqrt{\sigma_{pp}}\sqrt{\sigma_{pp}}}\\ \end{bmatrix}=\boldsymbol\rho\\ =(\mathbf D^{-\frac{1}{2}}\mathbf L)(\mathbf D^{-\frac{1}{2}}\mathbf L)^T+\mathbf D^{-\frac{1}{2}}\boldsymbol\Psi\mathbf D^{-\frac{1}{2}}\] Thus \(\boldsymbol\rho\) can be factorized with loading matrix \(\mathbf L_z=\mathbf D^{-\frac{1}{2}}\mathbf L\) and specific variance matrix \(\boldsymbol\Psi_z=\mathbf D^{-\frac{1}{2}}\boldsymbol\Psi\mathbf D^{-\frac{1}{2}}\). If \(\hat{\mathbf D}^{-\frac{1}{2}}\) and \(\hat{\mathbf L}\) are the maximum likelihood estimators of \(\mathbf D^{-\frac{1}{2}}\) and \(\mathbf L\), then the maximum likelihood estimator of \(\boldsymbol\rho\) is \[\hat{\boldsymbol\rho}=(\hat{\mathbf D}^{-\frac{1}{2}}\hat{\mathbf L})(\hat{\mathbf D}^{-\frac{1}{2}}\hat{\mathbf L})^T+\hat{\mathbf D}^{-\frac{1}{2}}\hat{\boldsymbol\Psi}\hat{\mathbf D}^{-\frac{1}{2}}=\hat{\mathbf L}_z\hat{\mathbf L}_z^T+\hat{\boldsymbol\Psi}_z\] The proportion of total standardized sample variance due to the \(j^{th}\) factor is \[\frac{\hat{\ell}_{1j}^2+\hat{\ell}_{2j}^2+\cdots+\hat{\ell}_{pj}^2}{tr(\boldsymbol\rho)}=\frac{\hat{\ell}_{1j}^2+\hat{\ell}_{2j}^2+\cdots+\hat{\ell}_{pj}^2}{p}\]
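In R, maximum likelihood factor analysis is available through stats::factanal, which works on the correlation matrix, so the returned loadings are \(\hat{\mathbf L}_z\) for the standardized variables. A brief sketch on the iris measurements; with \(p=4\) variables, at most \(m=1\) factor gives a model with non-negative degrees of freedom.

ml_fit <- factanal(iris[, 1:4], factors = 1, rotation = "none")
ml_fit$loadings                       # estimated loadings (for the standardized variables)
ml_fit$uniquenesses                   # estimated specific variances psi_i
1 - ml_fit$uniquenesses               # estimated communalities h_i^2
colSums(ml_fit$loadings^2) / 4        # proportion of standardized variance per factor (p = 4)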

  4. Test for the Number of Common Factors:
    To test the adequacy of the \(m\) common factor model \[\underset{(p\times1)}{\mathbf X-\boldsymbol\mu}=\underset{(p\times m)}{\mathbf L}\underset{(m\times1)}{\mathbf F}+\underset{(p\times1)}{\boldsymbol\epsilon}\] the null hypothesis is \[H_0:\underset{(p\times p)}{\boldsymbol\Sigma}=Cov(\mathbf X)=E(\mathbf X-\boldsymbol\mu)(\mathbf X-\boldsymbol\mu)^T=\underset{(p\times m)}{\mathbf L}\underset{(m\times p)}{\mathbf L^T}+\boldsymbol\Psi\] versus \[H_1: \underset{(p\times p)}{\boldsymbol\Sigma}=Cov(\mathbf X)=\underset{(p\times q)}{\mathbf L}\underset{(q\times p)}{\mathbf L^T}+\boldsymbol\Psi, \quad (q\ne m)\] Under \(H_0\), the maximum likelihood estimate of \(\boldsymbol\Sigma\) is \(\hat{\boldsymbol\Sigma}=\hat{\mathbf L}\hat{\mathbf L}^T+\hat{\boldsymbol\Psi}\), where \(\hat{\mathbf L}\) and \(\hat{\boldsymbol\Psi}\) are the maximum likelihood estimates of \(\mathbf L\) and \(\boldsymbol\Psi\). Then the maximized likelihood is \[\frac{1}{(2\pi)^{\frac{np}{2}}|\mathbf\Sigma|^{n/2}}e^{-\frac{1}{2}tr\Biggl[\mathbf\Sigma^{-1}\Biggl(\sum_{j=1}^{n}(\mathbf x_j-\overline{\mathbf x})(\mathbf x_j-\overline{\mathbf x})^T+n(\overline{\mathbf x}-\boldsymbol\mu)(\overline{\mathbf x}-\boldsymbol\mu)^T\Biggr)\Biggr]}\\ =\frac{1}{(2\pi)^{\frac{np}{2}}|\hat{\mathbf L}\hat{\mathbf L}^T+\hat{\boldsymbol\Psi}|^{n/2}}e^{-\frac{1}{2}tr\Biggl[(\hat{\mathbf L}\hat{\mathbf L}^T+\hat{\boldsymbol\Psi})^{-1}\Biggl(\sum_{j=1}^{n}(\mathbf x_j-\overline{\mathbf x})(\mathbf x_j-\overline{\mathbf x})^T\Biggr)\Biggr]}\\ =\frac{1}{(2\pi)^{\frac{np}{2}}|\hat{\mathbf L}\hat{\mathbf L}^T+\hat{\boldsymbol\Psi}|^{n/2}}e^{-\frac{1}{2}tr\Biggl[(\hat{\mathbf L}\hat{\mathbf L}^T+\hat{\boldsymbol\Psi})^{-1}n\mathbf S_n\Biggr]}\] since the maximum likelihood estimate of \(\boldsymbol\mu\) is \(\hat{\boldsymbol\mu}=\bar{\mathbf x}\), which makes the second term in the trace vanish, and \(n\mathbf S_n=\sum_{j=1}^{n}(\mathbf x_j-\overline{\mathbf x})(\mathbf x_j-\overline{\mathbf x})^T\).
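factanal reports a likelihood ratio test of the hypothesis that \(m\) common factors are sufficient (a Bartlett-corrected chi-square statistic, tested against an unconstrained covariance matrix). A sketch, reusing the iris fit from above:

ml_fit <- factanal(iris[, 1:4], factors = 1)
ml_fit$STATISTIC   # test statistic
ml_fit$dof         # degrees of freedom, ((p - m)^2 - (p + m)) / 2
ml_fit$PVAL        # small p-values suggest that m common factors are not sufficient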

  5. Factor Rotation
    The \((p\times m)\) matrix of rotated loadings is \(\hat{\mathbf L}^*=\hat{\mathbf L}\mathbf T\), where \(\hat{\mathbf L}\) is the \((p\times m)\) matrix of estimated factor loadings and \(\mathbf T\mathbf T^T=\mathbf T^T\mathbf T=\mathbf I\). If \(\beta\) is the rotation angle, then for a point \((x,y)=(\cos\alpha,\sin\alpha)\) on the unit circle, \[\cos(\alpha+\beta)=\cos\alpha \cos\beta-\sin\alpha \sin\beta=x\cos\beta-y\sin\beta=\begin{bmatrix} x&y \end{bmatrix}\begin{bmatrix} \cos\beta\\ -\sin\beta \end{bmatrix}\] and \[\sin(\alpha+\beta)=\sin\alpha \cos\beta+\cos\alpha \sin\beta=x\sin\beta+y\cos\beta=\begin{bmatrix} x&y \end{bmatrix}\begin{bmatrix} \sin\beta\\ \cos\beta \end{bmatrix}\] Then \[\mathbf T=\begin{bmatrix} \cos\beta&\sin\beta\\ -\sin\beta&\cos\beta \end{bmatrix}\] which denotes a counterclockwise rotation of the \((x,y)\) points or, equivalently, a clockwise rotation of the coordinate axes. Then \[\hat{\boldsymbol\Sigma}=\hat{\mathbf L}\hat{\mathbf L}^T+\hat{\boldsymbol\Psi}=\hat{\mathbf L}\mathbf T\mathbf T^T\hat{\mathbf L}^T+\hat{\boldsymbol\Psi}=\hat{\mathbf L}^*\hat{\mathbf L}^{*T}+\hat{\boldsymbol\Psi},\] which indicates that the residual matrix remains unchanged, \[\mathbf S_n-\hat{\mathbf L}\hat{\mathbf L}^T-\hat{\boldsymbol\Psi}=\mathbf S_n-\hat{\mathbf L}^*\hat{\mathbf L}^{*T}-\hat{\boldsymbol\Psi}\] and the specific variances \(\hat{\psi}_i\) and the communalities \(\hat{h}_i^2\) are unaltered. The varimax (or normal varimax) criterion is \[V=\frac{1}{p}\sum_{j=1}^{m}\Biggl[\sum_{i=1}^{p}\Biggl(\frac{\hat{\ell}_{ij}^*}{\hat{h}_i}\Biggr)^4-\Biggl(\sum_{i=1}^{p}\frac{\hat{\ell}_{ij}^{*2}}{\hat{h}_i^2}\Biggr)^2/p\Biggr]\] and we select the orthogonal transformation \(\mathbf T\) that makes \(V\) as large as possible.
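stats::varimax implements this criterion and returns both the rotated loadings \(\hat{\mathbf L}^*=\hat{\mathbf L}\mathbf T\) and the matrix \(\mathbf T\) itself. A sketch, rotating the two principal component loadings of the iris covariance matrix:

S    <- cov(iris[, 1:4])
eig  <- eigen(S)
Lhat <- eig$vectors[, 1:2] %*% diag(sqrt(eig$values[1:2]))  # unrotated loadings (m = 2)
vm   <- varimax(Lhat)
vm$loadings                   # rotated loadings hat L*
vm$rotmat                     # the orthogonal matrix T (here a 2 x 2 rotation)
cbind(rowSums(vm$loadings^2), rowSums(Lhat^2))   # communalities are unchanged by the rotation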

  6. Factor Scores
    The estimated values \(\hat{\mathbf f}_j,\quad j=1,2,\cdots,n,\) of the vector of common factors \[\mathbf F=\begin{bmatrix} F_1\\ F_2\\ \vdots\\ F_m\\ \end{bmatrix}\] for each of the \(n\) observations are called factor scores.
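For example, factanal can return the estimated factor scores directly; the regression and Bartlett (weighted least squares) estimators are both supported. A sketch on the iris fit used earlier:

fa_fit <- factanal(iris[, 1:4], factors = 1, scores = "regression")
head(fa_fit$scores)           # hat f_j, one row per observation, j = 1, ..., n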

  7. PCA and SVD decomposition
    Consider an \(n\times m\) data matrix \(\mathbf X\). The singular value decomposition is \(\underset{(n\times m)}{\mathbf X}=\underset{(n\times n)}{\mathbf U}\underset{(n\times m)}{\mathbf S}\underset{(m\times m)}{\mathbf V^T}\), where \(\mathbf V\) has as its columns the (normalized) eigenvectors of \(\mathbf X^T\mathbf X\). A rotation is a change of coordinates and amounts to writing the above equality as \[\mathbf X=\mathbf U\mathbf S(\mathbf T\mathbf T^T)\mathbf V^T=(\mathbf U\mathbf S\mathbf T)(\mathbf T^T\mathbf V^T)=\underset{(n\times m)}{\mathbf U^*}\underset{(m\times m)}{\mathbf V^*}\] with \(\mathbf T\) being an orthogonal matrix chosen to achieve a \(\mathbf V^*\) close to sparse (maximum contrast between entries). In practice we never rotate all PCs; rather, we consider a subset of \(k<m\) which still provides a decent rank-\(k\) approximation of \(\mathbf X\), \[\mathbf X\approx\mathbf U_k\mathbf S_k\mathbf V_k^T,\] so the rotated solution is now \[\mathbf X\approx(\underset{(n\times k)}{\mathbf U_k}\underset{(k\times k)}{\mathbf S_k}\underset{(k\times k)}{\mathbf T_k})(\underset{(k\times k)}{\mathbf T_k^T}\underset{(k\times m)}{\mathbf V_k^T})=\underset{(n\times k)}{\mathbf U_k^*}\underset{(k\times m)}{\mathbf V_k^*}\] where \(\mathbf V_k^*\) is a \(k\times m\) matrix.
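A short R check of the relation between the SVD and prcomp on the centered and standardized data (the signs of singular vectors are arbitrary, hence the comparisons up to absolute value):

X   <- scale(iris[, 1:4])             # centered and standardized data matrix
sv  <- svd(X)                         # X = U S V^T
pca <- prcomp(X)
max(abs(X - sv$u %*% diag(sv$d) %*% t(sv$v)))    # ~ 0: the SVD reproduces X
max(abs(abs(sv$v) - abs(pca$rotation)))          # V holds the eigenvectors of X^T X (up to sign)
max(abs(abs(sv$u %*% diag(sv$d)) - abs(pca$x)))  # PC scores equal U S (up to sign)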

If rotation is applied to the loadings, there are three easy ways to compute varimax-rotated PCs in R: (1) run PCA and varimax rotation in one step with psych::principal, (2) rotate the raw loadings from prcomp with varimax() and project the data onto the pseudo-inverse of the rotated loadings, or (3) rotate the standardized prcomp scores with the $rotmat matrix returned by varimax().

irisX <- iris[,1:4]      # Iris data
ncomp <- 2
irisX[1:5,]
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2
## 4          4.6         3.1          1.5         0.2
## 5          5.0         3.6          1.4         0.2
# (1) Run PCA on the 4 variables and keep the top 2 varimax-rotated PCs using psych::principal.
# Note that it returns standardized scores, i.e. all PCs have unit variance.
pca_iris_rotated <- psych::principal(irisX, rotate="varimax", nfactors=ncomp, scores=TRUE)
print(pca_iris_rotated$scores[1:5,])  # Scores returned by principal()
##            RC1        RC2
## [1,] -1.083475  0.9067262
## [2,] -1.377536 -0.2648876
## [3,] -1.419832  0.1165198
## [4,] -1.471607 -0.1474634
## [5,] -1.095296  1.0949536
# (2) Run PCA on the 4 variables with prcomp and rotate the loadings manually.
pca_iris <- prcomp(irisX, center=T, scale=T)
# $sdev holds the standard deviations of the principal components (i.e., the square roots of the
# eigenvalues of the covariance/correlation matrix, though the calculation is actually done with
# the singular values of the data matrix).
pca_iris$sdev
## [1] 1.7083611 0.9560494 0.3830886 0.1439265
sqrt(eigen(cor(irisX))$values)
## [1] 1.7083611 0.9560494 0.3830886 0.1439265
# $rotation is the matrix of variable loadings (i.e., a matrix whose columns contain the eigenvectors). The function princomp returns this in the element loadings.
pca_iris$rotation
##                     PC1         PC2        PC3        PC4
## Sepal.Length  0.5210659 -0.37741762  0.7195664  0.2612863
## Sepal.Width  -0.2693474 -0.92329566 -0.2443818 -0.1235096
## Petal.Length  0.5804131 -0.02449161 -0.1421264 -0.8014492
## Petal.Width   0.5648565 -0.06694199 -0.6342727  0.5235971
eigen(cor(irisX))$vectors
##            [,1]        [,2]       [,3]       [,4]
## [1,]  0.5210659 -0.37741762  0.7195664  0.2612863
## [2,] -0.2693474 -0.92329566 -0.2443818 -0.1235096
## [3,]  0.5804131 -0.02449161 -0.1421264 -0.8014492
## [4,]  0.5648565 -0.06694199 -0.6342727  0.5235971
# Raw loadings are the eigenvectors scaled by the square roots of the respective eigenvalues.
(rawLoadings <- pca_iris$rotation[,1:ncomp] %*% diag(pca_iris$sdev, ncomp, ncomp))
##                    [,1]        [,2]
## Sepal.Length  0.8901688 -0.36082989
## Sepal.Width  -0.4601427 -0.88271627
## Petal.Length  0.9915552 -0.02341519
## Petal.Width   0.9649790 -0.06399985
# We can use the varimax() function to rotate the raw loadings manually.
(rotatedLoadings <- varimax(rawLoadings, normalize = TRUE)$loadings)
## 
## Loadings:
##              [,1]   [,2]  
## Sepal.Length  0.959       
## Sepal.Width  -0.145 -0.985
## Petal.Length  0.944  0.304
## Petal.Width   0.932  0.257
## 
##                 [,1]  [,2]
## SS loadings    2.702 1.130
## Proportion Var 0.676 0.283
## Cumulative Var 0.676 0.958
# Then use the rotated loadings to obtain the scores: multiply the scaled data by the transposed
# pseudo-inverse of the rotated loadings (pracma::pinv()). This also yields standardized scores.
invLoadings <- t(pracma::pinv(rotatedLoadings))
scores <- scale(irisX) %*% invLoadings
print(scores[1:5,])                   # Scores computed via rotated loadings
##           [,1]       [,2]
## [1,] -1.083475 -0.9067262
## [2,] -1.377536  0.2648876
## [3,] -1.419832 -0.1165198
## [4,] -1.471607  0.1474634
## [5,] -1.095296 -1.0949536
pca_iris <- prcomp(irisX, center=T, scale=T)
rawLoadings <- pca_iris$rotation[,1:ncomp] %*% diag(pca_iris$sdev, ncomp, ncomp)
# (3) Use the varimax() function to rotate the loadings, then use the $rotmat rotation matrix to
# rotate the standardized scores obtained with prcomp.
#pca_iris$x contains the PCs 
scores <- scale(pca_iris$x[,1:2]) %*% varimax(rawLoadings, normalize = TRUE)$rotmat
print(scores[1:5,])                   # Scores computed via rotating the scores
##           [,1]       [,2]
## [1,] -1.083475 -0.9067262
## [2,] -1.377536  0.2648876
## [3,] -1.419832 -0.1165198
## [4,] -1.471607  0.1474634
## [5,] -1.095296 -1.0949536