--- title: "Simulation: effects of subgroup differences on principal component analyses" author: "Chris Evans" date: "Thursday, July 31, 2014" output: html_document --- ## Document format and generation## This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see . ## General introduction## This started as a way of reassuring myself that my anxieties about an exploratory factor analysis (EFA) of data from the COPES scale where the data came from 21 different therapeutic communities (TCs) and from staff and non-staff members of the those communities and included some repeat data as well as one off data _might_ find an EFA picture of the psychometric structure of the replies that relected more the differences between the units and perhaps the staff/non-staff difference and even perhaps the initial/later issue. It's turned into a bit of an experiment in using the latest "weaving" capabilities within Rmarkdown and Rstudio to mix text and code and generate HTML, PDF or Word documents. ## Statistical issue ## There's a simple bivariate equivalent of this EFA issue: that the correlation or regression beween two variables may be the same in two groups but the groups so different in mean values on the two variables that the overall correlation or regression between the two in the pooled dataset is quite different from that in each group. ```{r bivariate1, fig.width=10, fig.height=8} options(width=120) nGrps <- 2 nInGrp <- 500 grp2XInc <- 3 grp2YDec <- 7 ratio <- 3 slope <- 1.05 grp1x <- rnorm(nInGrp) grp1err <- rnorm(nInGrp)/ratio grp2x <- rnorm(nInGrp) + grp2XInc grp2err <- rnorm(nInGrp)/ratio grp1y1 <- slope*grp1x + grp1err grp2y <- slope*grp2x + grp2err - grp2YDec xMeanDiff <- mean(grp2x) - mean(grp1x) yMeanDiff <- mean(grp2y) - mean(grp1y1) grp1Pearson <- cor(grp1x,grp1y1) grp2Pearson <- cor(grp1x,grp1y1) x <- c(grp1x,grp2x) y <- c(grp1y1,grp2y) allPearson <- cor(x,y) xmin <- floor(min(x)) xmax <- ceiling(max(x)) ymin <- floor(min(y)) ymax <- ceiling(max(y)) par(col=1) plot(grp1x,grp1y1, xlim=c(xmin,xmax), ylim=c(ymin,ymax), xlab="x", ylab="y", type="n") par(col="blue") points(grp1x,grp1y1, xlim=c(-4,5), ylim=c(-15,4)) abline(lm(grp1y1 ~ grp1x)) par(col="red") points(grp2x,grp2y, xlim=c(-4,5), ylim=c(-15,4)) abline(lm(grp2y ~ grp2x)) par(col=1) abline(lm(y ~ x)) ``` So that used R's pseudorandom number generator to generate Gaussian x and y variables in `r nGrps` groups each with `r nInGrp` data points and with exactly the same relationship between the x and y variables but with a substantial mean shift on each variable between the groups. The population value difference is a x increase of `r grp2XInc` and a y decrease of `r grp2YDec`. The sample mean x difference is `r round(xMeanDiff,2)` and the y difference is `r round(yMeanDiff,2)`. The correlation between x and y in group 1 is `r round(grp1Pearson,2)` and that in group 2 is `r round(grp2Pearson,2)` but the overall Pearson correlation between x and y pooled across the two groups is very different at `r round(allPearson,2)`. It's pretty easy to see how this happens from the plot. ## OK, let's go to the simple multivariate case ## This is really very similar to the above but this time I have generated four more y variables within each group, still keeping the x/y relationship the same for all y and within both groups. However, I have arranged things so that the group differences for Y1, Y2 and Y3 are the same (and the same as they were in the bivariate example above) but the group differences for Y4 and Y5 are rather different. ```{r multivariate} ### group 1 grp1err2 <- rnorm(nInGrp)/ratio grp1y2 <- slope*grp1x + grp1err2 grp1err3 <- rnorm(nInGrp)/ratio grp1y3 <- slope*grp1x + grp1err3 grp1err4 <- rnorm(nInGrp)/ratio grp1y4 <- slope*grp1x + grp1err4 grp1err5 <- rnorm(nInGrp)/ratio grp1y5 <- slope*grp1x + grp1err5 ### group 2 grp2err2 <- rnorm(nInGrp)/ratio grp2YDec2 <- grp2YDec grp2y2 <- slope*grp2x + grp2err2 - grp2YDec2 grp2err3 <- rnorm(nInGrp)/ratio grp2YDec3 <- grp2YDec grp2y3 <- slope*grp2x + grp2err3 - grp2YDec3 grp2err4 <- rnorm(nInGrp)/ratio grp2YDec4 <- -4 grp2y4 <- slope*grp2x + grp2err4 - grp2YDec4 grp2err5 <- rnorm(nInGrp)/ratio grp2YDec5 <- -6 grp2y5 <- slope*grp2x + grp2err5 - grp2YDec5 grp1yAll <- cbind(grp1y1,grp1y2,grp1y3,grp1y4,grp1y5) grp2yAll <- cbind(grp2y,grp2y2,grp2y3,grp2y4,grp2y5) yAll <- rbind(grp1yAll,grp2yAll) ``` The plots below show the rather different overall x/y correlations this gives for Y4 and Y5 from Y1, Y2 and Y3. ```{r multivariate2, fig.width=10, fig.height=6.5} par(mfrow=c(1,5)) par(col=1) combPlotFun <- function(x,y,title,xmin,xmax,ymin,ymax) { par(col=1) nGrp <- length(x)/2 grp1x <- x[1:nGrp] grp1y1 <- y[1:nGrp] grp2x <- x[nGrp+1:length(x)] grp2y <- y[nGrp+1:length(y)] plot(x,y, xlim=c(xmin,xmax), ylim=c(ymin,ymax), xlab="x", ylab="y", main=title, type="n") par(col="blue") points(grp1x,grp1y1) abline(lm(y ~ x)) par(col="red") points(grp2x,grp2y, xlim=c(-4,5), ylim=c(-15,4)) abline(lm(grp2y ~ grp2x)) par(col=1) abline(lm(y ~ x)) } ymin <- floor(min(yAll)) ymax <- ceiling(max(yAll)) for (i in 1:5) { combPlotFun(x,yAll[,i],paste("Y",i,sep=""),xmin,xmax,ymin,ymax) } ``` I'm only going to use these five y variables in the EFA. I'm going to speed things up and keep them simple by actually doing a principal component analysis (PCA). There's ample evidence that the differences beween PCA loading matrices and those for "true EFAs" of whatever type will be small. Their correlation matrix for the five y variables within group 1 is: ```{r multivariate3, echo=FALSE} par(mfrow=c(1,1)) round(cor(grp1yAll),2) ``` and within group 2 it is: ```{r multivariate4, echo=FALSE} round(cor(grp2yAll),2) ``` But the correlation for the pooled data is rather different: ```{r multivariate5, echo=FALSE} round(cor(yAll),2) ``` Of course, this produces very different PCA results for the pooled data from the analyses for each group separately. This shows the scree plots for the two groups separately then for the compbined data. ```{r PCA1, fig.width=10, fig.height=5, echo=FALSE} grp1PCA <- prcomp(grp1yAll) grp2PCA <- prcomp(grp2yAll) allPCA <- prcomp(yAll) par(mfrow=c(1,3)) par(lwd=2) plot(grp1PCA,type="lines") plot(grp2PCA,type="lines") plot(allPCA,type="lines") par(lwd=1) par(mfrow=c(1,1)) ``` It's easy to see that the pooled data has a scree plot supporting a two component solution whereas the plots for the within group analyses show a very clear unicomponent solution is fine to capture the systematic covariance in the data. This is nice and clear in the loadings: here for group 1. ```{r PCA2} grp1PCA$rotation ``` And here for the pooled data. (I've omitted group 2 as it's essentially identical to group 1.) ```{r PCA3} allPCA$rotation ``` Varimax rotation of first two components from the pooled data: ```{r varimax1} varimax(allPCA$rotation[,1:2]) ``` # Discussion I thinks it's very clear that factor structure from a mixed group pooled sample can reflect group mean differences and show a factor structure that is different from that in any of the subgroups even when all data from each subgroup actually has the same factor structure (quite different from that emerging from analysis of the pooled data). Caveat extractor! I showed only the bivariate case in which there is the same linear regression linking the two variables within each group but a quite different regression emerges from (unthinking) analysis of the pooled data. It's clearly equally possible to have two different regressions in the two groups and a regression that has a different slope from either and it's possible to find a statistically non-significant and near zero regression/correlation if the effect of the group mean differences neatly cancels a shared correlation in both groups just as equal and opposite regression slopes in two samples will cancel to show a zero correlation in the pooled data (assuming equal sized groups). All manner of misleading findings are clearly possible. It's pretty easy to see the problems with sensible inspection of scattergrams in the bivariate case but hard to do so with multivariate EFA and with more than few groups.