Exploratory Factor Analysis

Example exercise: “Residential location satisfaction in the Lisbon metropolitan area”

The aim of this study was to examine households' perceptions of their residential location, considering several land use and accessibility factors as well as household socioeconomic and attitudinal characteristics.

Reference: Martinez, L. G., de Abreu e Silva, J., & Viegas, J. M. (2010). Assessment of residential location satisfaction in the Lisbon metropolitan area, TRB (No. 10-1161).

Your task: Analyze the data and create meaningful latent factors.

Data

Variables:

  • DWELCLAS: Classification of the dwelling;
  • INCOME: Income of the household;
  • CHILD13: Number of children under 13 years old;
  • H18: Number of household members above 18 years old;
  • HEMPLOY: Number of household members employed;
  • HSIZE: Household size;
  • IAGE: Age of the respondent;
  • ISEX: Sex of the respondent;
  • NCARS: Number of cars in the household;
  • AREA: Area of the dwelling;
  • BEDROOM: Number of bedrooms in the dwelling;
  • PARK: Number of parking spaces in the dwelling;
  • BEDSIZE: BEDROOM/HSIZE;
  • PARKSIZE: PARK/NCARS;
  • RAGE10: 1 if Dwelling age <= 10;
  • TCBD: Private car distance in time to CBD;
  • DISTHTC: Euclidean distance to heavy public transport system stops;
  • TWCBD: Private car distance in time of workplace to CBD;
  • TDWWK: Private car distance in time of dwelling to work place;
  • HEADH: 1 if Head of the Household;
  • POPDENS: Population density per hectare;
  • EDUINDEX: Number of undergraduate students/Population over 20 years old (500m)

Rules of thumb:

  • At least 10 variables
  • n < 50 (Unacceptable); n > 200 (recommended); 10 observations per variable or higher
  • It is recommended to use continuous variables. If your data contains categorical variables, you should transform them to dummy variables.
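
For illustration, a minimal sketch of dummy coding with model.matrix(), using a hypothetical tenure variable that is not part of this dataset:

tenure <- factor(c("own", "rent", "rent", "own", "social")) # hypothetical example values
model.matrix(~ tenure)[, -1] # 0/1 dummy columns; the first level acts as the reference category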

Assumptions:

  • Normality;
  • Linearity;
  • Homoscedasticity (some multicollinearity is desirable);
  • Correlations between variables: if most correlations are below 0.3, Factor Analysis is not appropriate.

Let’s start!

Import Libraries

library(foreign) # Library used to read SPSS files
library(nFactors) # Library used for factor analysis
library(tidyverse) # Library used in data science to perform exploratory data analysis
library(summarytools) # Library used for checking the summary of the dataset
library(psych) # Library used for factor analysis
library(GPArotation) # Library used for factor analysis
library(corrplot) # Library used for correlation analysis
library(car) #For linear regression and VIF calculation (testing multicollinearity)

Get to know your dataset

Import dataset
df <- read.spss("Data/example_fact.sav", to.data.frame = T) #transforms a list into a data.frame directly
Select only the ID and the variables described above
df = df[,1:24]
Set the ID as the row names (case number)
df<-data.frame(df, row.names = 1)
Check summary statistics of variables
descriptive_stats <- dfSummary(df)
view(descriptive_stats)

Data Frame Summary

df

Dimensions: 470 x 23
Duplicates: 0

| No | Variable | Stats / Values | Freqs (% of Valid) | Missing |
|---:|----------|----------------|--------------------|---------|
| 1 | DWELCLAS [numeric] | Mean (sd): 5.1 (1.3); min < med < max: 1 < 5 < 7; IQR (CV): 2 (0.2) | 1: 5 (1.1%), 2: 14 (3.0%), 3: 31 (6.6%), 4: 75 (16.0%), 5: 130 (27.7%), 6: 162 (34.5%), 7: 53 (11.3%) | 0 (0.0%) |
| 2 | INCOME [numeric] | Mean (sd): 4259.6 (3001.8); min < med < max: 700 < 2750 < 12500; IQR (CV): 2000 (0.7) | 700: 20 (4.3%), 1500: 96 (20.4%), 2750: 142 (30.2%), 4750: 106 (22.6%), 7500: 75 (16.0%), 12500: 31 (6.6%) | 0 (0.0%) |
| 3 | CHILD13 [numeric] | Mean (sd): 0.4 (0.8); min < med < max: 0 < 0 < 4; IQR (CV): 0 (2) | 0: 353 (75.1%), 1: 62 (13.2%), 2: 41 (8.7%), 3: 13 (2.8%), 4: 1 (0.2%) | 0 (0.0%) |
| 4 | H18 [numeric] | Mean (sd): 2.1 (0.9); min < med < max: 0 < 2 < 6; IQR (CV): 0.8 (0.4) | 0: 1 (0.2%), 1: 112 (23.8%), 2: 239 (50.9%), 3: 77 (16.4%), 4: 35 (7.4%), 5: 3 (0.6%), 6: 3 (0.6%) | 0 (0.0%) |
| 5 | HEMPLOY [numeric] | Mean (sd): 1.5 (0.7); min < med < max: 0 < 2 < 5; IQR (CV): 1 (0.5) | 0: 39 (8.3%), 1: 171 (36.4%), 2: 237 (50.4%), 3: 21 (4.5%), 4: 1 (0.2%), 5: 1 (0.2%) | 0 (0.0%) |
| 6 | HSIZE [numeric] | Mean (sd): 2.6 (1.3); min < med < max: 1 < 2 < 7; IQR (CV): 2 (0.5) | 1: 104 (22.1%), 2: 147 (31.3%), 3: 96 (20.4%), 4: 96 (20.4%), 5: 20 (4.3%), 6: 5 (1.1%), 7: 2 (0.4%) | 0 (0.0%) |
| 7 | AVADUAGE [numeric] | Mean (sd): 37.8 (9.9); min < med < max: 0 < 36 < 78; IQR (CV): 12.7 (0.3) | 126 distinct values | 0 (0.0%) |
| 8 | IAGE [numeric] | Mean (sd): 36.9 (11.6); min < med < max: 0 < 34 < 78; IQR (CV): 15 (0.3) | 53 distinct values | 0 (0.0%) |
| 9 | ISEX [numeric] | Min: 0; Mean: 0.5; Max: 1 | 0: 214 (45.5%), 1: 256 (54.5%) | 0 (0.0%) |
| 10 | NCARS [numeric] | Mean (sd): 1.7 (0.9); min < med < max: 0 < 2 < 5; IQR (CV): 1 (0.5) | 0: 23 (4.9%), 1: 182 (38.7%), 2: 193 (41.1%), 3: 56 (11.9%), 4: 13 (2.8%), 5: 3 (0.6%) | 0 (0.0%) |
| 11 | AREA [numeric] | Mean (sd): 133 (121.5); min < med < max: 30 < 110 < 2250; IQR (CV): 60 (0.9) | 76 distinct values | 0 (0.0%) |
| 12 | BEDROOM [numeric] | Mean (sd): 2.9 (1.1); min < med < max: 0 < 3 < 7; IQR (CV): 1 (0.4) | 0: 1 (0.2%), 1: 28 (6.0%), 2: 153 (32.6%), 3: 180 (38.3%), 4: 73 (15.5%), 5: 26 (5.5%), 6: 7 (1.5%), 7: 2 (0.4%) | 0 (0.0%) |
| 13 | PARK [numeric] | Mean (sd): 0.8 (1); min < med < max: 0 < 1 < 4; IQR (CV): 1 (1.2) | 0: 224 (47.7%), 1: 136 (28.9%), 2: 84 (17.9%), 3: 18 (3.8%), 4: 8 (1.7%) | 0 (0.0%) |
| 14 | BEDSIZE [numeric] | Mean (sd): 1.4 (0.8); min < med < max: 0 < 1 < 5; IQR (CV): 0.7 (0.6) | 22 distinct values | 0 (0.0%) |
| 15 | PARKSIZE [numeric] | Mean (sd): 0.5 (0.6); min < med < max: 0 < 0.2 < 3; IQR (CV): 1 (1.2) | 13 distinct values | 0 (0.0%) |
| 16 | RAGE10 [numeric] | Min: 0; Mean: 0.2; Max: 1 | 0: 356 (75.7%), 1: 114 (24.3%) | 0 (0.0%) |
| 17 | TCBD [numeric] | Mean (sd): 24.7 (16.2); min < med < max: 0.8 < 23.8 < 73.3; IQR (CV): 25.7 (0.7) | 434 distinct values | 0 (0.0%) |
| 18 | DISTHTC [numeric] | Mean (sd): 1347 (1815.8); min < med < max: 49 < 719 < 17732.7; IQR (CV): 1125 (1.3) | 434 distinct values | 0 (0.0%) |
| 19 | TWCBD [numeric] | Mean (sd): 17 (16.2); min < med < max: 0.3 < 9.9 < 67.8; IQR (CV): 20 (1) | 439 distinct values | 0 (0.0%) |
| 20 | TDWWK [numeric] | Mean (sd): 23.5 (17.1); min < med < max: 0 < 22.2 < 80.7; IQR (CV): 23.6 (0.7) | 414 distinct values | 0 (0.0%) |
| 21 | HEADH [numeric] | Min: 0; Mean: 0.9; Max: 1 | 0: 64 (13.6%), 1: 406 (86.4%) | 0 (0.0%) |
| 22 | POPDENS [numeric] | Mean (sd): 92 (58.2); min < med < max: 0 < 83.2 < 255.6; IQR (CV): 89.2 (0.6) | 431 distinct values | 0 (0.0%) |
| 23 | EDUINDEX [numeric] | Mean (sd): 0.2 (0.1); min < med < max: 0 < 0.2 < 0.7; IQR (CV): 0.2 (0.6) | 434 distinct values | 0 (0.0%) |

Note: I used a different library from the one in the MLR chapter for performing the summary statistics. "R" allows you to do the same or similar tasks with different packages.

Take a look at the main characteristics of the dataset
class(df) #type of data
## [1] "data.frame"
str(df)
## 'data.frame':    470 obs. of  23 variables:
##  $ DWELCLAS: num  5 6 6 5 6 6 4 2 6 5 ...
##  $ INCOME  : num  7500 4750 4750 7500 2750 1500 12500 1500 1500 1500 ...
##  $ CHILD13 : num  1 0 2 0 1 0 0 0 0 0 ...
##  $ H18     : num  2 1 2 3 1 3 3 4 1 1 ...
##  $ HEMPLOY : num  2 1 2 2 1 2 0 2 1 1 ...
##  $ HSIZE   : num  3 1 4 4 2 3 3 4 1 1 ...
##  $ AVADUAGE: num  32 31 41.5 44.7 33 ...
##  $ IAGE    : num  32 31 42 52 33 47 62 21 34 25 ...
##  $ ISEX    : num  1 1 0 1 0 1 1 0 0 0 ...
##  $ NCARS   : num  2 1 2 3 1 1 2 3 1 1 ...
##  $ AREA    : num  100 90 220 120 90 100 178 180 80 50 ...
##  $ BEDROOM : num  2 2 4 3 2 2 5 3 2 1 ...
##  $ PARK    : num  1 1 2 0 0 0 2 0 0 1 ...
##  $ BEDSIZE : num  0.667 2 1 0.75 1 ...
##  $ PARKSIZE: num  0.5 1 1 0 0 0 1 0 0 1 ...
##  $ RAGE10  : num  1 0 1 0 0 0 0 0 1 1 ...
##  $ TCBD    : num  36.79 15.47 24.1 28.72 7.28 ...
##  $ DISTHTC : num  629 551 548 2351 698 ...
##  $ TWCBD   : num  10 15.5 12.71 3.17 5.36 ...
##  $ TDWWK   : num  31.1 0 20.4 32.9 13 ...
##  $ HEADH   : num  1 1 1 1 1 1 1 0 1 1 ...
##  $ POPDENS : num  85.7 146.4 106.6 36.8 181.6 ...
##  $ EDUINDEX: num  0.0641 0.2672 0.1 0.0867 0.1309 ...

Note: When importing the dataset into R, some categorical variables were read in as continuous variables, which is wrong. You should either treat them as categories (factors) or remove them.
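
If you prefer to keep them, a quick sketch of the first option (purely illustrative; the rest of this chapter drops these variables instead) is to convert the columns to factors:

cat_vars <- c("DWELCLAS", "ISEX", "RAGE10", "HEADH")
df_cat <- df # copy just for illustration
df_cat[cat_vars] <- lapply(df_cat[cat_vars], factor) # store them as categories
str(df_cat[cat_vars])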

Remove categorical variables or transform them into dummy (factor)
df = select(df,c(-DWELCLAS, -ISEX, -RAGE10, -HEADH))
Standardize variables: Zscore = (xi - xmean)/sd
mean <- apply(df, 2, mean) # MARGIN = 2 applies the function over columns (1 would apply it over rows)
sd <- apply(df, 2, sd)
df_scaled <- data.frame(scale(df, mean, sd))
Take a look at the first values of the dataset
head(df_scaled,5)
##               INCOME    CHILD13        H18    HEMPLOY      HSIZE   AVADUAGE
## 799161661  1.0794858  0.7729638 -0.1241308  0.6432185  0.3332871 -0.5822216
## 798399409  0.1633759 -0.5107570 -1.2045288 -0.7124438 -1.2651308 -0.6828538
## 798374392  0.1633759  2.0566847 -0.1241308  0.6432185  1.1324961  0.3737844
## 798275277  1.0794858 -0.5107570  0.9562671  0.6432185  1.1324961  0.6924531
## 798264250 -0.5028859  0.7729638 -1.2045288 -0.7124438 -0.4659218 -0.4815894
##                 IAGE      NCARS       AREA    BEDROOM       PARK    BEDSIZE
## 799161661 -0.4215659  0.3285921 -0.2711861 -0.8178146  0.1760564 -0.8541923
## 798399409 -0.5078245 -0.7986947 -0.3534645 -0.8178146  0.1760564  0.8118945
## 798374392  0.4410200  0.3285921  0.7161546  1.0480730  1.2103874 -0.4376706
## 798275277  1.3036059  1.4558789 -0.1066293  0.1151292 -0.8582747 -0.7500619
## 798264250 -0.3353073 -0.7986947 -0.3534645 -0.8178146 -0.8582747 -0.4376706
##              PARKSIZE        TCBD    DISTHTC       TWCBD      TDWWK    POPDENS
## 799161661  0.05980121  0.74438167 -0.3953480 -0.43070769  0.4455778 -0.1082622
## 798399409  0.94365558 -0.56854349 -0.4385995 -0.09209531 -1.3766218  0.9352160
## 798374392  0.94365558 -0.03734795 -0.4400940 -0.26411662 -0.1839164  0.2509393
## 798275277 -0.82405316  0.24759440  0.5527119 -0.85160498  0.5508766 -0.9487325
## 798264250 -0.82405316 -1.07291599 -0.3572443 -0.71640988 -0.6136298  1.5398646
##             EDUINDEX
## 799161661 -1.0852110
## 798399409  0.5511548
## 798374392 -0.7960217
## 798275277 -0.9028004
## 798264250 -0.5467555
Compare the boxplots of the original dataset and standardized one
boxplot(df)

boxplot(df_scaled)

Evaluating the assumptions for factorial analysis

Adequate Sample Size: Ideally, you should have at least 10 observations per variable for reliable factor analysis results. More is generally better.

n_obs <- nrow(df_scaled)
n_vars <- ncol(df_scaled)
ratio=n_obs/n_vars

n_obs
## [1] 470
n_vars
## [1] 19
ratio
## [1] 24.73684
  • Normality
shapiro.test(df_scaled$CHILD13)  # Test normality of each variable
## 
##  Shapiro-Wilk normality test
## 
## data:  df_scaled$CHILD13
## W = 0.57099, p-value < 2.2e-16

Run normality tests on all variables

normality_tests <- sapply(df_scaled, function(x) {
  if (is.numeric(x)) {
    shapiro.test(x)
  } else {
    NA  # Skip non-numeric variables
  }
})

Print the test results for each variable

normality_tests
##           INCOME                        CHILD13                      
## statistic 0.8261471                     0.5709899                    
## p.value   3.299416e-22                  2.331789e-32                 
## method    "Shapiro-Wilk normality test" "Shapiro-Wilk normality test"
## data.name "x"                           "x"                          
##           H18                           HEMPLOY                      
## statistic 0.8375837                     0.8244585                    
## p.value   1.590027e-21                  2.633268e-22                 
## method    "Shapiro-Wilk normality test" "Shapiro-Wilk normality test"
## data.name "x"                           "x"                          
##           HSIZE                         AVADUAGE                     
## statistic 0.8994548                     0.932016                     
## p.value   4.904234e-17                  8.595638e-14                 
## method    "Shapiro-Wilk normality test" "Shapiro-Wilk normality test"
## data.name "x"                           "x"                          
##           IAGE                          NCARS                        
## statistic 0.9353388                     0.8671536                    
## p.value   2.101209e-13                  1.41819e-19                  
## method    "Shapiro-Wilk normality test" "Shapiro-Wilk normality test"
## data.name "x"                           "x"                          
##           AREA                          BEDROOM                      
## statistic 0.3810604                     0.8987106                    
## p.value   4.702747e-37                  4.225014e-17                 
## method    "Shapiro-Wilk normality test" "Shapiro-Wilk normality test"
## data.name "x"                           "x"                          
##           PARK                          BEDSIZE                      
## statistic 0.7906915                     0.8056548                    
## p.value   3.957969e-24                  2.373473e-23                 
## method    "Shapiro-Wilk normality test" "Shapiro-Wilk normality test"
## data.name "x"                           "x"                          
##           PARKSIZE                      TCBD                         
## statistic 0.7774897                     0.9378631                    
## p.value   8.826135e-25                  4.232035e-13                 
## method    "Shapiro-Wilk normality test" "Shapiro-Wilk normality test"
## data.name "x"                           "x"                          
##           DISTHTC                       TWCBD                        
## statistic 0.6022255                     0.8209078                    
## p.value   1.970889e-31                  1.647601e-22                 
## method    "Shapiro-Wilk normality test" "Shapiro-Wilk normality test"
## data.name "x"                           "x"                          
##           TDWWK                         POPDENS                      
## statistic 0.9527661                     0.9670161                    
## p.value   4.122546e-11                  8.739157e-09                 
## method    "Shapiro-Wilk normality test" "Shapiro-Wilk normality test"
## data.name "x"                           "x"                          
##           EDUINDEX                     
## statistic 0.9392883                    
## p.value   6.337856e-13                 
## method    "Shapiro-Wilk normality test"
## data.name "x"

Or extract only the p-values from the normality test results:

normality_pvalues <- sapply(df_scaled, function(x) {
  if (is.numeric(x)) {
    shapiro.test(x)$p.value
  } else {
    NA  # Skip non-numeric variables
  }
})

Print the p-values for each variable

normality_pvalues
##       INCOME      CHILD13          H18      HEMPLOY        HSIZE     AVADUAGE 
## 3.299416e-22 2.331789e-32 1.590027e-21 2.633268e-22 4.904234e-17 8.595638e-14 
##         IAGE        NCARS         AREA      BEDROOM         PARK      BEDSIZE 
## 2.101209e-13 1.418190e-19 4.702747e-37 4.225014e-17 3.957969e-24 2.373473e-23 
##     PARKSIZE         TCBD      DISTHTC        TWCBD        TDWWK      POPDENS 
## 8.826135e-25 4.232035e-13 1.970889e-31 1.647601e-22 4.122546e-11 8.739157e-09 
##     EDUINDEX 
## 6.337856e-13

If you want to focus on which variables do not follow normality, you can filter the output:

non_normal_vars <- names(normality_pvalues[normality_pvalues < 0.05])
non_normal_vars
##  [1] "INCOME"   "CHILD13"  "H18"      "HEMPLOY"  "HSIZE"    "AVADUAGE"
##  [7] "IAGE"     "NCARS"    "AREA"     "BEDROOM"  "PARK"     "BEDSIZE" 
## [13] "PARKSIZE" "TCBD"     "DISTHTC"  "TWCBD"    "TDWWK"    "POPDENS" 
## [19] "EDUINDEX"

If you want to focus on which variables follow normality, you can filter the output:

normal_vars <- names(normality_pvalues[normality_pvalues >= 0.05])
normal_vars
## character(0)

Check histograms to confirm non-normality. Example of histogram:

hist(df_scaled$H18)

Note: There are no normally distributed variables in the dataset. Factor analysis can still work with non-normally distributed data, but for Maximum Likelihood (ML) extraction, normality is preferable.
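
To inspect all distributions at once, a quick sketch with base graphics (assuming the 19 standardized variables fit a 4 x 5 panel grid):

op <- par(mfrow = c(4, 5), mar = c(2, 2, 2, 1)) # panel grid for all histograms
for (v in names(df_scaled)) hist(df_scaled[[v]], main = v, xlab = "")
par(op) # restore the original graphics settings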

  • Linearity Between Variables
pairs(df_scaled[,1:10], pch = 19, lower.panel = NULL)  

# Pairwise scatter plots for the first 10 variables. Check the others!

You can also create an individual scatterplot between two variables:

plot(df_scaled$BEDROOM, df_scaled$H18, 
     main = "Scatterplot", 
     xlab = "BEDROOM", 
     ylab = "H18")
abline(a = 0, b = 1, col = "red", lty = 2)

Note: Most relationships are non-linear.

Correlations between variables

# Correlation matrix
corr_matrix <- cor(df_scaled, method = "pearson")
corrplot(corr_matrix, method = "square", type = "upper")

Check if the correlation is statistically significant.

cor.test(df_scaled$H18,df_scaled$HSIZE)
## 
##  Pearson's product-moment correlation
## 
## data:  df_scaled$H18 and df_scaled$HSIZE
## t = 23.616, df = 468, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6931681 0.7760734
## sample estimates:
##       cor 
## 0.7373854

Note: The null hypothesis is that the correlation is zero. This means that the correlations are only significant when you reject the null hypothesis (pvalue < 0.05).
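
To test all pairs at once rather than one by one, a sketch with corr.test() from the psych package (already loaded above):

ct <- corr.test(df_scaled, method = "pearson") # correlations plus significance tests
round(ct$p, 3) # matrix of p-values (above the diagonal they are adjusted for multiple testing)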

The Bartlett sphericity test

This test checks if the correlation matrix is significantly different from an identity matrix (where variables are uncorrelated). If significant, the data is suitable for factor analysis. If the p-value is small (p < 0.05), you can reject the null hypothesis (which states that the variables are uncorrelated), meaning that FA is appropriate.

cortest.bartlett(corr_matrix, n = nrow(df_scaled))
## $chisq
## [1] 4794.639
## 
## $p.value
## [1] 0
## 
## $df
## [1] 171

Note: The null hypothesis is that there is no correlation between variables. Therefore, in factor analysis you want to reject the null hypothesis.

  • Check for sampling adequacy - KMO test

It assesses whether the correlations between variables are high enough to justify a factor analysis. It looks at the proportion of variance that could be common among variables.

KMO(corr_matrix)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = corr_matrix)
## Overall MSA =  0.59
## MSA for each item = 
##   INCOME  CHILD13      H18  HEMPLOY    HSIZE AVADUAGE     IAGE    NCARS 
##     0.86     0.30     0.53     0.87     0.57     0.56     0.53     0.73 
##     AREA  BEDROOM     PARK  BEDSIZE PARKSIZE     TCBD  DISTHTC    TWCBD 
##     0.55     0.53     0.55     0.59     0.48     0.69     0.77     0.52 
##    TDWWK  POPDENS EDUINDEX 
##     0.70     0.79     0.75

Note: We want an overall Measure of Sampling Adequacy (MSA) of at least 0.7. If 0.6 < MSA < 0.7, it is not a good value, but may be acceptable in some cases.

MSA for each item (variable-specific KMO scores): The KMO test also provides an MSA (Measure of Sampling Adequacy) for each variable, indicating how well each variable fits with the others in terms of common variance. Here are some important interpretations:

  • Good MSA values (≥ 0.70): variables such as INCOME (0.86) and HEMPLOY (0.87) are well suited for factor analysis and share sufficient common variance with the other variables.

  • Mediocre MSA values (0.50 ≤ MSA < 0.70): variables like HSIZE (0.57) and PARK (0.55) are marginal for factor analysis; they share some variance with the other variables but are not strong contributors.

  • Low MSA values (< 0.50): variables like CHILD13 (0.30) and PARKSIZE (0.48) have very low MSA scores; they do not share enough common variance with the others and might be poorly suited for factor analysis.
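
To pick these items out programmatically, a small sketch using the per-item MSA values returned by KMO() (stored here in a new object, kmo_res):

kmo_res <- KMO(corr_matrix) # same call as above, but keeping the result
names(kmo_res$MSAi[kmo_res$MSAi < 0.5]) # items with individual MSA below 0.5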

Exclude the variables with low MSA values:

new_df <- df_scaled |> select(-CHILD13, -PARKSIZE)

Correlation matrix

corr_matrix <- cor(new_df, method = "pearson")
KMO(corr_matrix)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = corr_matrix)
## Overall MSA =  0.67
## MSA for each item = 
##   INCOME      H18  HEMPLOY    HSIZE AVADUAGE     IAGE    NCARS     AREA 
##     0.85     0.85     0.88     0.64     0.56     0.55     0.81     0.74 
##  BEDROOM     PARK  BEDSIZE     TCBD  DISTHTC    TWCBD    TDWWK  POPDENS 
##     0.41     0.76     0.46     0.68     0.77     0.52     0.69     0.74 
## EDUINDEX 
##     0.76
  • Check for multicollinearity. Multicollinearity happens when variables are very highly correlated (e.g., correlations above 0.9), which can distort factor analysis results.
vif(lm(df_scaled$INCOME ~ ., data = new_df))  # Replace INCOME with other variables
##      H18  HEMPLOY    HSIZE AVADUAGE     IAGE    NCARS     AREA  BEDROOM 
## 2.855273 1.605823 5.841001 1.761272 1.788480 1.828819 1.420604 4.418189 
##     PARK  BEDSIZE     TCBD  DISTHTC    TWCBD    TDWWK  POPDENS EDUINDEX 
## 1.348317 5.185469 2.332047 1.603804 1.376212 1.485086 1.484629 1.346856

Note: If the VIF of any variable is greater than 10, multicollinearity may be an issue; consider removing or combining highly correlated variables. Here the concern may be the opposite: the VIF values are all rather low.
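
Instead of replacing the response variable by hand, a sketch that computes each variable's VIF directly (the VIF of a variable equals 1/(1 - R²) from regressing it on all the others; vif_all is just a name introduced here):

vif_all <- sapply(names(new_df), function(v) {
  r2 <- summary(lm(reformulate(setdiff(names(new_df), v), response = v), data = new_df))$r.squared
  1 / (1 - r2) # VIF of variable v
})
round(sort(vif_all, decreasing = TRUE), 2)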

Determine the number of factors to extract

1. Parallel Analysis

num_factors = fa.parallel(new_df, fm = "ml", fa = "fa")

## Parallel analysis suggests that the number of factors =  5  and the number of components =  NA

Note: fm = factoring method ("ml" = maximum likelihood); fa = "fa" shows the factor analysis eigenvalues (rather than the principal components).

The selection of the number of factors in the Parallel analysis can be threefold:

  • Detect where there is an “elbow” in the graph;
  • Detect the intersection between the “FA Actual Data” and the “FA Simulated Data”;
  • Consider the number of factors with eigenvalue > 1.

2. Kaiser Criterion

sum(num_factors$fa.values > 1) #Determines the number of factors with eigenvalue > 1
## [1] 3

You can also consider factors with eigenvalues > 0.7, since some of the literature indicates that this threshold does not overestimate the number of factors as much as using an eigenvalue of 1.
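
For example, counting the factors under the 0.7 threshold instead:

sum(num_factors$fa.values > 0.7) # number of factors with eigenvalue > 0.7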

3. Principal Component Analysis (PCA)

  • Print variance that explains the components
df_pca <- princomp(new_df, cor = FALSE) # cor = TRUE would standardize the data first; not needed here, since new_df is already standardized
summary(df_pca)  
## Importance of components:
##                           Comp.1    Comp.2    Comp.3     Comp.4     Comp.5
## Standard deviation     1.9370196 1.5918203 1.3929124 1.24631263 1.07780840
## Proportion of Variance 0.2211791 0.1493703 0.1143730 0.09156512 0.06847929
## Cumulative Proportion  0.2211791 0.3705494 0.4849224 0.57648757 0.64496685
##                            Comp.6     Comp.7     Comp.8     Comp.9    Comp.10
## Standard deviation     0.98383064 0.89742365 0.83885597 0.77390307 0.73480693
## Proportion of Variance 0.05705803 0.04747567 0.04148116 0.03530606 0.03182897
## Cumulative Proportion  0.70202488 0.74950056 0.79098172 0.82628777 0.85811674
##                           Comp.11    Comp.12    Comp.13    Comp.14    Comp.15
## Standard deviation     0.72482791 0.69586966 0.67614397 0.56328154 0.54655748
## Proportion of Variance 0.03097034 0.02854512 0.02694973 0.01870368 0.01760953
## Cumulative Proportion  0.88908708 0.91763220 0.94458193 0.96328562 0.98089515
##                           Comp.16     Comp.17
## Standard deviation     0.49692369 0.277773884
## Proportion of Variance 0.01455645 0.004548403
## Cumulative Proportion  0.99545160 1.000000000
  • Scree Plot
plot(df_pca,type="lines", npcs = 17, las = 2) 

Note: Check the cumulative variance of the first components and the scree plot, and judge whether PCA is a good approach for detecting the number of factors in this case.

PCA is not the same thing as Factor Analysis! PCA considers the total variance of the variables, while factor analysis separates the common (shared) variance from each variable's unique variance and models only the shared part. The two approaches are often mixed up. In this example we use PCA only as a first criterion for choosing the number of factors. PCA is widely used in image recognition and in reducing the dimensionality of big data.

Exploratory Factor Analysis

  • Model 1: No rotation
  • Model 2: Rotation Varimax
  • Model 3: Rotation Oblimin
# No rotation
df_factor <- factanal(new_df, factors = 4, rotation = "none", scores=c("regression"), fm = "ml")
# Rotation Varimax
df_factor_var <- factanal(new_df, factors = 4, rotation = "varimax", scores=c("regression"), fm = "ml")
# Rotation Oblimin
df_factor_obl <- factanal(new_df, factors = 4, rotation = "oblimin", scores=c("regression"), fm = "ml")

Let’s print out the results, and have a look.

print(df_factor, digits=2, cutoff=0.3, sort=TRUE) #cutoff of 0.3 because the sample size is larger than 350 observations
## 
## Call:
## factanal(x = new_df, factors = 4, scores = c("regression"), rotation = "none",     fm = "ml")
## 
## Uniquenesses:
##   INCOME      H18  HEMPLOY    HSIZE AVADUAGE     IAGE    NCARS     AREA 
##     0.78     0.37     0.64     0.11     0.48     0.23     0.66     0.77 
##  BEDROOM     PARK  BEDSIZE     TCBD  DISTHTC    TWCBD    TDWWK  POPDENS 
##     0.00     0.89     0.12     0.13     0.66     0.81     0.73     0.81 
## EDUINDEX 
##     0.78 
## 
## Loadings:
##          Factor1 Factor2 Factor3 Factor4
## H18       0.70            0.38          
## HEMPLOY   0.52                          
## HSIZE     0.81            0.47          
## NCARS     0.53                          
## BEDSIZE  -0.89                          
## TCBD              0.92                  
## DISTHTC           0.58                  
## BEDROOM                   1.00          
## AVADUAGE                          0.70  
## IAGE                              0.85  
## INCOME    0.32                          
## AREA                      0.45          
## PARK                                    
## TWCBD             0.42                  
## TDWWK             0.47                  
## POPDENS          -0.42                  
## EDUINDEX         -0.46                  
## 
##                Factor1 Factor2 Factor3 Factor4
## SS loadings       2.70    2.07    1.92    1.35
## Proportion Var    0.16    0.12    0.11    0.08
## Cumulative Var    0.16    0.28    0.39    0.47
## 
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 418.72 on 74 degrees of freedom.
## The p-value is 1.38e-49
print(df_factor_var, digits=2, cutoff=0.3, sort=TRUE) 
## 
## Call:
## factanal(x = new_df, factors = 4, scores = c("regression"), rotation = "varimax",     fm = "ml")
## 
## Uniquenesses:
##   INCOME      H18  HEMPLOY    HSIZE AVADUAGE     IAGE    NCARS     AREA 
##     0.78     0.37     0.64     0.11     0.48     0.23     0.66     0.77 
##  BEDROOM     PARK  BEDSIZE     TCBD  DISTHTC    TWCBD    TDWWK  POPDENS 
##     0.00     0.89     0.12     0.13     0.66     0.81     0.73     0.81 
## EDUINDEX 
##     0.78 
## 
## Loadings:
##          Factor1 Factor2 Factor3 Factor4
## H18       0.76                          
## HEMPLOY   0.56                          
## HSIZE     0.89                          
## NCARS     0.56                          
## BEDSIZE  -0.82            0.46          
## TCBD              0.92                  
## DISTHTC           0.58                  
## BEDROOM                   0.97          
## AVADUAGE                          0.72  
## IAGE                              0.88  
## INCOME    0.37                          
## AREA                      0.44          
## PARK                                    
## TWCBD             0.43                  
## TDWWK             0.48                  
## POPDENS          -0.42                  
## EDUINDEX         -0.46                  
## 
##                Factor1 Factor2 Factor3 Factor4
## SS loadings       2.91    2.09    1.59    1.44
## Proportion Var    0.17    0.12    0.09    0.08
## Cumulative Var    0.17    0.29    0.39    0.47
## 
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 418.72 on 74 degrees of freedom.
## The p-value is 1.38e-49
print(df_factor_obl, digits=2, cutoff=0.3, sort=TRUE)
## 
## Call:
## factanal(x = new_df, factors = 4, scores = c("regression"), rotation = "oblimin",     fm = "ml")
## 
## Uniquenesses:
##   INCOME      H18  HEMPLOY    HSIZE AVADUAGE     IAGE    NCARS     AREA 
##     0.78     0.37     0.64     0.11     0.48     0.23     0.66     0.77 
##  BEDROOM     PARK  BEDSIZE     TCBD  DISTHTC    TWCBD    TDWWK  POPDENS 
##     0.00     0.89     0.12     0.13     0.66     0.81     0.73     0.81 
## EDUINDEX 
##     0.78 
## 
## Loadings:
##          Factor1 Factor2 Factor3 Factor4
## H18       0.73                          
## HEMPLOY   0.54                          
## HSIZE     0.85            0.31          
## NCARS     0.54                          
## BEDSIZE  -0.88            0.45          
## TCBD              0.93                  
## DISTHTC           0.59                  
## BEDROOM                   0.98          
## AVADUAGE                          0.72  
## IAGE                              0.88  
## INCOME    0.35                          
## AREA                      0.44          
## PARK                                    
## TWCBD             0.43                  
## TDWWK             0.48                  
## POPDENS          -0.43                  
## EDUINDEX         -0.46                  
## 
##                Factor1 Factor2 Factor3 Factor4
## SS loadings       2.78    2.11    1.62    1.41
## Proportion Var    0.16    0.12    0.10    0.08
## Cumulative Var    0.16    0.29    0.38    0.47
## 
## Factor Correlations:
##         Factor1 Factor2 Factor3 Factor4
## Factor1   1.000   0.112  -0.061   0.145
## Factor2   0.112   1.000  -0.079   0.052
## Factor3  -0.061  -0.079   1.000   0.022
## Factor4   0.145   0.052   0.022   1.000
## 
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 418.72 on 74 degrees of freedom.
## The p-value is 1.38e-49

Note: The total variance of each variable = communality + uniqueness. You want uniquenesses below 0.3 (communalities above 0.7).
Varimax is an orthogonal rotation; oblimin is an oblique rotation (it allows the factors to correlate).
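
Since factanal() reports uniquenesses, the communalities can be recovered directly; a quick sketch for the varimax model:

round(1 - df_factor_var$uniquenesses, 2) # communality = 1 - uniqueness for each variable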

Plot factor 3 against factor 4, and compare the results of different rotations

  • No Rotation
plot(
  df_factor$loadings[, 3],
  df_factor$loadings[, 4],
  xlab = "Factor 3",
  ylab = "Factor 4",
  ylim = c(-1, 1),
  xlim = c(-1, 1),
  main = "No rotation"
)
abline(h = 0, v = 0)
load <- df_factor$loadings[, 3:4]
text(
  load,
  labels = names(new_df), # label with the variables actually used in the factor model
  cex = .7,
  col = "blue"
)

  • Varimax rotation
plot(
  df_factor_var$loadings[, 3],
  df_factor_var$loadings[, 4],
  xlab = "Factor 3",
  ylab = "Factor 4",
  ylim = c(-1, 1),
  xlim = c(-1, 1),
  main = "Varimax rotation"
)
abline(h = 0, v = 0)
load <- df_factor_var$loadings[, 3:4]
text(
  load,
  labels = names(new_df),
  cex = .7,
  col = "red"
)

  • Oblimin Rotation
plot(
  df_factor_obl$loadings[, 3],
  df_factor_obl$loadings[, 4],
  xlab = "Factor 3",
  ylab = "Factor 4",
  ylim = c(-1, 1),
  xlim = c(-1, 1),
  main = "Oblimin rotation"
)
abline(h = 0, v = 0)
load <- df_factor_obl$loadings[, 3:4]
text(
  load,
  labels = names(new_df),
  cex = .7,
  col = "darkgreen"
)

When you have more than two factors, it is difficult to analyze them through pairwise plots: a variable with low loadings on the two factors plotted may be strongly explained by a factor not shown in the graph. Still, try comparing the plots with the factor loadings, and plot the remaining factor pairs to get more familiar with exploratory factor analysis.
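
As a complement to the pairwise plots, one option is to display the whole loading matrix at once; a sketch reusing corrplot (treating the loadings as a plain matrix rather than correlations) and peeking at the factor scores:

corrplot(unclass(df_factor_var$loadings), is.corr = FALSE, method = "square") # heatmap of all loadings
head(df_factor_var$scores) # regression factor scores, one column per latent factor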