Project #5: Performing Principal Component Analysis
- Ira
- Apr 1, 2021
- 1 min read
To look into undergraduate educational rankings across America, I was given a dataset of 1302 American colleges and universities. Each institution is described by 17 variables (continuous and categorical) that serve as comparison measurements.
As an initial approach, I removed all categorical variables. I then eliminated the missing numerical measurements from the dataset.
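The two cleaning steps could look roughly like the sketch below. It uses pandas on a tiny made-up table, since the original data and code are not shown here. The column names and values are hypothetical, and dropping whole rows is only one way to handle missing measurements.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the college data; names and values are made up.
df = pd.DataFrame({
    "college": ["A", "B", "C", "D"],
    "state": ["NY", "CA", "TX", "WA"],
    "in_state_tuition": [7500.0, 12000.0, np.nan, 9800.0],
    "out_of_state_tuition": [15000.0, 21000.0, 18000.0, 16500.0],
    "graduation_rate": [65.0, 80.0, 72.0, np.nan],
})

# Step 1: keep only the numeric columns, which drops the categorical ones.
numeric = df.select_dtypes(include="number")

# Step 2 (assumption): drop every row that has at least one missing value.
cleaned = numeric.dropna()

print(cleaned)
```

Only the rows with a complete set of numeric measurements survive, so the PCA input has no gaps.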

After the cleaning process, the remaining continuous variables were used to perform Principal Component Analysis (PCA), as seen below.
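For readers who want to see what the PCA step computes, here is a minimal sketch using NumPy on synthetic data. It is not the code behind the results in this post. The array `X` is a hypothetical stand-in for the cleaned numeric table, and the eigendecomposition of the covariance matrix is one of several equivalent ways to obtain the components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the cleaned data: 200 rows, 3 columns
# measured on very different scales.
X = rng.normal(size=(200, 3)) * np.array([3000.0, 50.0, 10.0])

# Center each column so the components describe variation around the mean.
Xc = X - X.mean(axis=0)

# Eigendecomposition of the covariance matrix (symmetric, so eigh applies).
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; sort them descending.
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

# Share of the total variance carried by each principal component.
explained_ratio = eigvals / eigvals.sum()

# Coordinates of every row in the principal component basis.
scores = Xc @ eigvecs

print(explained_ratio)
```

The columns of `eigvecs` are the loadings: large absolute entries show which original variables drive each component.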

After running PCA on the cleaned data, the first two principal components (PCs) account for 92.5% of the total variance. The variables "in-state tuition" and "out-of-state tuition" dominate the first PC. This is most likely because both are US dollar amounts with much larger values than the other fees (e.g. "estim. book costs" and "add. fees") and the other variables (e.g. "#new stud. enrolled" and "Graduation rate"), and PCA on unscaled data favors the variables with the largest variance. The data should therefore be normalized before drawing conclusions from this dataset.
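The effect of scale can be illustrated with a small synthetic comparison. The sketch below is not based on the real dataset. The four columns and their distributions are invented, and z-score standardization is assumed as the normalization method.

```python
import numpy as np

def explained_ratio(data):
    """Variance share of each principal component, via SVD of centered data."""
    centered = data - data.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

rng = np.random.default_rng(1)
n = 500

# Invented columns: two correlated tuition fees in dollars, plus two
# variables on much smaller scales.
in_state = rng.normal(12000, 4000, n)
out_of_state = in_state + 6000 + rng.normal(0, 2000, n)
book_costs = rng.normal(550, 100, n)
grad_rate = rng.normal(65, 15, n)
X = np.column_stack([in_state, out_of_state, book_costs, grad_rate])

# PCA on raw values: the tuition columns carry almost all of the variance.
raw_ratio = explained_ratio(X)

# Z-score standardization: every column gets mean 0 and unit variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
std_ratio = explained_ratio(Z)

print("first two PCs, raw data:         ", raw_ratio[:2].sum())
print("first two PCs, standardized data:", std_ratio[:2].sum())
```

On the raw values the first two components absorb nearly everything because the tuition columns have by far the largest variance. After standardization each column contributes equally, so the components reflect correlations between variables rather than their units.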