Introduction to Principal Component Analysis (PCA)

Learn PCA, how to interpret it, and how to implement it in R

Yassine EL KHAL
6 min read · Oct 8, 2020

Principal Component Analysis (PCA) is a dimensionality reduction method that can be used for feature extraction or representation learning. It transforms the data from a d-dimensional space into a new coordinate system of p dimensions (p ≤ d), from which we keep only the q most important components (q << d).

When should I use it?

First of all, we need to know that PCA works only with continuous variables. So if you have a mixture of categorical and continuous variables, you have to keep only the continuous ones, as in the short sketch below.
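As a quick illustration (the data frame name `players` is only an assumption here for the example), selecting the numeric columns in R could look like this:

```r
# Keep only the numeric columns of a data frame before running PCA.
# `players` is a hypothetical data frame mixing numeric and categorical columns.
numeric_cols <- sapply(players, is.numeric)
players_num  <- players[, numeric_cols]
```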

We can use PCA:

  • Just to visualize data in a space of two or three dimensions
  • If the interpretation of your model features isn’t very important to you
  • If you want to make your features independent
  • If you can tolerate losing a part of information in your data

The core of PCA

The main idea in PCA is the correspondence between the information that the data gives us and the variance of its features.

Say we have a population of people with different ages, jobs and weights, but who all have the same height. Since the height feature is the same for all observations, it doesn’t give any information about any individual and doesn’t make any individual different from the others, whereas the other columns give us the distinction we need to see between our observations. This means the information of a column is embodied in its variance: since the variance of the height is zero, it doesn’t tell us anything about how our observations differ. This is also why we can’t use categorical variables in our analysis, because their variation isn’t correctly described by variance. There are appropriate techniques to deal with this problem, such as MFA (Multiple Factor Analysis) or polychoric correlations for mixed data, but this is not our topic today.

Remember: because PCA works with variation and Euclidean distances, it is mandatory to normalize your data so that no feature dominates simply because of its scale. This is also one of PCA’s drawbacks: it is highly sensitive to outliers. So it is recommended to do some outlier analysis before processing your data.
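A minimal sketch of the normalization step, assuming the numeric data frame `players_num` from the snippet above:

```r
# Standardize every column to mean 0 and standard deviation 1,
# so that no feature dominates the variance just because of its units.
players_scaled <- scale(players_num)

# Alternatively, prcomp() can do this internally with center = TRUE, scale. = TRUE.
```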

How does PCA work?

Following the main idea explained above, PCA searches for the axis that captures the maximum variance of the data.

The example below shows that the red axis is the one that captures the most variance, and the green one captures the second most. The image on the right shows the new coordinate system, where the red axis is the first principal component and the green one is the second.

Image from setosa.io where we transform the data into a new coordinate system

Mathematically speaking

Now, let’s assume that we have data in a d-dimensional space (with d features), and we want to compute the first principal component u₁ (the axis that captures the highest variation).

Let X be our (n, d) data matrix and S its (d, d) variance-covariance matrix. We’re looking for the vector u that maximizes the variance of the data projected onto it. Since the variance is a quadratic form, the variance of the projected data is uᵀSu.

Now we want to maximize this formula. So the two degrees of freedom are:

  • the direction of u
  • the length of u

As long as we have the direction, we don’t care about the length. So, to make the problem well defined (it isn’t yet, since the objective is unbounded), we need to add a constraint to our optimization problem: we fix the length of u to 1, i.e. uᵀu = 1. Maximizing uᵀSu under this constraint with a Lagrange multiplier 𝜆 leads to the condition Su = 𝜆u.

Seems familiar? Yes, that’s what I thought too. This is an eigenvalue problem: u is an eigenvector of S and 𝜆 is its associated eigenvalue.

Now, since S is a (d, d) matrix, it has d eigenvectors. Which one should I choose so that I can solve my problem? It’s simple: let’s plug the solution back into the optimization problem. Using Su = 𝜆u and uᵀu = 1, the objective becomes uᵀSu = 𝜆uᵀu = 𝜆.

Tadaah, we did it! The variance captured by a component is exactly its eigenvalue, so the first principal component u₁ is the eigenvector with the largest eigenvalue 𝜆. With the same sort of argument we can prove that the second principal component, which captures the second-largest share of variance, is the eigenvector with the second-largest eigenvalue, and so on until the last one.
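To make the link between the math and the code concrete, here is a sketch (assuming the standardized matrix `players_scaled` from above) that computes the components directly from the covariance matrix and compares them with R’s built-in `prcomp()`:

```r
# Variance-covariance matrix S of the standardized data (d x d)
S <- cov(players_scaled)

# Eigen-decomposition: the columns of $vectors are the principal axes,
# and $values are the variances they capture, sorted in decreasing order.
eig <- eigen(S)

# prcomp() gives the same axes (possibly with flipped signs, which is harmless).
pca <- prcomp(players_scaled, center = FALSE, scale. = FALSE)
head(pca$rotation[, 1])   # first principal component from prcomp()
head(eig$vectors[, 1])    # first eigenvector of S
```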

In the next sections, we’ll work on a basketball players dataset with features like weight, age, salary, position, points, etc. You can find the link to the GitHub repository at the end of this article.

How many components should I choose?

Before we decide how many components to choose, we need to define how much information every component carries from the original dataset. This is basic; no more ugly mathematical formulas, I promise.

Every component has an eigenvalue, so the information embodied in a component is its eigenvalue divided by the sum of all d eigenvalues. For example, for the first component, it’s: λ₁/(λ₁ + λ₂ + … + λd), where d is the number of original features.

The information of the first ten components (source: Image by author)
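In R, this ratio is straightforward to compute from a `prcomp` object; a short sketch, reusing the `pca` object defined above:

```r
# The eigenvalues are the squared standard deviations of the components.
eigenvalues <- pca$sdev^2

# Proportion of information (variance) carried by each component.
explained <- eigenvalues / sum(eigenvalues)
round(explained, 3)

# summary(pca) reports the same numbers in its "Proportion of Variance" row.
```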

Now that we have computed all the mathemagic stuff, we need to determine how many components we should keep. Generally, we have two common methods to deal with this:

  • Method 1: In a nutshell, we just pick the first m (arbitrarily chosen) components. This is usually done when we only want to visualize our data; in general, 2 or 3 components are selected for that. However, this is not a principled way to choose the number.
Plotting data in the first two components (source: Image by author)
  • Method 2: This one is more consistent and more logical. We have to determine a threshold of the minimum information we should keep before treating the data. For example, if I want at least 75% of the information conserved, we’ll add components until we hit or exceed 75%, as in the sketch after this list.
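A possible way to apply this rule in R, building on the `explained` vector computed above:

```r
# Cumulative proportion of variance explained by the first k components.
cum_explained <- cumsum(explained)

# Smallest number of components that keeps at least 75% of the information.
n_components <- which(cum_explained >= 0.75)[1]
n_components
```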

Interpreting the new features

Interpreting the new features isn’t an easy problem. It can be very hard to understand what the new features represent, but it’s worth a try. To do that, we project every original feature onto pairs of components (usually only the first two).

Plotting every feature in the first two components (source: Image by author)
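The exact code behind the original figure isn’t shown here, but one common way to draw this kind of plot is the factoextra package (an assumption on my part); equivalently, the coordinates are the correlations between the original features and the component scores:

```r
# Correlation-circle style plot of the features on the first two components
# (factoextra is one possible tool; the original figure may use another).
library(factoextra)
fviz_pca_var(pca, repel = TRUE)

# Equivalently, the position of each feature is its correlation
# with the scores of the first two components.
cor(players_scaled, pca$x[, 1:2])
```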

From the example above, we can see that height and weight are strongly correlated with the first component, while “points”, “field goals made”, “field goals” and “free throws attempted” are strongly correlated with the second component. Thus, we can safely assert that the first component is a body (height, weight) index and the second one reflects performance.

Now we can understand any plot in the first two components. Let’s plot every player, try to understand the result, and form some first ideas about the data. With a beautiful touch in R, we can easily obtain the graph below.

Plotting players in the first two components with convex ellipses (source: Image by author)
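As an assumption about the tooling, a plot like this can be obtained with factoextra, provided the original data frame has a column giving each player’s position:

```r
# Plot the players in the first two components, colored by position,
# with a convex hull drawn around each group.
# `players$position` is assumed to hold guard / forward / center labels.
fviz_pca_ind(pca,
             habillage    = players$position,
             addEllipses  = TRUE,
             ellipse.type = "convex",
             label        = "none")   # hide individual player labels
```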

As the first new feature is a body (height, weight) index, it is totally fair to claim that centers have bigger bodies than forwards, and that the latter have bigger bodies than guards. In terms of performance (the second new feature), we cannot see any glaring difference between these groups.

Conclusion

What you need to keep from this article is:

  • PCA is a dimensionality reduction method that helps us visualize data and reduce the dimension of the working space.
  • The core of PCA is that the information is embodied in the variation of the data.
  • Data should be normalized, and it is recommended to remove outliers.
  • The components of PCA are the eigenvectors of the variance-covariance matrix of our data.
  • The information embodied in a component is its eigenvalue divided by the sum of all eigenvalues.
  • To interpret the new features, we can plot the projections of all original features onto pairs of components.

You can find the source code of the example treated here.
