# Covariance, correlation and R-squared

## The complete guide to understanding covariance, correlation and R²

In the last article we talked in depth about variance and standard deviation and how they briefly summarize each feature of a given dataset. They give a first impression of a variable by showing how it is distributed in our sample. Building on those, today we are going to look at three other, similar metrics. Instead of treating every feature alone, we are going to treat them two by two to see whether there is a relationship between them, what kind of relationship it is and how strong it is. Those metrics are covariance, correlation and R². They are closely related to each other in statistics and probability theory, and they are widely used in statistics and machine learning. So what is the definition behind these metrics? What are the differences between them, and how can we use them?

# Covariance

## Definition

Remember that last time, when we talked about variance and standard deviation, we said that these metrics measure the spread and the dispersion of the data for a given feature or variable. Covariance uses the same principle, but this time between two variables. In fact, it generalizes the notion of variance to a pair of features. In a nutshell, covariance measures the joint dispersion, or joint variation, of two variables.

Mathematically speaking, the covariance of two variables in a sample is the mean of the products of two signed distances: the distance between the first variable and its mean, and the distance between the second variable and its mean. In this way, the joint variation of the two variables is evaluated across our sample.
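For reference, the sample version of this definition can be written as below. The notation is our own assumption, since the original formula is not reproduced here: the two sample means are written with bars, and n is the number of observations.

```latex
\operatorname{Cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
```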

Note: If we are talking about a population (everyone/everything), we should divide by n. Otherwise, and most likely, it is just a sample and we should divide by n-1. The reason behind this is explained here

This formula can be expressed more generally and more directly as follows:
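A common way to write this more general form uses the expectation operator E. This is the standard identity, which we assume is the one intended here:

```latex
\operatorname{Cov}(X, Y)
  = \mathbb{E}\big[(X - \mathbb{E}[X])\,(Y - \mathbb{E}[Y])\big]
  = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]
```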

Note that the covariance of a variable with itself is its variance. We can see it easily coming from the first formula.
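As a minimal sketch of the definition, the function below computes the covariance by hand, with a switch between the sample (n-1) and population (n) versions. The function name `covariance`, its `sample` parameter and the data are made up for illustration.

```python
from statistics import mean, variance


def covariance(xs, ys, sample=True):
    """Covariance of two equally long sequences.

    sample=True divides by n - 1, sample=False divides by n (population).
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of the products of the deviations from each mean.
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)


x = [2.0, 4.0, 6.0, 8.0]  # made-up values
y = [1.0, 3.0, 2.0, 6.0]  # made-up values

cov_xy = covariance(x, y)
cov_xx = covariance(x, x)  # covariance of a variable with itself
var_x = variance(x)        # sample variance from the standard library
```

Here `cov_xx` and `var_x` agree, which illustrates that the covariance of a variable with itself is its variance.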

The value of covariance can be any real number, from minus infinity to plus infinity.

## Interpretability

Now that we know the definition of covariance and its mathematical formula, we want to know what kind of information this metric gives us.

Suppose that we have the following data from two variables X and Y:

This means that the more points we have in the green area, the more positive the covariance will be. And the more points we have in the red area, the more negative the covariance will be. The figure below shows this:

So generally we can conclude that positive covariance means the points are sloping upwards (positive trend), and negative covariance means the points are sloping downwards (negative trend). This leads us to the reason why we see these two shapes. It’s called dependency.
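The link between the sign of the covariance and the direction of the trend can be checked numerically. Each point contributes the product of its two deviations from the means, which is positive when both deviations share the same sign and negative otherwise. The sketch below counts these contributions. The data and the helper name `contribution_signs` are illustrative assumptions.

```python
from statistics import mean


def contribution_signs(xs, ys):
    """Count points whose deviation product is positive, negative or zero."""
    x_bar, y_bar = mean(xs), mean(ys)
    counts = {"positive": 0, "negative": 0, "zero": 0}
    for x, y in zip(xs, ys):
        product = (x - x_bar) * (y - y_bar)
        if product > 0:
            counts["positive"] += 1
        elif product < 0:
            counts["negative"] += 1
        else:
            counts["zero"] += 1
    return counts


xs = [1, 2, 3, 4, 5, 6]
ys_up = [2, 1, 4, 3, 6, 5]     # made-up data with an upward trend
ys_down = [-y for y in ys_up]  # same data flipped, so the trend points down

up = contribution_signs(xs, ys_up)
down = contribution_signs(xs, ys_down)
```

With the upward-sloping data most products are positive, and flipping the trend swaps the two counts.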

## Covariance and dependency

Dependency of one variable on another means that a change in one variable implies a change in the other. This change may be in the same direction or in the opposite direction. For example, weight and height are dependent in a positive way, i.e. if you are tall you are likely to weigh more than a short person. An example of a negative dependency is the one between altitude and temperature.

Note that the value of covariance says nothing about the magnitude of the dependency. A non-zero covariance only tells us that there is a dependency, not how strongly the variables are dependent.

We can actually prove that easily as follows:
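A sketch of the standard argument, by contraposition: if X and Y are independent, the expectation of their product factorizes, so their covariance is zero. A non-zero covariance therefore rules out independence.

```latex
X \text{ and } Y \text{ independent}
\;\Longrightarrow\; \mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]
\;\Longrightarrow\; \operatorname{Cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y] = 0
```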

We can also prove that being independent and having zero covariance are not equivalent. Let’s dig in with this counterexample:
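One standard counterexample, not necessarily the one originally shown here, takes X equal to -1, 0 or 1 with equal probability and sets Y = X². Y is completely determined by X, yet the covariance vanishes. The sketch below checks this with exact fractions.

```python
from fractions import Fraction

# X takes the values -1, 0 and 1 with equal probability, and Y = X**2,
# so Y is fully determined by X (the two variables are dependent).
values = [Fraction(-1), Fraction(0), Fraction(1)]
prob = Fraction(1, 3)

e_x = sum(prob * x for x in values)          # E[X]
e_y = sum(prob * x**2 for x in values)       # E[Y] = E[X^2]
e_xy = sum(prob * x * x**2 for x in values)  # E[XY] = E[X^3]

cov_xy = e_xy - e_x * e_y  # Cov(X, Y) = E[XY] - E[X]E[Y]
```

Both E[X] and E[XY] are zero by symmetry, so `cov_xy` is exactly zero even though Y depends on X.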

So we can have two variables that are dependent but their covariance is zero.

We have now seen that a covariance different from zero hints that two variables are dependent, and we have seen the two cases of positive and negative covariance. But we also want to interpret the value of this covariance, and we have the impression that the bigger it gets, the stronger the dependency. Most of the time this is not the case.

Let’s suppose that you have two variables X and Y that are perfectly dependent, such that Y = 2X. Let’s also take the height and weight of five individuals. We have the following two tables:

We can see that the perfect dependency gives a smaller covariance than height and weight, which are not perfectly linearly related. This is due to the scale of our data.
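The effect of scale can be reproduced with a small sketch. The numbers below are hypothetical and are not the values from the tables above. They only illustrate that a perfectly linear pair on a small scale can have a smaller covariance than an imperfect pair on a larger scale.

```python
from statistics import mean


def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    x_bar, y_bar = mean(xs), mean(ys)
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


# Perfect dependency Y = 2X on a small scale (hypothetical values).
x = [1, 2, 3, 4, 5]
y = [2 * v for v in x]

# Hypothetical heights (cm) and weights (kg) of five individuals;
# the relationship is positive but not perfectly linear.
height = [160, 165, 170, 175, 180]
weight = [55, 62, 63, 72, 75]

cov_perfect = covariance(x, y)
cov_height_weight = covariance(height, weight)
```

With these values `cov_height_weight` is far larger than `cov_perfect`, purely because heights and weights are measured on a larger scale.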

So covariance, in and of itself, is not very interesting when it comes to the magnitude of its value. It’s just a computational step towards something more interesting. This is where scaling comes in to make the value meaningful, and the result is correlation.

# Correlation

In a nutshell, correlation is covariance + meaningful magnitude that is always between -1 and 1. It has the same properties as covariance, which we recall here:

- For a **positive** value: we can say that lower values of the first variable tend to be paired with lower values of the second variable, and higher values of one tend to be paired with higher values of the other.
- For a **negative** value: we can conclude that lower values of one variable tend to be paired with higher values of the other, and vice versa.
- If the value is **zero**: we cannot conclude anything about the type of the relationship or its quality.

Now, in addition to covariance properties, comes the most interesting part: the magnitude of the value. With correlation we can quantify the strength of a relationship.

Remember the scaling issue in our example. It is the reason why the value of covariance does not make sense on its own. Differences in scale between datasets make the value unstable: it can land anywhere between the two infinities, with no meaningful interpretation apart from the properties mentioned above. So let’s scale the data and resolve this problem.
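The usual way to do this scaling is to divide the covariance by the product of the two standard deviations, which gives the Pearson correlation coefficient. The symbols below are our own notation.

```latex
r_{XY} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \, \sigma_Y}
```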

This is where correlation comes from. One of its properties is that correlation is always between -1 and 1. This is now interpretable and we can conclude that:

- If the absolute value of correlation is near 1, the relationship between the two features is **strong**.
- If the absolute value is near 0, the relationship is weak.

So the farther the absolute value of correlation is from 1, the weaker the relationship between the two variables.

We can simply prove that the correlation is between -1 and 1.
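One short way to see this, sketched here as a standard argument that is not necessarily the proof originally given, is to use the fact that a variance is never negative. Apply it to the sum and the difference of the two standardized variables:

```latex
\begin{aligned}
0 \le \operatorname{Var}\!\left(\frac{X}{\sigma_X} \pm \frac{Y}{\sigma_Y}\right)
  &= \frac{\operatorname{Var}(X)}{\sigma_X^2}
   + \frac{\operatorname{Var}(Y)}{\sigma_Y^2}
   \pm \frac{2\operatorname{Cov}(X, Y)}{\sigma_X \, \sigma_Y} \\
  &= 2 \pm 2\, r_{XY}
\end{aligned}
```

The plus sign gives a correlation of at least -1 and the minus sign gives a correlation of at most 1.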

Correlation equals 1 when a straight line with a positive slope goes through every point in our data. This means that if someone gives us one variable, we can predict the other perfectly. The same goes for a correlation of -1, with a negative slope.

Note that we need enough data to be confident in the strength of the relationship. For instance, any two points with different x and y values have a correlation of exactly 1 or -1, because we can always connect them with a straight line. So the more data we have, the more confident we can be about the strength of the relationship.
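These properties can be checked with a small sketch of the Pearson correlation. The function name `correlation` and the data are illustrative assumptions. The n-1 factors of the covariance and the standard deviations cancel, so plain sums are enough. The function assumes neither variable is constant, otherwise it would divide by zero.

```python
from math import sqrt
from statistics import mean


def correlation(xs, ys):
    """Pearson correlation of two equally long sequences."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Two arbitrary points are always perfectly correlated (here negatively).
r_two_points = correlation([1.0, 4.0], [10.0, 3.0])

# Perfect dependency Y = 2X.
r_perfect = correlation([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])

# Rescaling X by 100 changes the covariance but not the correlation.
r_rescaled = correlation([100, 200, 300, 400, 500], [2, 4, 6, 8, 10])
```

Unlike covariance, the result does not change when one variable is expressed in different units.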

## Limitations of correlation

Even though correlation gives us more information than covariance, its values are still not easy to interpret. We can say that a correlation of 0.9 is better than 0.64, but can we say that it is twice as good? Or that a correlation of 0.9 means that 90% of the variation in the data is explained by the relationship between the variables?

Another important limitation to know is that correlation and covariance only measure the linear relationship between the two features. They cannot capture a non-linear relationship.

The good news is that R² is here to resolve the first problem, and the second one in linear regression.

# R² or R-squared

## Definition

R-squared is a measure that is very similar to correlation. In fact, for a simple linear relationship, R-squared is, as its name says, the square of r, the correlation. It is more interpretable and gives us more information than correlation alone.

One of the easiest definitions to understand is this: R² is how much of one variable is explained by the other, and this is where its formula comes from. For example, suppose we have two variables X and Y and we want to know how much of Y is explained by X. To do so, we divide the variance explained by the regressor by the variance of the variable we want to explain. In this case the variance explained by the regressor is Var(Y) - Var(residuals), so R² = (Var(Y) - Var(residuals)) / Var(Y).

If, in the example above, we had R² = 70%, we could say that X explains 70% of the variation of Y.
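As a rough sketch of this definition, the code below fits a least-squares line and then computes R² as (Var(Y) - Var(residuals)) / Var(Y). The helper names and the height and weight values are hypothetical, chosen only for illustration.

```python
from statistics import mean, variance


def simple_linear_fit(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def r_squared(xs, ys):
    """Share of the variance of ys explained by a straight-line fit on xs."""
    slope, intercept = simple_linear_fit(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    var_y = variance(ys)
    var_residuals = variance(residuals)
    # (Var(Y) - Var(residuals)) / Var(Y)
    return (var_y - var_residuals) / var_y


height = [160, 165, 170, 175, 180]  # hypothetical heights in cm
weight = [55, 62, 63, 72, 75]       # hypothetical weights in kg

r2 = r_squared(height, weight)
```

With this made-up data, `r2` is the fraction of the variation in weight that the fitted line on height accounts for.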

We can also write the formula differently, with sums of squares instead of variances, as follows.
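In the usual notation, which is assumed here, the fitted values carry a hat, SS_res is the residual sum of squares and SS_tot is the total sum of squares. The sum-of-squares form then reads:

```latex
R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}
    = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```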

## R and R²

A lot of people might ask a legitimate question: why is R-squared called the square of R, the correlation coefficient? It is a good question, and the answer is not very intuitive. We are going to clear the fog here with a short calculation:

Let’s make things simple and prove it in a simple linear regression.
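A sketch of the calculation, under the assumption of an ordinary least-squares fit with an intercept. The shorthand S with subscripts for the sums of squares and cross products is our own notation.

```latex
\begin{aligned}
S_{xy} &= \sum_{i}(x_i - \bar{x})(y_i - \bar{y}), \qquad
S_{xx} = \sum_{i}(x_i - \bar{x})^2, \qquad
S_{yy} = \sum_{i}(y_i - \bar{y})^2 \\
\hat{y}_i &= \hat{a}\,x_i + \hat{b}, \qquad
\hat{a} = \frac{S_{xy}}{S_{xx}}, \qquad
\hat{b} = \bar{y} - \hat{a}\,\bar{x}
\quad\Longrightarrow\quad
\hat{y}_i - \bar{y} = \hat{a}\,(x_i - \bar{x}) \\
R^2 &= \frac{\sum_{i}(\hat{y}_i - \bar{y})^2}{\sum_{i}(y_i - \bar{y})^2}
     = \frac{\hat{a}^2\, S_{xx}}{S_{yy}}
     = \frac{S_{xy}^2}{S_{xx}\, S_{yy}}
     = \left(\frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}}\right)^{2}
     = r^2
\end{aligned}
```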

Note: You can find from here how the analytical solution is computed

## Reminder: R² is not valid for non linear regression

Let’s first define what a linear regression is. Linear regression is not only:

This is why R-squared is always between 0 and 1. However, this is not the case for non-linear regression, because the variances do not add up as above. This is why computing it in that setting can give misleading results: we may get a negative R² or an R² greater than 1.
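The additivity referred to here can be written out explicitly. For any fitted values, the total sum of squares splits into three terms:

```latex
\sum_{i}(y_i - \bar{y})^2
  = \sum_{i}(\hat{y}_i - \bar{y})^2
  + \sum_{i}(y_i - \hat{y}_i)^2
  + 2\sum_{i}(\hat{y}_i - \bar{y})(y_i - \hat{y}_i)
```

For a least-squares linear regression with an intercept, the last (cross) term is zero, so the explained and residual parts add up to the total and R² stays between 0 and 1. For other kinds of fit the cross term generally does not vanish. The ratio of explained to total sum of squares can then exceed 1, and one minus the ratio of residual to total sum of squares can become negative.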

# Conclusion

In this relatively long article, we have seen a lot of definitions and properties. But we should keep in mind:

- Covariance measures the spread of data like variance does, but jointly for two variables.
- Positive covariance means the points are sloping upwards, and negative covariance means they are sloping downwards.
- Independent variables have zero covariance, but zero covariance does not imply independence.
- The magnitude of covariance is not very useful. This is why we use correlation (the scaled covariance), which is interpretable.
- Correlation is limited when it comes to comparing two values, and it only captures linear relationships.
- R² measures the share of variance explained by the regressor.