# Variance and standard deviation

## The complete guide to understanding variance and standard deviation

Descriptive statistics describe the basic features of a data set. They help us understand the data by giving short summaries of the sample, like a first impression of what the data shows. In a nutshell, descriptive statistics are metrics that briefly describe our sample, and they’re broken down into measures of central tendency and measures of variability. In this article, we’ll discuss the latter and clear the fog around some famous questions about variance and standard deviation.

# Variance

## Definition

Suppose we have a random variable. By definition, it can take different values, and its distribution determines its range and how it varies. Now we want a metric for how much this variable varies. This is what we call variance: a metric that describes how spread out the values are around their mean. It can also be seen as a measure of the width of a distribution, although that picture fits the normal distribution better than it fits other distributions.

Mathematically, for a population of $N$ values $x_1, \dots, x_N$ with mean $\mu$, the variance is the mean of the squared distances between each point and the mean, so it grows with how far the points in the data set lie from the mean:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$

This formula can be expressed more generally and more directly as follows, because the mean of a random variable over a population is just its expected value:

$$\operatorname{Var}(X) = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big]$$
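To make the definition concrete, here is a minimal Python sketch that computes the population variance straight from its definition. The data set and the helper name `population_variance` are made up for illustration; `statistics.pvariance` from the standard library is used only as a cross-check.

```python
import statistics

def population_variance(values):
    """Mean of the squared distances between each value and the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n

# Hypothetical data set, chosen so the arithmetic is easy to follow by hand.
data = [2, 4, 4, 4, 5, 5, 7, 9]

variance = population_variance(data)
print(variance)                     # 4.0 (mean is 5, squared distances sum to 32)
print(statistics.pvariance(data))   # standard-library cross-check
```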

There’s also another formula for the variance. It’s another way to look at dispersion, and it’s simpler and more elegant:

$$\operatorname{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$$

This one says: the variance is the difference between the expected value of the squared input and the square of the expected value of the input.

Although this formula is simple, it is less interpretable than the first one. We can just see it as an algebraic simplification of the first.
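To see why the second formula is only a rearrangement of the first, we can expand the square and use the linearity of expectation. Writing $\mu = \mathbb{E}[X]$ as shorthand (notation assumed here), a short derivation is:

$$
\begin{aligned}
\operatorname{Var}(X) &= \mathbb{E}\big[(X - \mu)^2\big] \\
&= \mathbb{E}\big[X^2 - 2\mu X + \mu^2\big] \\
&= \mathbb{E}[X^2] - 2\mu\,\mathbb{E}[X] + \mu^2 \\
&= \mathbb{E}[X^2] - \mu^2
\end{aligned}
$$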

# Standard deviation

Many people think of variance as an annoying intermediate step toward the standard deviation, and they’re somewhat right: we tend to think in terms of the normal distribution, which is ubiquitous in probability, statistics and everyday life, and which is usually described by its mean and standard deviation. Another argument is that, for a normal distribution, the standard deviation helps us construct a confidence interval around the mean.

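The next step after computing the variance is to take its square root, and the famous sigma enters the picture:

$$\sigma = \sqrt{\operatorname{Var}(X)} = \sqrt{\mathbb{E}\big[(X - \mathbb{E}[X])^2\big]}$$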

This formula gives us a sort of average distance between a point and the mean. But are you thinking what I’m thinking:

## Why take the square root of a sum of squares rather than use absolute values?

This is a legitimate question. I should admit that it took me a long time to understand, and I still see statisticians who don’t give it the time it needs to be understood, let alone explained.

First of all, this is a 100-year-old debate. It’s called Standard Deviation (SD) vs Mean Deviation (MD), where the mean deviation is the average absolute distance from the mean, and there are two arguments for why statisticians use SD rather than MD.

- The first one is what Fisher pointed out in 1920, *under ideal circumstances*. Fisher proved that both statistics are good enough to describe the population deviation, but he found that the standard deviation is more efficient, in the sense of having the smallest probable error as an estimate of the population parameter. Concretely, over repeated large samples, the standard deviation of the mean deviations is 14% higher than the standard deviation of the standard deviations [1]. SD is therefore more consistent than MD, and this is why SD has been preferred in statistical theory.
- The second argument is that MD is quite difficult to manipulate algebraically, whereas squaring makes the algebra much easier to work with and offers properties that MD does not. Take the normal distribution as an example: we can state quite precisely the percentage of the distribution lying within each standard deviation of the mean. A large body of statistical theory is built on the standard deviation and squared deviations: least squares regression, analysis of variance, the central limit theorem and so on.
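Fisher’s efficiency argument can be explored with a small simulation. The sketch below is only an illustration under assumed settings (standard normal data, 2,000 samples of size 100, a fixed seed), and the helper names are hypothetical. Because MD and SD live on different scales, MD is multiplied by $\sqrt{\pi/2}$, the factor that makes it estimate $\sigma$ for normal data, before the two spreads are compared.

```python
import math
import random

def standard_deviation(sample):
    """Square root of the mean squared distance from the sample mean."""
    m = sum(sample) / len(sample)
    return math.sqrt(sum((x - m) ** 2 for x in sample) / len(sample))

def mean_deviation(sample):
    """Mean absolute distance from the sample mean."""
    m = sum(sample) / len(sample)
    return sum(abs(x - m) for x in sample) / len(sample)

def spread_of_estimates(n_samples=2000, sample_size=100, seed=0):
    """Draw repeated normal samples and measure how much SD and MD fluctuate."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    sd_estimates, md_estimates = [], []
    for _ in range(n_samples):
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        sd_estimates.append(standard_deviation(sample))
        # Rescale MD so that both statistics estimate sigma (normal data assumed).
        md_estimates.append(mean_deviation(sample) * math.sqrt(math.pi / 2))
    return standard_deviation(sd_estimates), standard_deviation(md_estimates)

sd_spread, md_spread = spread_of_estimates()
variance_ratio = (md_spread / sd_spread) ** 2
print(f"spread of SD estimates:          {sd_spread:.4f}")
print(f"spread of rescaled MD estimates: {md_spread:.4f}")
print(f"variance ratio MD / SD:          {variance_ratio:.3f}")
```

For large normal samples, theory puts the ratio of the two variances at $\pi - 2 \approx 1.14$, so the simulated ratio should land close to that value. This seems to be where the 14% figure comes from; measured as a ratio of standard deviations, the gap is closer to 7%.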

We should note that SD is strongly affected by non-normality; in particular, it is sensitive to extremely high and extremely low values. Some studies have pointed out that MD is the more efficient of the two in realistic circumstances and for distributions that are not perfectly normal, and after all MD is easier to understand. So we can say that the choice depends on the data.
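The sensitivity of SD to extreme values is easy to see on a toy example. In the sketch below, the data and helper names are invented for illustration: one value out of 100 is replaced by an outlier, and we compare how much each statistic grows.

```python
import math

def standard_deviation(values):
    """Square root of the mean squared distance from the mean."""
    m = sum(values) / len(values)
    return math.sqrt(sum((x - m) ** 2 for x in values) / len(values))

def mean_deviation(values):
    """Mean absolute distance from the mean."""
    m = sum(values) / len(values)
    return sum(abs(x - m) for x in values) / len(values)

# Hypothetical data: 100 well-behaved values centred on 0.
clean = [-1.0, 0.0, 1.0] * 33 + [0.0]
# Same data with the last value replaced by a single extreme one.
with_outlier = clean[:-1] + [50.0]

sd_growth = standard_deviation(with_outlier) / standard_deviation(clean)
md_growth = mean_deviation(with_outlier) / mean_deviation(clean)
print(f"SD grows by a factor of {sd_growth:.1f}")  # roughly 6.2
print(f"MD grows by a factor of {md_growth:.1f}")  # roughly 2.0
```

Squaring gives the outlier a weight proportional to the square of its distance from the mean, which is why SD reacts much more strongly than MD here.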

Another way to interpret SD, and to become more familiar with it, is to look at our data as a point in a multidimensional space where each observation is a coordinate along its own dimension. Treating the observations as independent gives us orthogonal axes, that is, a Euclidean space. So, by the Pythagorean theorem, the distance between two vectors $\mathbf{x}$ and $\mathbf{y}$ is:

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}$$

and the standard deviation is this distance divided by $\sqrt{n}$, the square root of the number of observations, assuming the **y** vector is the mean vector, i.e. the vector whose every coordinate equals the mean.
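This geometric reading can be checked numerically. The observations below are made up, and the variable names are only illustrative; `statistics.pstdev` serves as the reference value.

```python
import math
import statistics

# Hypothetical observations, treated as one point in 5-dimensional space.
x = [2.0, 4.0, 4.0, 5.0, 10.0]
n = len(x)

# The "mean vector": every coordinate equals the mean of the observations.
mean = sum(x) / n
y = [mean] * n

# Euclidean (Pythagorean) distance between the data vector and the mean vector.
distance = math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Dividing by sqrt(n) turns that distance into the population standard deviation.
sd_from_distance = distance / math.sqrt(n)
print(distance)                                # 6.0
print(sd_from_distance, statistics.pstdev(x))  # the two values agree
```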

# Estimation of Variance

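Now suppose that we have a sample of $n$ observations $x_1, \dots, x_n$ and we want to estimate the true variance $\sigma^2$ of the population. We can naively reuse the original formula, with the sample mean $\bar{x}$ in place of the population mean $\mu$:

$$S_n^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$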

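This is actually a biased estimator of the true variance. In fact, we’re underestimating the variance with this formula and, for sample mean $\bar{x}$ and population mean $\mu$, we have:

$$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 \le \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$$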

Let’s prove that this holds whatever the value of the population mean is.
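One way to sketch the argument is to write each deviation from the population mean $\mu$ as a deviation from the sample mean $\bar{x}$ plus the offset $\bar{x} - \mu$ (symbols assumed here for a sample $x_1, \dots, x_n$). The cross term vanishes because $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$:

$$
\begin{aligned}
\sum_{i=1}^{n}(x_i - \mu)^2 &= \sum_{i=1}^{n}\big((x_i - \bar{x}) + (\bar{x} - \mu)\big)^2 \\
&= \sum_{i=1}^{n}(x_i - \bar{x})^2 + 2(\bar{x} - \mu)\sum_{i=1}^{n}(x_i - \bar{x}) + n(\bar{x} - \mu)^2 \\
&= \sum_{i=1}^{n}(x_i - \bar{x})^2 + n(\bar{x} - \mu)^2 \\
&\ge \sum_{i=1}^{n}(x_i - \bar{x})^2
\end{aligned}
$$

Equality holds only when $\bar{x} = \mu$, whatever the value of $\mu$.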

In other words, the distances between the data and the sample mean tend to be smaller than the distances between the data and the population mean, unless the sample mean is exactly the same as the population mean, and this pretty much never happens. The deviations around the population mean therefore give a larger value, and this larger value is what we are supposed to estimate. However, we’re underestimating it. To compensate for this underestimation, we divide by n-1 instead of n when measuring distances from the sample mean $\bar{x}$, and the new formula is:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
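The underestimation shows up clearly in a simulation. The sketch below assumes a normal population with variance 4, small samples of size 5 and a fixed seed; the helper names are illustrative. It averages both estimators over many samples.

```python
import random

def naive_variance(sample):
    """Squared deviations from the sample mean, divided by n."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / len(sample)

def corrected_variance(sample):
    """Same sum of squares, divided by n - 1."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / (len(sample) - 1)

rng = random.Random(42)   # fixed seed so the run is repeatable
true_variance = 4.0       # assumed population: normal with sigma = 2
sample_size = 5
n_trials = 50_000

naive, corrected = [], []
for _ in range(n_trials):
    sample = [rng.gauss(0.0, 2.0) for _ in range(sample_size)]
    naive.append(naive_variance(sample))
    corrected.append(corrected_variance(sample))

avg_naive = sum(naive) / n_trials          # theory predicts (n-1)/n * 4 = 3.2
avg_corrected = sum(corrected) / n_trials  # theory predicts 4.0
print(f"true variance:              {true_variance:.2f}")
print(f"average naive estimate:     {avg_naive:.2f}")
print(f"average corrected estimate: {avg_corrected:.2f}")
```

Small samples are used on purpose: the gap between dividing by n and by n-1 shrinks as the sample size grows.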

## Why divide by n-1 and not n-2 or n-1.5?

Another legitimate question: why does dividing by this exact number calibrate our estimate?

In statistics, one way to evaluate the quality of an estimator is to check whether it is unbiased. This means that the expected value of the estimator equals the true value of the parameter. For example, the sample mean is an unbiased estimator of the population mean.
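As a quick check, assuming the observations $x_1, \dots, x_n$ all come from the same population with mean $\mu$ (notation introduced here), linearity of expectation gives:

$$\mathbb{E}[\bar{x}] = \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n} x_i\right] = \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[x_i] = \frac{1}{n}\cdot n\mu = \mu$$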

Now let’s do the same check for the variance and see whether its estimator is unbiased or not.
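Here is a sketch of that check, assuming the observations are independent and identically distributed with mean $\mu$ and variance $\sigma^2$, and writing $S_n^2$ for the estimator that divides by $n$ (notation assumed here). Expanding each deviation around $\mu$ gives $\sum_{i}(x_i - \bar{x})^2 = \sum_{i}(x_i - \mu)^2 - n(\bar{x} - \mu)^2$, and independence gives $\operatorname{Var}(\bar{x}) = \sigma^2 / n$, so:

$$
\begin{aligned}
\mathbb{E}\left[\sum_{i=1}^{n}(x_i - \bar{x})^2\right]
&= \sum_{i=1}^{n}\mathbb{E}\big[(x_i - \mu)^2\big] - n\,\mathbb{E}\big[(\bar{x} - \mu)^2\big] \\
&= n\sigma^2 - n\cdot\frac{\sigma^2}{n} \\
&= (n-1)\,\sigma^2
\end{aligned}
$$

Dividing by $n$ therefore gives $\mathbb{E}[S_n^2] = \frac{n-1}{n}\sigma^2 < \sigma^2$, while dividing by $n-1$ gives exactly $\sigma^2$. Under these assumptions, $n-1$ is the only divisor that removes the bias, which is why $n-2$ or $n-1.5$ would not do.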

# Conclusion

The key points to keep from this article are:

- Variance measures the spread of data, with an emphasis on extremely low and extremely high values.
- Standard deviation is the data fluctuation metric that measures, roughly on average, how far a point is from the mean.
- We use standard deviation over mean deviation because the former is more consistent and the latter is difficult to manipulate algebraically.
- We divide by n-1 in the estimator because dividing by n underestimates the variance, which means dividing by n makes the estimator biased.

[1] Stephen Gorard, Revisiting a 90-year-old debate: the advantages of the mean deviation (2004), Why do we use the standard deviation?