In a recent tweet I asked the question why we use \(n - 1\) to calculate the variance of a sample. Many people contributed an answer, but many of them were of the type I feared. The response consists of some statistical jargon that confuses me more, rather than less. Some of the responses were very useful, though, so I recommend checking out the replies to the tweet. Based on some of the responses I received, I will try to describe my favorite way of looking at the issue.

If you want to follow along in R, you can copy the code from each code section; beginning with some setup code.

```
# Load packages
library(tidyverse)
library(patchwork)
# Create color variables for the plots
green <- "#00B88D"
yellow <- "#f39c12"
off_white <- "#cccccc"
# Create our own variance function that returns the population or sample
# variance
my_var <- function(x, population = FALSE) {
if (population) {
sum((x - mean(x))^2)/length(x)
} else {
sum((x - mean(x))^2)/(length(x) - 1)
}
}
```

## The Formula

The formula for calculating the variance is:

\[\frac{\sum(x_i - \overline{x})^2}{n}\]

The variance is a measure of the dispersion around the mean, and in that sense this formula makes sense. We calculate all the deviations from the mean (\(x_i - \overline{x}\)), square them (for reasons I might go into in a different post) and sum them. We then divide this sum by the number of observations as a scaling factor. If we ignore this number, we could get a very high variance simply by observing a lot of data. So, to fix that problem, we divide by the total number of observations.

However, this is the formula for the *population* variance. The formula for calculating the variance of a *sample* is:

\[\frac{\sum(x_i - \overline{x})^2}{n - 1}\]

Why do we divide by *n* - 1?

If you Google this question, you will get a variety of answers. You might find a mathematical proof of why it needs to be \(n - 1\) or something about degrees of freedom. These kinds of answers don’t work for me. I trust them to be correct, but it doesn’t produce any insight. It does not actually help me understand 1) the problem and 2) why the solution is the solution that it is. So, below I am going to try to figure it out in a way that actually makes conceptual and intuitive sense (to me).

## The Problem

The problem with using the population variance formula to calculate the variance of a sample is that it is biased. It is biased in that it produces an *underestimation* of the true variance. Let’s demonstrate that with some simulated data.

We simulate a population of 1000 data points from a uniform distribution with a range from 1 to 10. Below I show the histogram that represents our population.

```
# Set the seed for reproducibility
set.seed(1212)
# Create a population consisting of values ranging from 1 to 10
population <- sample(1:10, size = 1000, replace = TRUE)
# Get the mean and population variance
mu <- sum(population)/length(population)
sigma <- my_var(population, population = TRUE)
# Visualize the population
ggplot(tibble(population = population), mapping = aes(x = population)) +
geom_bar(fill = green) +
labs(x = "x", y = "n") +
scale_x_continuous(breaks = 1:10)
```

The variance is 8.76. Note that this is our population variance (often denoted as \(\sigma^2\)). We want to estimate this value using samples drawn from our population, so let’s do that.

To start, we can draw a single sample of size 5. Say we do that and get the following values: 7, 6, 3, 5, 5. We can then calculate the variance in two ways, using division by \(n\) and division by \(n - 1\). In the former case, this will result in 1.76 and in the latter case it results in 2.2.

Now let’s do that many many times. Below I show the results of draws from our population. I simulated drawing samples of size 2 to 10, each 1000 different times. I then plotted for each sample size the average biased variance (dividing by \(n\); left) and the average unbiased variance (dividing by \(n - 1\); right).

```
# Create an empty data frame with the simulation parameters
samples <- crossing(
n = 2:20,
i = 1:1000
)
# Calculate the mean, sample variance, and population variance
# for each combination of n and i
samples <- samples %>%
rowwise() %>%
mutate(
var_unbiased = my_var(sample(population, n), population = FALSE),
var_biased = my_var(sample(population, n), population = TRUE)
)
p_biased <- ggplot(samples, aes(x = n, y = var_biased)) +
geom_hline(yintercept = sigma, linetype = "dashed", color = yellow) +
stat_summary(fun = "mean", geom = "point", color = green, size = 3) +
labs(x = "Sample size (n)", y = "Variance") +
coord_cartesian(ylim = c(0, sigma + 1)) +
scale_x_continuous(breaks = seq(from = 2, to = 20, by = 2)) +
ggtitle("Variance with division by n")
p_unbiased <- ggplot(samples, aes(x = n, y = var_unbiased)) +
geom_hline(yintercept = sigma, linetype = "dashed", color = yellow) +
stat_summary(fun = "mean", geom = "point", color = green, size = 3) +
labs(x = "Sample size (n)", y = "Variance") +
coord_cartesian(ylim = c(0, sigma + 1)) +
scale_x_continuous(breaks = seq(from = 2, to = 20, by = 2)) +
ggtitle("Variance with division by n - 1")
p_biased + p_unbiased
```

We see that the biased measure of variance is indeed biased. The average variance is lower than the true variance (indicated by the dashed line), for each sample size. We also see that the unbiased variance is indeed unbiased. On average, the sample variance matches that of the population variance.

The results of using the biased measure of variance reveals several clues for understanding the solution to the bias. We see that the amount of bias is larger when the sample size of the samples is smaller. So the solution should be a function of sample size, such that the required correction will be smaller as the sample size increases. We also see that that the bias at \(n = 2\) is half that of the true variance, \(\frac23\) at \(n = 3\), \(\frac34\) at \(n = 4\), and so on. Interesting.

But before we go into the solution, we still need to figure out what exactly causes the bias.

Ideally we would estimate the variance of the sample by subtracting each value from the population mean. However, since we don’t know what the population mean is, we use the next best thing—the sample mean. This is where the bias comes in. When you use the sample mean, you’re guaranteed that the mean lies somewhere within the range of your data points. In fact, the mean of a sample minimizes the sum of squared deviations from the mean. This means that the sum of deviations from the sample mean is *always* smaller than the sum of deviations from the population mean. The only exception to that is when the sample mean happens to be the population mean.

Let’s illustrate this with a few graphs. Below are two graphs. In each graph I show 10 data points that represent our population. I also highlight two data points from this population, which represents our sample. In the left graph I show the deviations from the sample mean and in the right graph the deviations from the population mean.

```
x <- c(1, 2, 4, 4, 4, 6, 8, 9, 10, 10)
sample1 <- tibble(
index = 1:10,
value = x,
sample = c(0, 0, 0, 0, 0, 0, 1, 0, 0, 1),
mean = mean(c(8, 10)),
mu = mean(x)
) %>%
mutate(
mean = ifelse(sample == 1, mean, NA),
mu = ifelse(sample == 1, mu, NA)
)
p1 <- ggplot(sample1, aes(x = index, y = value, color = factor(sample))) +
geom_hline(aes(yintercept = mu), linetype = "dashed", color = off_white) +
geom_point(size = 2.5) +
geom_hline(aes(yintercept = mean), linetype = "solid", color = green,
size = 1) +
geom_segment(aes(xend = index, y = mean, yend = value), linetype = "solid",
size = 1) +
coord_flip() +
scale_color_manual(values = c(off_white, green)) +
scale_y_continuous(breaks = 1:10) +
guides(color = FALSE) +
theme(
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank()
) +
labs(y = "") +
ggtitle("Deviations from the sample mean")
p2 <- ggplot(sample1, aes(x = index, y = value, color = factor(sample))) +
geom_hline(aes(yintercept = mean), linetype = "dashed", color = off_white) +
geom_point(size = 2.5) +
geom_hline(aes(yintercept = mu), linetype = "solid", color = green,
size = 1) +
geom_segment(aes(xend = index, y = mu, yend = value), linetype = "solid",
size = 1) +
coord_flip() +
scale_color_manual(values = c(off_white, green)) +
scale_y_continuous(breaks = 1:10) +
guides(color = FALSE) +
theme(
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank()
) +
labs(y = "") +
ggtitle("Deviations from the population mean")
p1 + p2
```

We see that in the left graph the sum of squared deviations is much smaller than in the right graph. The sum is \((8 - 9)² + (10 - 9)² = 2\) in the left graph and in the right graph it’s \((8 - 5.8)² + (10 - 5.8)² = 22.48\). The sum is smaller when using the sample mean compared to using the population mean.

This is true for any sample you draw from the population (again, except when the sample mean happens to be the same as the population mean). Let’s look at one more draw where the sample mean is closer to the population mean.

```
sample2 <- tibble(
index = 1:10,
value = x,
sample = c(0, 1, 0, 0, 0, 0, 0, 0, 1, 0),
mean = mean(c(2, 10)),
mu = mean(x)
) %>%
mutate(
mean = ifelse(sample == 1, mean, NA),
mu = ifelse(sample == 1, mu, NA)
)
p1 <- ggplot(sample2, aes(x = index, y = value, color = factor(sample))) +
geom_hline(aes(yintercept = mu), linetype = "dashed", color = off_white,
size = 0.75) +
geom_point(size = 2.5) +
geom_hline(aes(yintercept = mean), linetype = "solid", color = green,
size = 0.75) +
geom_segment(aes(xend = index, y = mean, yend = value), linetype = "solid",
size = 0.75) +
coord_flip() +
scale_color_manual(values = c(off_white, green)) +
scale_y_continuous(breaks = 1:10) +
guides(color = FALSE) +
theme(
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank()
) +
labs(y = "") +
ggtitle("Deviations from the sample mean")
p2 <- ggplot(sample2, aes(x = index, y = value, color = factor(sample))) +
geom_hline(aes(yintercept = mean), linetype = "dashed", color = off_white,
size = 0.75) +
geom_point(size = 2.5) +
geom_hline(aes(yintercept = mu), linetype = "solid", color = green,
size = 0.75) +
geom_segment(aes(xend = index, y = mu, yend = value), linetype = "solid",
size = 0.75) +
coord_flip() +
scale_color_manual(values = c(off_white, green)) +
scale_y_continuous(breaks = 1:10) +
guides(color = FALSE) +
theme(
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank()
) +
labs(y = "") +
ggtitle("Deviations from the population mean")
p1 + p2
```

Here the sum in the left graph is \((2 - 6)² + (10 - 6)² = 32\) and the sum in the right graph is \((2 - 5.8)² + (10 - 5.8)² = 32.08\). The difference is small now, but using the sample mean still results in a smaller sum compared to using the population mean.

In short, the source of the bias comes from using the sample mean instead of the population mean. The sample mean is always guaranteed to be in the middle of the observed data, thereby reducing the variance, and creating an underestimation.

## The Solution

Now that we know that the bias is caused by using the sample mean, we can figure out how to solve the problem.

Looking at the previous graphs, we see that if the sample mean is far from the population mean, the sample variance is smaller and the bias is large. If the sample mean is close to the population mean, the sample variance is larger and the bias is small. So, the more the sample mean moves around the population mean, the greater the bias.

In other words, besides the variance of the data points around the sample mean, there is also the variance of the sample mean around the population mean. We need both variances in order to accurately estimate the population variance.

The population variance is thus the sum of two variances:

\[\sigma^2_{sample} + \sigma^2_{\vphantom{sample}mean} = \sigma^2_{population}\] Let’s confirm that this is true. For that we need to know how to calculate the variance of the sample mean around the population mean. This is relatively simple; it’s the variance of the population divided by n (\(\frac{\sigma^2}n\)). This makes sense because the greater the variance in the population, the more the mean can jump around, but the more data you sample, the closer you get to the population mean.

Now that we can calculate both the variance of the sample and the variance of the sample mean, we can check whether adding them together results in the population variance.

Below I show a graph in which I again sampled from our population with varying sample sizes. For each sample, I calculated the sample variance (the biased one) and the variance of the mean of that sample (\(\frac{\sigma^2}n\))^{1}. I did this 1000 times per sample size, took the average of each and put them on top of each other. I also added a dashed line to indicate the variance of the population, which is the benchmark we’re trying to reach.

```
# Calculate the variance sources per sample size
variance_sources <- samples %>%
mutate(var_mean = var_unbiased / n) %>%
group_by(n) %>%
summarize(
var_biased = mean(var_biased),
var_unbiased = mean(var_unbiased),
var_mean = mean(var_mean)
) %>%
pivot_longer(cols = c(var_biased, var_mean), names_to = "variance_source",
values_to = "variance") %>%
mutate(variance_source = recode(variance_source, "var_biased" =
"sample", "var_mean" = "sample mean"))
ggplot(variance_sources, aes(x = n, fill = variance_source, y = variance)) +
geom_col() +
geom_hline(yintercept = sigma, linetype = "dashed", color = yellow) +
scale_fill_manual(values = c(green, yellow)) +
labs(x = "Sample size (n)", y = "Variance", fill = "Variance source:")
```

Indeed, we see that the variance of the sample and the variance of the mean of the sample together form the population variance.

## The Math

Now that we know that the variance of the population consists of the variance of the sample and the variance of the sample mean, we can figure out the correction factor we need to apply to make the biased variance measure unbiased.

Previously, we found an interesting pattern in the simulated samples, which is also visible in the previous figure. We saw that at a sample size \(n=2\), the (biased) sample variance appears to be half that of the (unbiased) population variance. At sample size \(n=3\), it’s \(\frac23\). At sample size \(n=4\), it’s \(\frac34\), and so on.

This means that we can fix the biased variance measure by multiplying it with \(\frac{n}{(n-1)}\). At sample size \(n = 2\), this would mean we multiply the biased variance by \(\frac21 = 2\). For sample size \(n=3\), \(\frac32 = 1.5\). At sample size \(n=4\), \(\frac43 = 1 \frac13\), resulting in the unbiased variance.

In other words, to unbias the biased variance measure, we multiply it by a correction factor of \(\frac{n}{(n-1)}\). But where does this correction factor come from?

Well, because the sample variance misses the variance of the sample mean, we can expect that the variance of the sample is biased by an amount equal to the variance of the population minus the variance of the sample mean. In other words:

\[\sigma^2 - \frac{\sigma^2}n\]

Rewriting this ^{2}, produces:

\[\sigma^2\cdot\frac{n - 1}n\] The variance of a sample will be biased by an amount equal to \(\frac{n - 1}n\). To correct that bias we should multiply the sample variance by the inverse of this bias: \(\frac{n}{n-1}\) ^{3}. This is also called Bessel’s correction.

So, an unbiased measure of our sample variance is the biased sample variance times the correction factor:

\[\frac{\sum(x_i - \overline{x})^2}{n}\cdot{\frac n{n-1}}\] Because the *n* in the denominator of the left term (the biased variance formula) cancels out the *n* in the numerator of the right term (the bias correction), the formula can be rewritten as:

\[\frac{\sum(x_i - \overline{x})^2}{n-1}\]

## Summary

We calculate the variance of a sample by summing the squared deviations of each data point from the sample mean and dividing it by \(n - 1\). The \(n - 1\) actually comes from a correction factor \(\frac n{n-1}\) that is needed to correct for a bias caused by taking the deviations from the sample mean rather than the population mean. Taking the deviations from the sample mean only constitutes the variance around the sample mean, but ignores the variation of the sample mean around the population mean, producing an underestimation equal to the size of the variance of the sample mean: \(\frac{\sigma^2}{n}\). The correction factor corrects for this underestimation, producing an unbiased estimate of the population variance.

*This post was last updated on 2021-06-22.*

Here I cheat a little because in order to calculate the variance of the sample mean, I need to use the unbiased variance formula.↩︎

Here are the steps to rewrite the formula: \[\sigma^2 - \frac{\sigma^2}n\] Add an n to the numerator and denominator of the left term: \[\frac{\sigma^2n}n - \frac{\sigma^2}n\] Combine the terms: \[\frac{\sigma^2n - \sigma^2}n\] Simplify the numerator: \[\frac{\sigma^2(n - 1)}n\] Move \(\sigma^2\) out of the numerator: \[\sigma^2\cdot\frac{n - 1}n\]↩︎

The inverse of \(\frac{n - 1}n\) is \(\frac{1}{\frac{n - 1}n}\). Multiply both the numerator and the denominator by \(n\) and you get \(\frac{n}{n-1}\).↩︎