While central tendency is important, we often want to understand more than just where the center of a distribution is located; measures of the spread of the values around that center are often just as useful. Such statistics are referred to as measures of dispersion or variability. Common measures I will discuss in this tutorial include the range, interquartile range, variance, and standard deviation.
I use the following packages in this tutorial.
library(psych)
library(ggplot2)
Statistical notation, while often confusing and frustrating at first, is very useful for conveying complex ideas in a compact and precise manner. The table below contains the notation relevant to measures of variability.
Symbol | Meaning |
---|---|
\(IQR\) | Interquartile range |
\(\sigma^2\) | Population variance |
\(s^2\) | Sample variance |
\(\sigma\) | Population standard deviation |
\(s\) | Sample standard deviation |
We need some way to quantify the variability of the scores in a distribution, ideally one that reflects all of the scores. We already have a single number for the average or “typical” score: the mean. Ideally, we also want a single number that represents the typical variation of the scores around the mean. The range is the simplest way to describe variability, or how scores are dispersed across possible values. The range is simply the difference between the highest and lowest values of the variable, which also means it depends entirely on the two most extreme scores and so is highly sensitive to outliers.
\[ \text{Range} = \text{Highest} - \text{Lowest} \]
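As a quick illustration with a small made-up vector (the values here are arbitrary, just for demonstration):

scores <- c(3, 7, 2, 9, 5)   # hypothetical toy data
max(scores) - min(scores)    # range = 9 - 2 = 7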
GRE scores for two simulated classes with essentially the same mean (about 151.3) and essentially the same range (about 40):
set.seed(1234)
x1 <- rnorm(1000, 151.3, 8)      # class 1: wide spread
x1 <- x1[x1 > 130 & x1 < 170]    # keep scores within the 130-170 band
x2 <- c(rnorm(length(x1) - 2, 151.3, 3), 130, 170)  # class 2: narrow spread plus two extreme scores to match the range
par(mfrow = c(1,2))
hist(x1, col = "skyblue", xlim = c(130, 170), main = NULL)
abline(v = mean(x1), col = "red", lwd = 2)
hist(x2, col = "skyblue", xlim = c(130, 170), main = NULL, breaks = 15)
abline(v = mean(x2), col = "red", lwd = 2)
par(mfrow = c(1, 1))
range(x1)
[1] 130 169
range(x2)
[1] 130 170
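Note that R's range() function returns the minimum and maximum rather than their difference, so to get the statistical range itself you can take the difference of the two:

diff(range(x1))   # highest minus lowest
diff(range(x2))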
The interquartile range is based on quartiles, the three values that split the ordered scores into four equal parts. We can get the quartiles using the quantile() function in R. Here I do this for x1 and x2 from above.
quantile(x1)
0% 25% 50% 75% 100%
130 146 151 156 169
quantile(x2)
0% 25% 50% 75% 100%
130 149 151 153 170
The interquartile range is simply the range of values between the \(25^{th}\) and \(75^{th}\) percentiles (the first and third quartiles). A benefit of this measure is that it is not as sensitive to outliers as the range. For x1 above, the interquartile range runs from 146 to 156, so \(IQR = 156 - 146 = 10\). Notice that the \(50^{th}\) percentile, which is the second quartile, is the median:
quantile(x1, probs = c(.25, .75))
25% 75%
146 156
median(x1)
[1] 151
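Base R also provides the IQR() function, which returns this difference directly:

IQR(x1)   # 75th percentile minus 25th percentile
IQR(x2)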
A major limitation of the range and the interquartile range is that each is calculated from only two values in the data. What we really want is a measure of dispersion that takes into consideration all the values of the variable. The variance does this. The variance is a very important measure of variability, and it is closely related to the standard deviation, which we discuss next. In words, the variance is the average squared deviation from the mean. But let's try to understand why we calculate the variance the way we do. To do that, we need to understand deviations.
We can calculate the distance between the mean and each score. These distances are called deviations; each score has one. I have plotted the histogram of the deviations for each of the two classes' GRE scores below.
par(mfrow = c(1,2))
d1 <- x1 - mean(x1)   # deviation of each score from the class 1 mean
d2 <- x2 - mean(x2)   # deviation of each score from the class 2 mean
hist(d1, col = "skyblue", xlim = c(-30, 30), main = NULL)
abline(v = mean(d1), col = "red", lwd = 2)
hist(d2, col = "skyblue", xlim = c(-30, 30), main = NULL, breaks = 15)
abline(v = mean(d2), col = "red", lwd = 2)
par(mfrow = c(1, 1))
We might think to just take the mean of the deviations as a measure of the average or “typical” distance from the mean for each set of scores. But the mean deviation for the first set is 0, and the mean deviation for the second set is also 0. This is because, being a measure of central tendency, the mean sits in the middle of the scores, so the positive distances of scores above the mean cancel out the negative distances of scores below it.
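We can verify this in R; both means are zero, up to floating-point rounding error:

mean(d1)   # effectively 0
mean(d2)   # effectively 0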
We could take the mean of the absolute values of the deviations, but a more ingenious solution is to square the deviations.
This does two things:

- It makes every deviation positive, so positive and negative deviations no longer cancel out.
- It gives larger deviations proportionally more weight, making the measure sensitive to extreme scores.
The sum of the squared deviations is often referred to simply as the “sum of squares”, symbolized \(SS\). The sum of squares is not very meaningful by itself, so we often calculate the mean squared deviation by dividing the \(SS\) by \(N\). This gives us the population variance:
\[ \sigma^2 = \frac{SS}{N} = \frac{\Sigma(x - \mu)^2}{N} \]
For a sample, we divide by \(n - 1\) rather than \(n\); measuring deviations around the sample mean \(\bar{X}\) makes them slightly too small on average, and dividing by \(n - 1\) corrects for this:

\[ s^2 = \frac{SS}{n - 1} = \frac{\Sigma{(x - \bar{X})^2}}{n - 1} \]
Setting the formulas for the population mean and population variance side by side:

\[ \mu = \frac{\Sigma x}{N}, \quad \sigma^2 = \frac{\Sigma(x - \mu)^2}{N}. \]
Comparing the formulae for the mean and variance makes clear that the variance is the mean squared deviation.
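As a quick check, we can compute the sum of squares and both variances by hand; note that R's built-in var() uses the sample formula with \(n - 1\):

ss <- sum((x1 - mean(x1))^2)   # sum of squared deviations
ss / length(x1)                # population variance: divide by N
ss / (length(x1) - 1)          # sample variance: divide by n - 1
var(x1)                        # matches the sample variance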
If the variance is the mean squared distance from the mean, then taking the square root of the variance gives us a measure back in the original units of the scores: the standard deviation, which represents the typical distance of a score from the mean.
\[ \sigma = \sqrt{\sigma^2} = \sqrt{\frac{SS}{N}} = \sqrt{\frac{\Sigma{(x -\mu)^2}}{N}}. \]
\[ s = \sqrt{s^2} = \sqrt{\frac{SS}{n-1}} = \sqrt{\frac{\Sigma{(x - \bar{X})^2}}{n-1}}. \]
The variance of x1 is 54.71 and the variance of x2 is 9.51. The standard deviation of x1 is 7.4 and the standard deviation of x2 is 3.08. The much smaller standard deviation of x2 reflects what the histograms showed: its scores cluster far more tightly around the mean.
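These values come from R's var() and sd() functions, which both use the sample formulas with \(n - 1\) in the denominator; sd() is simply the square root of var():

var(x1)
sd(x1)
sqrt(var(x1))   # identical to sd(x1)
var(x2)
sd(x2)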
d <- data.frame(x1, x2)
describe(d, fast = TRUE)
vars n mean sd min max range se
x1 1 980 151 7.4 130 169 39 0.24
x2 2 980 151 3.1 130 170 40 0.10