This tutorial is based on chapter 2 of Pedhazur (1997), which discusses simple linear regression and correlation. Here, I will present the conceptual formulas as in the chapter, but will not use the calculation formulas, as we will be using R to do the calculations.
I will use the following R packages in this tutorial:
library(car)
library(arm)
library(psych)
library(ggplot2)
We will use the data set in that chapter to illustrate the concepts and calculations. Let's load those data into R. On page 16 of Pedhazur (1997) there is a table with several columns; we are interested in the X and Y columns, and want to enter them into a data frame that we will call learn. I will use the rep function, which repeats numbers, to take advantage of the pattern in X, then enter the Y values manually.
learn <- data.frame(
X = rep(1:5, each = 4),
Y = c(3,5,6,9,4,6,7,10,4,6,8,10,5,7,9,12,7,10,12,6)
)
learn
X Y
1 1 3
2 1 5
3 1 6
4 1 9
5 2 4
6 2 6
7 2 7
8 2 10
9 3 4
10 3 6
11 3 8
12 3 10
13 4 5
14 4 7
15 4 9
16 4 12
17 5 7
18 5 10
19 5 12
20 5 6
# Check data against Table 2.1 sums and means for the two columns.
colSums(learn)
X Y
60 146
colMeans(learn)
X Y
3.0 7.3
These data were created to represent a learning experiment: X is the number of hours studied and Y is achievement in mathematics (see the chapter for more information).
The concept of variability is central to understanding statistics in general and linear models specifically. The variance and standard deviation are the two most common measures of variability. Researchers often also want to know how the variability of one variable is related to variability in the other. For this, the covariance and correlation are often used. These concepts are fundamental to understanding simple linear regression.
The variance of a variable is the average squared deviation of the individual scores from the mean score. It is a measure of dispersion, which gives us a sense of how far the values in a population tend to fall from the mean value. The larger the variance, the wider the range of values in the distribution.
If, for the vector \(X\), we define a vector of deviation scores \(x\) as the difference between each value in \(X\) and the mean \(\bar{X}\), that is, \(x = X - \bar{X}\), then:
\[
s_x^2 = \frac{\Sigma{(X - \bar{X})^2}}{N - 1} = \frac{\Sigma{x^2}}{N - 1} \tag{2.1}
\] Let's use R to calculate the variance of our X variable:
X <- learn$X
N <- length(X)
First, I created a variable X, which is just a copy of the X column in learn (the number of hours studied). Then I created a scalar N, which is the sample size in our example (20).
We will use the deviation score to calculate the variance, so I create it here:
# Calculate deviation scores.
x <- X - mean(X)
Finally, we calculate the variance and assign it to the symbol sx2:
# Calculate variance using deviation scores with equation (2.1)
sx2 <- sum(x^2)/(N - 1)
# Print variance
sx2
[1] 2.11
We can think of the variance as the average squared deviation of the sample scores from their mean.
We did these "hand" calculations to demonstrate the concepts underlying the variance. We can calculate it much more easily with the var() function:
var(X)
[1] 2.11
The standard deviation is the square root of the variance, so we can calculate it in various ways, as follows.
sx <- sqrt((sum(x^2))/(N-1)) # by 'hand'
sx
[1] 1.45
sd(X) # built-in R function
[1] 1.45
The formula for covariance is similar to that of the variance.
\[ s_{xy} = \frac{\Sigma{(X - \bar{X})(Y - \bar{Y})}}{N - 1} = \frac{\Sigma{xy}}{N - 1} \tag{2.3} \]
Y <- learn$Y
y <- Y - mean(Y)
sxy <- sum(x*y)/(N - 1)
sxy
[1] 1.58
The last part of the chapter discusses the correlation model. I give the formula for the correlation, followed by the hand calculation in R and the built-in function. Notice that I use \(N-1\) instead of \(N\) with the sample estimates of the standard deviations, which is the formula we use in practice (we don't know the population values, and so must estimate them from the sample).
\[ r = \frac{\Sigma{xy}}{(N-1)s_x s_y} \]
sum(x*y)/((N-1)*sd(X)*sd(Y))
[1] 0.416
cor(X, Y)
[1] 0.416
One way to study the covariation of two variables is with simple regression. This method is best used in an experimental setting where the independent variable (IV) is manipulated and the effect of this manipulation on the dependent variable (DV) is evaluated. The simplest design is a completely randomized design (CRD) with one factor. In relation to the learning experiment in the chapter, this would entail randomly assigning students to one of the five levels of the IV, study time. Then the covariation of study time with math achievement (DV) would be studied. Often, in such a design the IV would be treated as a categorical variable, but because hours of study can be thought of as a continuous variable, we will treat it as continuous. This makes sense because study time is on a ratio scale of measurement, but it does mean we are assuming that the effect of study time is linear.
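Although it is not part of the chapter, a quick way to probe this linearity assumption is to compare the straight-line fit to a model that treats the five study times as unordered categories; if the categorical model does not fit appreciably better, a straight line is a reasonable summary. A minimal sketch (the model names mod_linear and mod_factor are mine):
# Fit the linear model and a model that treats hours of study as categories.
mod_linear <- lm(Y ~ X, data = learn)
mod_factor <- lm(Y ~ factor(X), data = learn)
# If the categorical model does not improve the fit much, linearity is a
# reasonable working assumption for these data.
anova(mod_linear, mod_factor)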
In the first section of this tutorial we entered the data into R manually. We can also import a CSV file; I do that here with the data file that is part of this project.
learn <- read.csv("data/learn.csv", header = TRUE)
This is a small data set so let’s look at it:
print(learn, row.names = FALSE)
X Y
1 3
1 5
1 6
1 9
2 4
2 6
2 7
2 10
3 4
3 6
3 8
3 10
4 5
4 7
4 9
4 12
5 7
5 10
5 12
5 6
It has two variables: X, the number of hours studied (the IV), and Y, the mathematics achievement scores measured after studying (the DV). We want to model the effect of hours of study on math achievement. We expect that studying helps with math achievement, so we suspect that as the IV increases we will see an increase in the DV.
First, let’s plot the data to see what we are dealing with.
plot(Y ~ X, data = learn,
xlab = "Hours of Studying (X)",
ylab = "Math Achievement (Y)",
main = "Effect of Hours of Study (IV) on Math Achievement (DV)")
We do see an upward trend in math achievement as the number of hours studied increases. It is always important to look at ALL of your data before you start summarizing it. Let's get some descriptive statistics:
describe(learn, fast = TRUE)
vars n mean sd min max range se
X 1 20 3.0 1.45 1 5 4 0.32
Y 2 20 7.3 2.62 3 12 9 0.59
To model the relation between X and Y we will use a simple linear model, which can be represented as follows:
\[ Y_i = \alpha + \beta X_i + \varepsilon_i \tag{2.5} \] See the chapter for details about this equation. Note that the \(i\) subscript indicates components of the equation that are unique to the individual, so each person in the population has a math score and a number of hours studied. Each individual also has a unique error (\(\varepsilon_i\)). We say that the \(i\) subscript indexes individuals in the population. However, the intercept (\(\alpha\)) and slope (\(\beta\)) are scalars, meaning single values that are the same for everyone in the population.
This equation (and the sample equation that follows) decomposes the variability in the outcome \(Y_i\) into two components: a systematic component, represented by \(\alpha + \beta X_i\), and a stochastic or error component, represented by \(\varepsilon_i\). The error components are also referred to as the model residuals, as they represent the variability in individual scores not explained by the model.
The sample equation is:
\[ Y_i = a + bX_i + e_i \tag{2.6} \]
and now the \(i\) subscript indexes individuals in the sample instead of the population.
Next, let's run a simple regression. We use the lm() function, with the first argument being a formula. In R, formulas take the form of an outcome (here Y) as a function of (represented by the binary operator ~) the model (here just X, though this part will grow as our models become more complex). Type ?formula in the R console for more information about formulas in R.
mod <- lm(Y ~ X, data = learn)
display(mod)
lm(formula = Y ~ X, data = learn)
coef.est coef.se
(Intercept) 5.05 1.28
X 0.75 0.39
---
n = 20, k = 2
residual sd = 2.45, R-Squared = 0.17
Make sure you can match the coefficients in the above output to the equation at the bottom of page 19. I repeat that equation here (but using \(\hat{Y}\) instead of \(Y'\)):
\[ \hat{Y} = 5.05 + 0.75 X \]
We can contrast this with equation 2.6:
\[ Y_i = a + bX_i + e_i \tag{2.6, repeated} \] Note that the difference between these two equations is, first, that the outcome of the predicted score equation is the predicted score, \(\hat{Y}\), and not \(Y\) itself, as in equation 2.6. Second, note that equation 2.6 includes the error term (i.e., the residual), while the predicted score equation does not.
The Pedhazur text uses \(Y'\), pronounced Y prime, as the symbol for predicted scores, but it is more common to use \(\hat{Y}\), pronounced y hat.
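Before computing the predicted scores, we can recover these coefficients from the quantities calculated earlier: the slope is the ratio of the covariance of X and Y to the variance of X, and the intercept follows from the fact that the regression line passes through the point of means (see the chapter for the conceptual formulas).
# Slope: covariance of X and Y divided by the variance of X.
b <- sxy/sx2
# Intercept: the line passes through (mean(X), mean(Y)).
a <- mean(Y) - b*mean(X)
c(a = a, b = b)
   a    b 
5.05 0.75 
These match the lm() output above. Next, we compute the predicted scores and residuals implied by the fitted equation: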
# Predicted scores
learn$yhat <- predict(mod)
# Deviations of predicted scores from the sample mean of Y.
learn$yhatDevs <- learn$yhat - mean(learn$Y)
# Squared deviations of predicted scores from the sample mean of Y.
learn$sqrDevs <- learn$yhatDevs^2
# Residual (Y - Yhat)
learn$resids <- resid(mod)
# Squared residuals
learn$sqrResids <- (resid(mod))^2
round(learn, 3)
X Y yhat yhatDevs sqrDevs resids sqrResids
1 1 3 5.80 -1.50 2.250 -2.80 7.840
2 1 5 5.80 -1.50 2.250 -0.80 0.640
3 1 6 5.80 -1.50 2.250 0.20 0.040
4 1 9 5.80 -1.50 2.250 3.20 10.240
5 2 4 6.55 -0.75 0.563 -2.55 6.502
6 2 6 6.55 -0.75 0.563 -0.55 0.303
7 2 7 6.55 -0.75 0.563 0.45 0.202
8 2 10 6.55 -0.75 0.563 3.45 11.903
9 3 4 7.30 0.00 0.000 -3.30 10.890
10 3 6 7.30 0.00 0.000 -1.30 1.690
11 3 8 7.30 0.00 0.000 0.70 0.490
12 3 10 7.30 0.00 0.000 2.70 7.290
13 4 5 8.05 0.75 0.562 -3.05 9.302
14 4 7 8.05 0.75 0.562 -1.05 1.102
15 4 9 8.05 0.75 0.562 0.95 0.903
16 4 12 8.05 0.75 0.562 3.95 15.603
17 5 7 8.80 1.50 2.250 -1.80 3.240
18 5 10 8.80 1.50 2.250 1.20 1.440
19 5 12 8.80 1.50 2.250 3.20 10.240
20 5 6 8.80 1.50 2.250 -2.80 7.840
round(colSums(learn),3)
X Y yhat yhatDevs sqrDevs resids sqrResids
60.0 146.0 146.0 0.0 22.5 0.0 107.7
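The column sums above contain the pieces of the sums-of-squares decomposition: 22.5 is the regression (explained) sum of squares and 107.7 is the residual sum of squares, and together they account for the total sum of squares of Y. A quick check, which also reproduces the R-squared reported by display():
# Total sum of squares of Y about its mean.
ss_total <- sum((learn$Y - mean(learn$Y))^2)
# Regression and residual sums of squares from the columns computed above.
ss_reg <- sum(learn$sqrDevs)
ss_res <- sum(learn$sqrResids)
# The total equals the sum of the two components.
c(ss_total, ss_reg + ss_res)
[1] 130.2 130.2
# Proportion of variance in Y explained by the model (R-squared).
round(ss_reg/ss_total, 3)
[1] 0.173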
ggplot(learn, aes(x = X, y = Y)) + geom_point() +
geom_smooth(method = "lm", level = .89)
`geom_smooth()` using formula 'y ~ x'
We can look at the uncertainty intervals (confidence intervals) of the coefficients.
confint(mod)
2.5 % 97.5 %
(Intercept) 2.3551 7.74
X -0.0626 1.56
The interval for X is very wide, and includes negative values, suggesting substantial uncertainty in the slope.
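The plot above drew an 89% uncertainty band (level = .89 in geom_smooth()). If we want coefficient intervals at the same level rather than the default 95%, confint() accepts a level argument:
# 89% confidence intervals for the coefficients, matching the plot's band.
confint(mod, level = 0.89)
These intervals will be somewhat narrower than the 95% intervals shown above.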