Analysing Longitudinal Data

Conceptual Issues

First, I list some issues to think about with longitudinal modeling. With most longitudinal data, we have time points or occasions nested within individuals. So, unlike previous data we look at, individuals are now level 2 instead of level 1.

Fixed and Varying Occasions

Time can be measured at fixed intervals, in which every individual is measured at the same intervals and or times. Alternatively, individuals can be measured at varying times. In this post we will look at longitudinal data with fixed occasions..

Advantages of Multilevel Longitudinal Modeling

Recall from the text that there are advantages to collecting longitudinal data over cross-sectional data. These include:

greater power with additional time points
the ability to look at how the outcome changes (patterns of change) and not just whether they change.

There are a number of advantages to using multilevel models to model longitudinal data. These include:

The ability to model growth curves that are different for each individual
The number and spacing of occasions can vary between individuals
The covariances between occasions can be explicitly modeled.
Repeated measures ANOVA is a special (and more restrictive) case of MLM (suggesting that we should start with the less restrictive model: MLM)
I is easy to add higher levels to model
It is straight-forward to add time-varying or time-constant explanatory variables to the model.

But, the measures must be calibrated across time points.

Example with Fixed Occasions

For this example, we are using data from chapter 5 of Hox, Moerbeek, and van de Schoot (2018). The data contain the college GPA of 200 college students collected over six occasions. A time varying covariate, the job status of each student, was also collected at the six occasions. Two time invariant variables, high school GPA and the students gender were also collected. There is another outcome variable named admitted, that is dichotomous and indicates whether the student was admitted to the university of their choice. We will not use this last variable in this post.

The purpose of this example study is to study the impact of having a job on college GPA. But, we will want to control for gender and high school GPA in our analyses.

First, we need to load the correct data. There are at least two ways we can structure Load all packages to be used. Note that the data for this project are in SPSS format, so you my need to install the haven package which allow for importing SPSS, Stata, and SAS data files into R. So, you may need to run install.packages('haven') before running the code chunk below.

# install the haven package which allows importing SPSS data, among other 
# formats (Stata, SAS)
library(haven) 
library(lme4)
library(lmerTest)
library(ggplot2)
library(texreg)
library(performance)
library(interactions)

Import Data

Import the gpa2long.sav file from the data directory of the longidtudinal_pgaCurran project.

gpa2long <- read_sav("data/gpa2long.sav")

Note that the variables in this data have preserved the SPSS labeling, which can be used to determine the factor labels. To see this use the str() function on the data frame.

Clean Data

The data is in good shape, all we need to do is convert the categorical variables to factors. Here, I leave occasion as numeric, to reflect the assumption that this variable measure evenly spaced time points, that start with zero at the first time point.

gpa2long <- within(gpa2long, {
  student <- factor(student)
  gender <- factor(sex, labels = c("male", "female"))
  admitted <- factor(admitted, labels = c("no", "yes"))
})

Explore Data

This first plot is a spaghetti plot, which just plots the outcome at each occasion with the points for each individual connected with a line. Its hard to see what is going on with each individual, but we can see a general trend in increases in GPA across occasion.

ggplot(gpa2long, aes(y = gpa, x = occas, group = student)) + 
  geom_line()

Another way to look at the general trend is to get a smoothed line summarizing the trend across individuals.

ggplot(gpa2long, aes(y = gpa, x = occas)) +
  geom_point() + geom_smooth(method = "loess")

The loess smoothing algorithm allows for a local average curve to be fit to the data, which would allow for a curvy line. The fact that the line looks very straight suggest a linear relation between gpa across occasions. This plot also demonstrate the fixed nature of the occasions, as they line up in descrete columns.

We can also look at the linear relation across occasions for each individual.

ggplot(gpa2long, aes(y = gpa, x = occas, group = student)) + 
  geom_smooth(method = "lm", se = FALSE)

Here we see that descriptively our sample demonstrates differences in the estimated slopes across individuals. There even seem to be some that are negative.

We can look at the mean GPA across time in a few ways:

boxplot(gpa ~ occas, gpa2long)

This plot actually shows the medians and other quartiles, but we see the upward trend in GPA.

Next, I calculate the mean GPA for each occasion, and print and plot them.

gpameans_occas <- aggregate(gpa ~ occas, gpa2long, mean)
gpameans_occas

  occas    gpa
1     0 2.5935
2     1 2.7155
3     2 2.8100
4     3 2.9180
5     4 3.0190
6     5 3.1340

plot(gpa ~ occas, gpameans_occas)

Exercise

Create some more plots exploring the relation of key variables with other variables in the data.
How would you summarize what you have learned about the data to this point?
Obtain some summary statistics of the data, now that you have a sense of data as a whole. Based on what you learned from the graphic exploration, are these statistics reasonable summaries of this data?

Models

Now we are ready to run a series of model of increasing complexity.

mod0 <- lmer(gpa ~ 1 + (1 | student), gpa2long, REML = FALSE)
modtime <- lmer(gpa ~ occas + (1 | student), gpa2long, REML = FALSE)
modtjob <- lmer(gpa ~ occas + job + (1 | student), gpa2long, REML = FALSE)
modfull <- lmer(gpa ~ occas + job + highgpa + gender + (1 | student), gpa2long,
                REML = FALSE)
summary(modfull)

Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
  method [lmerModLmerTest]
Formula: gpa ~ occas + job + highgpa + gender + (1 | student)
   Data: gpa2long

     AIC      BIC   logLik deviance df.resid 
   296.8    332.4   -141.4    282.8     1193 

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.92840 -0.59427 -0.02739  0.63407  2.96127 

Random effects:
 Groups   Name        Variance Std.Dev.
 student  (Intercept) 0.04497  0.2121  
 Residual             0.05514  0.2348  
Number of obs: 1200, groups:  student, 200

Fixed effects:
               Estimate Std. Error         df t value Pr(>|t|)    
(Intercept)     2.64147    0.09752  289.77132  27.086  < 2e-16 ***
occas           0.10245    0.00399  994.90795  25.679  < 2e-16 ***
job            -0.17221    0.01806 1100.57001  -9.534  < 2e-16 ***
highgpa         0.08470    0.02776  194.17579   3.051   0.0026 ** 
genderfemale    0.14725    0.03305  194.13594   4.455 1.42e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) occas  job    highgp
occas       -0.145                     
job         -0.430  0.102              
highgpa     -0.875  0.003  0.029       
genderfemal -0.251  0.003  0.027  0.073

htmlreg(list(mod0, modtime, modtjob, modfull), digits = 3, 
        custom.model.names = c("$M_1$: null model", "$M_2$:+occas.",
                               "$M_3$:+jobsat", "$M_4$:+gpa, gender"))

Statistical models
	\(M_1\): null model	\(M_2\):+occas.	\(M_3\):+jobsat	\(M_4\):+gpa, gender
(Intercept)	2.865^***	2.599^***	2.970^***	2.641^***
	(0.019)	(0.022)	(0.044)	(0.098)
occas		0.106^***	0.102^***	0.102^***
		(0.004)	(0.004)	(0.004)
job			-0.171^***	-0.172^***
			(0.018)	(0.018)
highgpa				0.085^**
				(0.028)
genderfemale				0.147^***
				(0.033)
AIC	919.456	401.649	318.399	296.760
BIC	934.726	422.009	343.849	332.390
Log Likelihood	-456.728	-196.825	-154.200	-141.380
Num. obs.	1200	1200	1200	1200
Num. groups: student	200	200	200	200
Var: student (Intercept)	0.057	0.063	0.052	0.045
Var: Residual	0.098	0.058	0.055	0.055
p < 0.001; p < 0.01; p < 0.05

Testing Nonlinear trends in GPA growth

modtimepoly <- lmer(gpa ~ poly(occas, 2) + (1 | student), gpa2long, REML = FALSE)
summary(modtimepoly)

Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
  method [lmerModLmerTest]
Formula: gpa ~ poly(occas, 2) + (1 | student)
   Data: gpa2long

     AIC      BIC   logLik deviance df.resid 
   403.6    429.1   -196.8    393.6     1195 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.6165 -0.6382 -0.0012  0.6389  2.8352 

Random effects:
 Groups   Name        Variance Std.Dev.
 student  (Intercept) 0.06336  0.2517  
 Residual             0.05803  0.2409  
Number of obs: 1200, groups:  student, 200

Fixed effects:
                  Estimate Std. Error         df t value Pr(>|t|)    
(Intercept)        2.86500    0.01911  200.00000 149.927   <2e-16 ***
poly(occas, 2)1    6.28964    0.24089 1000.00000  26.109   <2e-16 ***
poly(occas, 2)2   -0.01389    0.24089 1000.00000  -0.058    0.954    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) p(,2)1
ply(ccs,2)1 0.000        
ply(ccs,2)2 0.000  0.000

Adding Random Slopes

modfullrs <- lmer(gpa ~ occas + job + I(highgpa - mean(highgpa)) + gender + (occas |student), 
                  gpa2long, REML = FALSE)
summary(modfullrs)

Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
  method [lmerModLmerTest]
Formula: gpa ~ occas + job + I(highgpa - mean(highgpa)) + gender + (occas |  
    student)
   Data: gpa2long

     AIC      BIC   logLik deviance df.resid 
   188.1    233.9    -85.1    170.1     1191 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.0175 -0.5379 -0.0002  0.5431  3.3656 

Random effects:
 Groups   Name        Variance Std.Dev. Corr 
 student  (Intercept) 0.038233 0.19553       
          occas       0.003837 0.06194  -0.21
 Residual             0.041542 0.20382       
Number of obs: 1200, groups:  student, 200

Fixed effects:
                             Estimate Std. Error         df t value Pr(>|t|)
(Intercept)                 2.822e+00  4.474e-02  1.034e+03  63.077  < 2e-16
occas                       1.034e-01  5.586e-03  1.996e+02  18.506  < 2e-16
job                        -1.311e-01  1.726e-02  1.038e+03  -7.595 6.84e-14
I(highgpa - mean(highgpa))  8.854e-02  2.628e-02  1.976e+02   3.369 0.000907
genderfemale                1.157e-01  3.130e-02  1.978e+02   3.696 0.000284
                              
(Intercept)                ***
occas                      ***
job                        ***
I(highgpa - mean(highgpa)) ***
genderfemale               ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) occas  job    I(-m()
occas       -0.227                     
job         -0.846  0.069              
I(-mn(hgh)) -0.045  0.002  0.022       
genderfemal -0.392  0.002  0.030  0.073

htmlreg(list(modtime, modfullrs), 
        custom.model.names = c("$M_5$:+occas rand", 
                               "$M_6$:+cross-level int."))

Statistical models
	\(M_5\):+occas rand	\(M_6\):+cross-level int.
(Intercept)	2.60^***	2.82^***
	(0.02)	(0.04)
occas	0.11^***	0.10^***
	(0.00)	(0.01)
job		-0.13^***
		(0.02)
highgpa - mean(highgpa)		0.09^***
		(0.03)
genderfemale		0.12^***
		(0.03)
AIC	401.65	188.12
BIC	422.01	233.93
Log Likelihood	-196.82	-85.06
Num. obs.	1200	1200
Num. groups: student	200	200
Var: student (Intercept)	0.06	0.04
Var: Residual	0.06	0.04
Var: student occas		0.00
Cov: student (Intercept) occas		-0.00
p < 0.001; p < 0.01; p < 0.05

Test the addition of the random slope for occasions.

ranova(modfullrs)

ANOVA-like table for random-effects: Single term deletions

Model:
gpa ~ occas + job + I(highgpa - mean(highgpa)) + gender + (occas | 
    student)
                           npar   logLik    AIC    LRT Df Pr(>Chisq)    
<none>                        9  -85.059 188.12                         
occas in (occas | student)    7 -141.380 296.76 112.64  2  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(modfull, modfullrs) # equivalent

Data: gpa2long
Models:
modfull: gpa ~ occas + job + highgpa + gender + (1 | student)
modfullrs: gpa ~ occas + job + I(highgpa - mean(highgpa)) + gender + (occas | 
modfullrs:     student)
          npar    AIC    BIC   logLik deviance  Chisq Df Pr(>Chisq)    
modfull      7 296.76 332.39 -141.380   282.76                         
modfullrs    9 188.12 233.93  -85.059   170.12 112.64  2  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Cross-level Interaction occassion*gender

modfullrsx <- lmer(gpa ~ occas + job + I(highgpa - mean(highgpa)) + gender + 
                                          gender:occas + 
                     (occas |student), 
                  gpa2long, REML = FALSE)

htmlreg(list(modfull, modfullrs, modfullrsx), digits = 3)

Statistical models
	Model 1	Model 2	Model 3
(Intercept)	2.641^***	2.822^***	2.845^***
	(0.098)	(0.045)	(0.045)
occas	0.102^***	0.103^***	0.088^***
	(0.004)	(0.006)	(0.008)
job	-0.172^***	-0.131^***	-0.132^***
	(0.018)	(0.017)	(0.017)
highgpa	0.085^**
	(0.028)
genderfemale	0.147^***	0.116^***	0.076^*
	(0.033)	(0.031)	(0.035)
highgpa - mean(highgpa)		0.089^***	0.089^***
		(0.026)	(0.026)
occas:genderfemale			0.030^**
			(0.011)
AIC	296.760	188.117	182.971
BIC	332.390	233.928	233.872
Log Likelihood	-141.380	-85.059	-81.486
Num. obs.	1200	1200	1200
Num. groups: student	200	200	200
Var: student (Intercept)	0.045	0.038	0.038
Var: Residual	0.055	0.042	0.042
Var: student occas		0.004	0.004
Cov: student (Intercept) occas		-0.002	-0.002
p < 0.001; p < 0.01; p < 0.05

Get correlation between variance components

VarCorr(modfullrsx)

 Groups   Name        Std.Dev. Corr  
 student  (Intercept) 0.194449       
          occas       0.060118 -0.188
 Residual             0.203851

Figure 5.4, p. 82.

interact_plot(modfullrsx, occas, gender)

sim_slopes(modfullrsx, occas, gender)

Warning: Johnson-Neyman intervals are not available for factor moderators.

SIMPLE SLOPES ANALYSIS 

Slope of occas when gender = female: 

  Est.   S.E.   t val.      p
------ ------ -------- ------
  0.12   0.01    15.53   0.00

Slope of occas when gender = male: 

  Est.   S.E.   t val.      p
------ ------ -------- ------
  0.09   0.01    11.05   0.00

What about smaller sample size (at level 2)?

# create a smaller data set by sampling individuals from the full data:
n <- 12 # level 2 sample size. You can change this to play around with n.
# Index of sampled students
set.seed(20200702) # set the random seed for reproducibility
sampidx <- as.numeric(sample(unique(gpa2long$student), replace = FALSE, size = n)) 

# Create smaller sample data
gpa2long12 <- gpa2long[gpa2long$student %in% sampidx, ]

References

Hox, J.J., Moerbeek, M. and Van de Schoot, R. (2018), Multilevel Analysis: Techniques and applications, Sage, New York, NY.