tidyr
There are multiple ways to represent the same data in a data frame. These different ways are needed for different types of analytic procedures, and may vary across software implementations. A common example is when dealing with longitudinal data, where you often have the same variable measured at different time points. For example, let’s say you have measured college students gpa at six time points. This data could be structured so each row represents a student, and there are separate columns for each time point with the measured gpa. This is often referred to as the wide format. You could also structure the data so that all the gpa values were in one column, and there was an additional column the contained the six time points. This is often referred to as long format, and in our example, each student would have six rows, one for each time point.
In this tutorial, I will show you how to go from wide to long format using the tidyr
package. There are other ways to do this in R other than using this package. Base R has a function reshape()
that can be used, but after 10 years of using R I still have to look up how to use it each time I try. There is also a way to do it with the data.table
package which is useful, particularly if you have a large data frame. But in my opinion, tidyr
is the easiest to understand, so I will show that here.
the tidyr
package is all about making data tidy. Go to https://tidyr.tidyverse.org/ to learn more. It is part if the tidyverse, which is a set of popular packages for data science in R. You can learn more about the tidyverse here: https://www.tidyverse.org/.
There are four functions in the tidyr
package most relevant for reshaping data from wide to long or from long to wide. They are gather()
, and its complement spread()
, as well as separate()
and its complement unite()
. I will demonstrate the first three here.
I will demonstrate converting data from wide to long using the gpa data from the Hox, Moerbeek, and van de Schoot 2018) text. You can find a description of this data in an appendix to that book or here under “CHAPTER 5-GPA DATA”. Reading the description will help you understand what I am doing.
You can import the data from the website above, or import the csv version from the TutoR github site, as I do below.
gpadat <- read.csv("../data/gpa.csv")
After creating a data frame called gpadat
we can look at its structure.
str(gpadat)
'data.frame': 200 obs. of 16 variables:
$ student : int 1 2 3 4 5 6 7 8 9 10 ...
$ sex : int 1 0 1 0 0 1 0 1 0 0 ...
$ highgpa : num 2.8 2.5 2.5 3.8 3.1 2.9 2.3 3.9 2 2.8 ...
$ gpa1 : num 2.3 2.2 2.4 2.5 2.8 2.5 2.4 2.8 2.8 2.8 ...
$ gpa2 : num 2.1 2.5 2.9 2.7 2.8 2.4 2.4 2.8 2.7 2.8 ...
$ gpa3 : num 3 2.6 3 2.4 2.8 2.4 2.8 3.1 2.7 3 ...
$ gpa4 : num 3 2.6 2.8 2.7 3 2.3 2.6 3.3 3.1 2.7 ...
$ gpa5 : num 3 3 3.3 2.9 2.9 2.7 3 3.3 3.1 3 ...
$ gpa6 : num 3.3 2.8 3.4 2.7 3.1 2.8 3 3.4 3.5 3 ...
$ job1 : int 2 2 2 3 2 2 3 2 2 2 ...
$ job2 : int 2 3 2 2 2 3 2 2 2 2 ...
$ job3 : int 2 2 2 2 2 3 3 2 3 2 ...
$ job4 : int 2 2 3 2 2 2 2 2 2 3 ...
$ job5 : int 2 2 2 2 2 3 2 2 2 2 ...
$ job6 : int 2 2 2 2 2 3 2 2 2 2 ...
$ admitted: int 1 0 1 0 1 0 1 1 NA 1 ...
You can see that the data has two variables that are measure at each of six time points, gpa and job, the latter being the student’s job status at the time of measurement. These are considered time variant variables. The other variables are time invariant as where not measure at each time point because they do not vary over the time of the study. This data is in wide format. We want to get it in long format, so that we would have one variable for gpa and another for job, plus a new variable time that captures the six time points.
gather()
The gather()
function takes variables represented in multiple columns and gathers them into one column of values and a column of keys to identify which original column they came from. For now, lets see how we would use this function if we only had one time varying variable. I will remove the job variables from our data to demonstrate this.
gpaonly <- subset(x = gpadat , select = -c(job1, job2, job3, job4, job5, job6))
str(gpaonly)
'data.frame': 200 obs. of 10 variables:
$ student : int 1 2 3 4 5 6 7 8 9 10 ...
$ sex : int 1 0 1 0 0 1 0 1 0 0 ...
$ highgpa : num 2.8 2.5 2.5 3.8 3.1 2.9 2.3 3.9 2 2.8 ...
$ gpa1 : num 2.3 2.2 2.4 2.5 2.8 2.5 2.4 2.8 2.8 2.8 ...
$ gpa2 : num 2.1 2.5 2.9 2.7 2.8 2.4 2.4 2.8 2.7 2.8 ...
$ gpa3 : num 3 2.6 3 2.4 2.8 2.4 2.8 3.1 2.7 3 ...
$ gpa4 : num 3 2.6 2.8 2.7 3 2.3 2.6 3.3 3.1 2.7 ...
$ gpa5 : num 3 3 3.3 2.9 2.9 2.7 3 3.3 3.1 3 ...
$ gpa6 : num 3.3 2.8 3.4 2.7 3.1 2.8 3 3.4 3.5 3 ...
$ admitted: int 1 0 1 0 1 0 1 1 NA 1 ...
You can see we only have one variable that varys across time. To go from wide to long first gather the columns:
library(tidyr)
gpaonlylong <- gather(data = gpaonly, key = time, value = gpa, gpa1:gpa6)
headTail(gpaonlylong)
student sex highgpa admitted time gpa
1 1 1 2.8 1 gpa1 2.3
2 2 0 2.5 0 gpa1 2.2
3 3 1 2.5 1 gpa1 2.4
4 4 0 3.8 0 gpa1 2.5
... ... ... ... ... <NA> ...
1197 197 1 2.1 <NA> gpa6 3.8
1198 198 0 4 <NA> gpa6 3.4
1199 199 1 2.3 1 gpa6 2.8
1200 200 0 3.4 <NA> gpa6 3.4
This code creates a column we named in the key
argument, which we called time, that captures the variable names for each column gathered. It also creates a column from the value
argument, which we named gpa that contains the gpa values from those six columns.
If we only had one time varying variable we could just use the gather()
and separate()
functions.
library(tidyr)
gpadatlong <- gather(gpadat, key = key, value = val, gpa1:job6)
headTail(gpadatlong, 10)
student sex highgpa admitted key val
1 1 1 2.8 1 gpa1 2.3
2 2 0 2.5 0 gpa1 2.2
3 3 1 2.5 1 gpa1 2.4
4 4 0 3.8 0 gpa1 2.5
5 5 0 3.1 1 gpa1 2.8
6 6 1 2.9 0 gpa1 2.5
7 7 0 2.3 1 gpa1 2.4
8 8 1 3.9 1 gpa1 2.8
9 9 0 2 <NA> gpa1 2.8
10 10 0 2.8 1 gpa1 2.8
... ... ... ... ... <NA> ...
2397 197 1 2.1 <NA> job6 2
2398 198 0 4 <NA> job6 2
2399 199 1 2.3 1 job6 3
2400 200 0 3.4 <NA> job6 1
separate()
gpadatlong <- separate(gpadatlong,
col = key,
into = c("variable", "time"),
sep = 3)
headTail(gpadatlong)
student sex highgpa admitted variable time val
1 1 1 2.8 1 gpa 1 2.3
2 2 0 2.5 0 gpa 1 2.2
3 3 1 2.5 1 gpa 1 2.4
4 4 0 3.8 0 gpa 1 2.5
... ... ... ... ... <NA> <NA> ...
2397 197 1 2.1 <NA> job 6 2
2398 198 0 4 <NA> job 6 2
2399 199 1 2.3 1 job 6 3
2400 200 0 3.4 <NA> job 6 1
spread()
gpadatlong <- spread(gpadatlong, key = variable, value = val)
headTail(gpadatlong, 7)
student sex highgpa admitted time gpa job
1 1 1 2.8 1 1 2.3 2
2 1 1 2.8 1 2 2.1 2
3 1 1 2.8 1 3 3 2
4 1 1 2.8 1 4 3 2
5 1 1 2.8 1 5 3 2
6 1 1 2.8 1 6 3.3 2
7 2 0 2.5 0 1 2.2 2
... ... ... ... ... <NA> ... ...
1197 200 0 3.4 <NA> 3 3.4 2
1198 200 0 3.4 <NA> 4 3.5 2
1199 200 0 3.4 <NA> 5 3.3 1
1200 200 0 3.4 <NA> 6 3.4 1
%>%
)Below is a code snippet that does all that was just covered using the pipe operator.
gpadatlong <- gpadat %>%
gather(key = key, value = value, gpa1:job6) %>%
separate(col = key, into = c("variable", "time"), sep = 3) %>%
spread(key = variable, value = value)
library(ggplot2)
Attaching package: 'ggplot2'
The following objects are masked from 'package:psych':
%+%, alpha
ggplot(gpadatlong, aes(x = time, y = jitter(gpa), group = student)) + geom_line(alpha = .4)