Most of the data that we have encountered so far has been relatively “tidy” data. This means that every variable is contained in exactly one column and each observation is contained in exactly one row. In the real world, this is not always the case though. Sometimes people store data in a way that makes sense for humans to read it but it is difficult for computers to read the data. So, throughout this write-up, we will look at how we might go about manipulating our data to get it into a form that makes it easier for us to work with. In order to accomplish this, we will be relying on a few functions from the tidyr package.
Explain what it means for a dataset to be “tidy’’ (one variable per column, one observation per row) and recognize common signs of untidy data.
Use separate() and unite() to split one column into multiple variables and combine multiple variables into one column.
Use pivot_wider() and pivot_longer() to reshape data.
Perform multi-step tidying workflows that combine pivoting and separating to extract multiple variables from column names.
Combine datasets using left_join(), right_join(), inner_join(), and full_join(), and explain how each join changes rows and missing values.
Two functions that might assist us in cleaning up our dataset or arranging it so that it is in a workable format are the separate() and the unite() functions. Before we dive into how they work though, let’s look at the following dataset below. We have the student’s grade along with the date the grade was earned.
grade dates
1 A 12/29/24
2 C 1/22/25
3 B 2/21/25
4 B 2/26/25
5 A 3/1/25
If we wish to break apart a variable then we can use the separate() function. In the example below, we indicate that we want to break apart the dates column into three new variables; month, day, and year. When we do this, we do need to specify the character we want to separate the values at (in this case the forward slash symbol is used). Additionally, we could pass a numeric value as the separator and it would break the value apart at the given numeric position. An example of the separate() function can be seen below:
grade month day year
1 A 12 29 24
2 C 1 22 25
3 B 2 21 25
4 B 2 26 25
5 A 3 1 25
After we have separated the dates column into multiple pieces, we can then alter the columns however we see fit. In the code below we mutate the year column so that it contains the full year designation (we added the “20” to the front of it). The opposite of the separate() function is the unite() function. This function will allow us to bring together multiple columns. In the code below we create a new variable called dates which combines the month, day, and year columns separated by a dash.
grade dates
1 A 12-29-2024
2 C 1-22-2025
3 B 2-21-2025
4 B 2-26-2025
5 A 3-1-2025
Both the separate() and the unite() functions are helpful when we are dealing with messy data that is not formatted properly, as they allow us to break apart and combine columns.
10.2 Pivoting Data
If the data is not “tidy” then it may be difficult to work with. This may be the case when a column contains more than one variable or if an observation is spread over multiple rows. The two main ways to solve this issue is to do a combination of “pivots” on our dataset. The function pivot_wider() will expand two columns into multiple columns (wider across) while pivot_longer() will collapse several columns into two (longer down). While this may seem confusing at first, with practice it should become second nature. We will utilize these functions when we need to reorganize our data so that it is laid out differently.
In the example below we have a dataframe that informs us about an individual, a color, and a count. We could argue that the data is not currently tidy as the individual is spread across multiple observations.
person color count
1 A Red 12
2 A Blue 23
3 A Green 34
4 B Red 45
5 B Blue 56
6 B Green 67
To fix this issue, we can use the pivot_wider() function. This function will convert column values into column names. Because of this, we need to specify where the column names will come from and where the values being used to fill in the columns will come from. In the example above we can see that the color column has “red”, “blue”, and “green” associated with each person. Because of that, it might be computationally beneficial to have “red” “blue”, and “green” as columns. The code below accomplishes that task and organizes it in a way that might be easier to work with.
# A tibble: 2 × 4
person Red Blue Green
<chr> <dbl> <dbl> <dbl>
1 A 12 23 34
2 B 45 56 67
If pivot_wider() expands a column into multiple then pivot_longer() allows us to do the opposite. With the pivot_longer() function, we can indicate which columns we want to gather up and make into two columns. In the code below we specify that the columns “Red”, “Blue”, and “Green” (using the Red:Green argument) should be gathered up and places in a single column. We then specify the name of the new column holding the column headers and the name of the new column holding the associated values. We can see below how the pivot_longer() function is the inverse of the pivot_wider() function.
# A tibble: 6 × 3
person color count
<chr> <chr> <dbl>
1 A Red 12
2 A Blue 23
3 A Green 34
4 B Red 45
5 B Blue 56
6 B Green 67
Another example of data that is not considered “tidy” would be a table like the one seen below. In the table, we have 4 different cities along with their population and vote totals for three different years. While it is easy for a person to understand and interpret, the table it is not easy for a computer to do so. Additionally, the way it is currently set up makes it difficult to computationally compare values between different cities.
City pop_2020 pop_2022 pop_2024 votes_2020 votes_2022 votes_2024
1 A 234 423 742 111 222 444
2 B 358 432 532 222 298 333
3 C 247 285 401 123 103 200
4 D 237 204 243 94 85 100
In order to try to improve the format of this data we can play around with the pivot_longer() and pivot_wider() functions. To start, we can gather up the 6 columns relating to population and voting totals and place them in a single column, with the numeric values assigned to a different column.
# A tibble: 24 × 3
City Characteristic Count
<chr> <chr> <dbl>
1 A pop_2020 234
2 A pop_2022 423
3 A pop_2024 742
4 A votes_2020 111
5 A votes_2022 222
6 A votes_2024 444
7 B pop_2020 358
8 B pop_2022 432
9 B pop_2024 532
10 B votes_2020 222
# ℹ 14 more rows
In the output above we have a column relating to the population and voting totals but the year is attached to each of those values. Because of this, we can separate the column so that we have the year in its own column.
# A tibble: 24 × 4
City Trait Year Count
<chr> <chr> <chr> <dbl>
1 A pop 2020 234
2 A pop 2022 423
3 A pop 2024 742
4 A votes 2020 111
5 A votes 2022 222
6 A votes 2024 444
7 B pop 2020 358
8 B pop 2022 432
9 B pop 2024 532
10 B votes 2020 222
# ℹ 14 more rows
You may be uncertain what pop stands for, so we can always substitute the value with the word population using the sub() function to make it more clear what the value represents.
# A tibble: 5 × 4
City Trait Year Count
<chr> <chr> <chr> <dbl>
1 A population 2020 234
2 A population 2022 423
3 A population 2024 742
4 A votes 2020 111
5 A votes 2022 222
Because there are multiple variables in a single column, with the Trait column containing data associated with population and votes, we can use the function and create columns with those headers instead. This will allow us to have each population and the vote total attached to each city and year combination.
# A tibble: 12 × 4
City Year population votes
<chr> <chr> <dbl> <dbl>
1 A 2020 234 111
2 A 2022 423 222
3 A 2024 742 444
4 B 2020 358 222
5 B 2022 432 298
6 B 2024 532 333
7 C 2020 247 123
8 C 2022 285 103
9 C 2024 401 200
10 D 2020 237 94
11 D 2022 204 85
12 D 2024 243 100
10.3 Joining Data
Another common way that we may clean and re-format data is by “joining” multiple datasets together. In the example below we have 2 different datasets, and we will see the different ways that we can combine them together. Notice that the two datasets have different columns and observations.
person id age
1 A 123 27
2 B 456 63
3 C 789 42
4 D 555 31
df2
person id gender
1 A 123 Male
2 B 456 Female
3 E 321 Female
The first method that we will discuss is the left_join() function. To use this function we will pass in two different datasets, and it will take the values from the “left-hand” dataset and try to match them up with the values in the “right-hand” dataset. Notice that all of the values from the “right” dataset are not present as they were unable to be matched up to the “left” dataset. Missing values are present for some observations since persons “C” and “D” were not present in the “right” dataset and thus not assigned a “gender”.
left_join(df1, df2)
Joining with `by = join_by(person, id)`
person id age gender
1 A 123 27 Male
2 B 456 63 Female
3 C 789 42 <NA>
4 D 555 31 <NA>
The right_join() function works exactly the same as the left_join() but instead uses the values in the “right” dataframe and tries to match them with the “left” dataset’s values.
right_join(df1,df2)
Joining with `by = join_by(person, id)`
person id age gender
1 A 123 27 Male
2 B 456 63 Female
3 E 321 NA Female
The inner_join() function only contains the observations present in both of the datasets while the full_join() function retains all of the observations and columns.
inner_join(df1,df2)
Joining with `by = join_by(person, id)`
person id age gender
1 A 123 27 Male
2 B 456 63 Female
full_join(df1,df2)
Joining with `by = join_by(person, id)`
person id age gender
1 A 123 27 Male
2 B 456 63 Female
3 C 789 42 <NA>
4 D 555 31 <NA>
5 E 321 NA Female
All of these functions have benefits and downsides, and it is up to you the user to determine which one to use. Occasionally you may want to have all of the information available while other times you might be working with a dataset and want to match your observations with values from a large dataset that contains more observations than you have, and thus a “left-join” might be more helpful.
The examples above deal with joining columns, but we can do something similar with observations as well. In the example below the intersect() function outputs the rows that appear in both datasets while the union() function will display all of the rows that appear in either of the datasets.
df3 <-data.frame(Person =c("A", "B", "C"), Age =c(20, 24, 25))df4 <-data.frame(Person =c("C", "D"), Age =c(25, 26))intersect(df3, df4)