When manipulating data, it is often useful to select certain observations, order the data in a certain way, generate new variables, or summarize the information. We have previously seen how to filter, arrange, and mutate variables within our dataset using the tools available in the dplyr library. In this lecture, we will see how to summarize the information we are given and calculate statistics based on a variety of groupings, including summaries that involve multiple variables at once.
By the end of this lecture, you should be able to use:

- group_by() and summarise() to compute descriptive statistics for each category in a dataset.
- n(), tally(), and count() to count observations within groups and interpret those counts.
- ungroup() where it matters, and predict how grouped vs. ungrouped operations change results.
- arrange()/filter() to answer comparison questions.
- n_distinct(), the slice() family, ranking functions, and lag()/lead() for more advanced grouped analysis.

Within base R we can break a dataset up by which group a variable is in and then calculate some statistic for it (like the mean, median, standard deviation, etc.). To do this, we use the aggregate() function and list the groups that we want to partition the dataset into before calculating the statistic. We can pass multiple grouping variables into the list() argument to break the dataset up into more specific groupings. An example of this process can be seen below, where we calculate the average home price for houses with 2, 3, 4, 5, or 6 bedrooms.
library(openintro)
aggregate(duke_forest$price, by=list(num_beds=duke_forest$bed), mean)
  num_beds         x
1 2 349250.0
2 3 491650.0
3 4 570982.1
4 5 707500.0
5 6 1250000.0
A similar process can be carried out using the tools available in the dplyr library. The group_by() function partitions the dataset into multiple groups, allowing us to carry out calculations on each group separately. The summarise() function then summarizes (…who would have thought) each group based on what we tell it to do. Used together, the two functions output one row for each group, which is useful when calculating descriptive statistics among groups.
The group_by() and summarise() functions working together will allow us to essentially replicate the aggregate() and apply() functions in base R. Note that the summarize() function will do the same thing, but it is sometimes masked by another library (meaning another library has a function of the same name and it sometimes “overrides” the dplyr function depending on the order the libraries are loaded), so summarise() is preferred.
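To see the masking issue concretely, here is a minimal sketch on a small made-up data frame (the data and variable names are illustrative, not from duke_forest). The `package::function` prefix is a general R idiom that guarantees the dplyr version of a function runs even if another loaded package masks a function of the same name.

```r
library(dplyr)

# Hypothetical toy data for illustration
df <- data.frame(g = c("a", "a", "b"),
                 x = c(1, 2, 3))

# dplyr:: makes it explicit which package's summarise() we are calling,
# so load order of other libraries cannot change the result.
df |>
  group_by(g) |>
  dplyr::summarise(mean_x = mean(x)) |>
  ungroup()
```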
library(dplyr)
duke_forest |>
group_by(bed) |>
summarise(mean_price = mean(price)) |>
ungroup()
# A tibble: 5 × 2
bed mean_price
<dbl> <dbl>
1 2 349250
2 3 491650
3 4 570982.
4 5 707500
5 6 1250000
In the output above we can see that we have replicated the aggregate() function’s results. The way we have done it in dplyr allows us to have more flexibility though, as we can add more information to the output. For instance, if we want to count the number of observations in each group then within the summarise() function we can specify n(). An example of this can be seen below where we both count the number of observations for each category and calculate the mean price for each category.
duke_forest |>
group_by(bed) |>
summarise(n=n(),
mean_price = mean(price)) |>
ungroup()
# A tibble: 5 × 3
bed n mean_price
<dbl> <int> <dbl>
1 2 4 349250
2 3 30 491650
3 4 52 570982.
4 5 11 707500
5 6 1 1250000
If one wishes to just determine how many observations are in each category then the tally() function will do the trick if we first group_by() the variables we want to count the number of observations for. The count() function is also a possibility for a quick count and does not require us to first group by a specific variable.
duke_forest |>
group_by(bed) |>
tally() |>
ungroup()
# A tibble: 5 × 2
bed n
<dbl> <int>
1 2 4
2 3 30
3 4 52
4 5 11
5 6 1
duke_forest |>
count(bed)
# A tibble: 5 × 2
bed n
<dbl> <int>
1 2 4
2 3 30
3 4 52
4 5 11
5 6 1
You may have noticed that anytime we group our data by a variable, we ungroup it shortly after. Whenever we group_by() certain variables, any summarizing and/or mutating is done within each portion of the dataset rather than on the whole. So, we should get in the habit of ungrouping the dataset so that any other code we run is applied to the whole dataset and not just within groups. An example of this can be seen below, where we first group the data by the number of bedrooms and calculate the mean price for each group. We then ungroup the data, run the same calculation, and obtain the overall mean price.
tapply(duke_forest$price, duke_forest$bed, mean)
        2         3         4         5         6
349250.0 491650.0 570982.1 707500.0 1250000.0
mean(duke_forest$price)
[1] 559898.7
duke_forest |>
group_by(bed) |>
mutate(average_grouped = sum(price) / n()) |>
ungroup() |>
mutate(average_ungrouped = sum(price) / n()) |>
distinct(bed, .keep_all = TRUE) |>
select(bed, average_grouped, average_ungrouped) |>
arrange(bed)
# A tibble: 5 × 3
bed average_grouped average_ungrouped
<dbl> <dbl> <dbl>
1 2 349250 559899.
2 3 491650 559899.
3 4 570982. 559899.
4 5 707500 559899.
5 6 1250000 559899.
To observe how we can find the number of distinct observations, we will first manipulate the dataset to create a categorical variable with 4 levels. This could be done using the tools in dplyr, but using the cut() function is just as easy.
summary(duke_forest$area)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
1094 2133 2623 2779 3254 6178
duke_forest$size <- cut(duke_forest$area,
breaks=c(0, 2133, 2623, 3254, 7000),
labels=c("small", "medium", "large", "huge"))
table(duke_forest$bed, duke_forest$size)
small medium large huge
2 3 1 0 0
3 13 13 3 1
4 9 10 17 16
5 0 0 4 7
6 0 0 0 1
Notice in the table above that the 2-bedroom houses only fall into 2 distinct categories ("small" and "medium" size) while the 3-bedroom houses fall into 4 distinct categories ("small", "medium", "large", and "huge" size). We can use the n_distinct() function to count the distinct values within groups. An example of this can be seen below:
duke_forest |>
group_by(bed) |>
summarise(num_size = n_distinct(size)) |>
ungroup()
# A tibble: 5 × 2
bed num_size
<dbl> <int>
1 2 2
2 3 4
3 4 4
4 5 2
5 6 1
Much like the aggregate() function, we can group by multiple variables at the same time. For instance, in the code below we group by both the number of bedrooms a house has and the size of the house. This gives us every possible combination of bedroom count and size, and then calculates the number of observations and the mean price for each group.
duke_forest |>
filter(bed %in% c(3, 4, 5)) |>
group_by(bed, size) |>
summarise(n=n(),
mean_price = mean(price)) |>
arrange(size, desc(mean_price)) |>
ungroup()
# A tibble: 10 × 4
bed size n mean_price
<dbl> <fct> <int> <dbl>
1 4 small 9 417056.
2 3 small 13 360577.
3 4 medium 10 563900
4 3 medium 13 536577.
5 4 large 17 596703.
6 5 large 4 596250
7 3 large 3 522167.
8 3 huge 1 1520000
9 5 huge 7 771071.
10 4 huge 16 634664.
The slice() family of functions (slice_head(), slice_tail(), slice_max(), slice_min(), and slice_sample()) allows us to select specific rows in the dataframe, or \(n\) rows within each group. The slice_max() function displays just the top values for each category. Notice in the code below that we group by the size variable right before choosing to display just the top 2 mean prices for each size grouping.
duke_forest |>
filter(bed %in% c(3, 4, 5)) |>
group_by(bed, size) |>
summarise(n=n(),
mean_price = mean(price)) |>
arrange(size, desc(mean_price)) |>
group_by(size) |>
slice_max(n=2, order_by = mean_price) |>
ungroup()
# A tibble: 8 × 4
bed size n mean_price
<dbl> <fct> <int> <dbl>
1 4 small 9 417056.
2 3 small 13 360577.
3 4 medium 10 563900
4 3 medium 13 536577.
5 4 large 17 596703.
6 5 large 4 596250
7 3 huge 1 1520000
8 5 huge 7 771071.
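The other members of the slice() family work the same way within groups. Below is a small sketch on a made-up data frame (the homes data here is illustrative, not duke_forest) showing slice_head(), slice_min(), and slice_sample() applied per group.

```r
library(dplyr)

# Hypothetical toy data for illustration
homes <- data.frame(size  = c("small", "small", "large", "large", "large"),
                    price = c(100, 200, 300, 400, 500))

# slice_head(): the first n rows within each group
homes |>
  group_by(size) |>
  slice_head(n = 1) |>
  ungroup()

# slice_min(): the rows with the smallest value of a variable in each group
homes |>
  group_by(size) |>
  slice_min(price, n = 1) |>
  ungroup()

# slice_sample(): n randomly chosen rows per group
homes |>
  group_by(size) |>
  slice_sample(n = 1) |>
  ungroup()
```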
In the example below we can see that the slice() function allows us to quickly look at the second through fifth observations. We first display the first six observations using the head() function to provide a reference point.
duke_forest |>
select(price, bed, bath, area, size) |>
head()
# A tibble: 6 × 5
price bed bath area size
<dbl> <dbl> <dbl> <dbl> <fct>
1 1520000 3 4 6040 huge
2 1030000 5 4 4475 huge
3 420000 2 3 1745 small
4 680000 4 3 2091 small
5 428500 4 3 1772 small
6 456000 3 3 1950 small
duke_forest |>
select(price, bed, bath, area, size) |>
slice(2:5)
# A tibble: 4 × 5
price bed bath area size
<dbl> <dbl> <dbl> <dbl> <fct>
1 1030000 5 4 4475 huge
2 420000 2 3 1745 small
3 680000 4 3 2091 small
4 428500 4 3 1772 small
The summarise() function is powerful in that it allows us to quickly calculate statistics for each indicated group. If we wanted to create a 5-number summary for each group in base R then we would need to use index-selection brackets and the summary function to carry out the task. As a quick reminder, the 5-number summary helps show the spread and skew of a dataset by displaying the minimum value, the first quartile, the second quartile (median), the third quartile, and the maximum value. The first quartile is the value where 25% of the data is less than it, the second quartile is where 50% of the data is less than it, and the third quartile is where 75% of the data is less than it.
summary(duke_forest$price[duke_forest$size=="small"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
105000 285000 385000 374320 456000 680000
summary(duke_forest$price[duke_forest$size=="medium"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
290000 478750 537500 545604 603750 801000
summary(duke_forest$price[duke_forest$size=="large"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
400000 519250 578000 587310 642500 900000
summary(duke_forest$price[duke_forest$size=="huge"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
95000 567000 650000 732885 863000 1520000
Using the group_by() function and the summarise() function allows us to do this with all of the output together. We just have to take it piece-by-piece within the summarise() function. The code below uses the min(), quantile(), median(), and max() functions to create the 5-number summary, but other functions could be used as well, such as first() which shows the first observation, last() which shows the last observation, and nth() which shows the nth observation. These functions are most useful if you order the dataset first; otherwise they simply return whichever observation happens to be first, last, or nth in the current row order.
duke_forest |>
group_by(size) |>
summarise(n = n(),
min = min(price),
q1 = quantile(price, 0.25),
q2 = median(price),
q3 = quantile(price, 0.75),
max = max(price)) |>
ungroup()
# A tibble: 4 × 7
size n min q1 q2 q3 max
<fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 small 25 105000 285000 385000 456000 680000
2 medium 24 290000 478750 537500 603750 801000
3 large 24 400000 519250 578000 642500 900000
4 huge 25 95000 567000 650000 863000 1520000
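To illustrate first(), last(), and nth() within groups, here is a minimal sketch on a made-up data frame (illustrative data, not duke_forest). Sorting within each group first is what makes these functions meaningful: after arranging by price, they return the cheapest, most expensive, and second-cheapest price per group.

```r
library(dplyr)

# Hypothetical toy data for illustration
homes <- data.frame(size  = c("small", "small", "small", "large", "large"),
                    price = c(300, 100, 200, 500, 400))

homes |>
  group_by(size) |>
  arrange(price, .by_group = TRUE) |>   # sort within each group first
  summarise(cheapest = first(price),    # smallest price after sorting
            priciest = last(price),     # largest price after sorting
            second   = nth(price, 2)) |># second-smallest price
  ungroup()
```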
We have already somewhat seen this phenomenon, but it is possible to combine other dplyr functions while grouping the data. For instance, in the code below we first group the data based on the number of bedrooms and the size of the house. We then calculate the number of observations within each group and calculate the mean price. After that, we use the filter() function to only display the groupings that have at least 10 observations present. Additional functions could be chained together as well (such as mutate(), select(), arrange(), etc.).
duke_forest |>
group_by(bed, size) |>
summarise(n=n(),
mean_price = mean(price)) |>
filter(n >= 10) |>
ungroup()
# A tibble: 5 × 4
bed size n mean_price
<dbl> <fct> <int> <dbl>
1 3 small 13 360577.
2 3 medium 13 536577.
3 4 medium 10 563900
4 4 large 17 596703.
5 4 huge 16 634664.
Sometimes it is beneficial to rank the observations based on some provided variable. The rank() function used in conjunction with the mutate() function will assign a rank to each observation with ties receiving the average position. For instance, in the output below 3 observations are tied for the lowest \(n\) value, so they receive the rank of 2 because the average of the ordered values \(1\), \(2\), and \(3\) is 2.
duke_forest |>
group_by(bed, size) |>
summarise(n=n()) |>
ungroup() |>
mutate(ordering = rank(n)) |>
arrange(ordering)
# A tibble: 13 × 4
bed size n ordering
<dbl> <fct> <int> <dbl>
1 2 medium 1 2
2 3 huge 1 2
3 6 huge 1 2
4 2 small 3 4.5
5 3 large 3 4.5
6 5 large 4 6
7 5 huge 7 7
8 4 small 9 8
9 4 medium 10 9
10 3 small 13 10.5
11 3 medium 13 10.5
12 4 huge 16 12
13 4 large 17 13
There is also a function called min_rank() which works similarly but instead assigns every tie the same (smallest) rank. We can see the difference in the code below:
duke_forest |>
group_by(bed, size) |>
summarise(n=n()) |>
ungroup() |>
mutate(ordering = min_rank(n)) |>
arrange(ordering) |>
head()
# A tibble: 6 × 4
bed size n ordering
<dbl> <fct> <int> <int>
1 2 medium 1 1
2 3 huge 1 1
3 6 huge 1 1
4 2 small 3 4
5 3 large 3 4
6 5 large 4 6
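dplyr also provides dense_rank(), which behaves like min_rank() but leaves no gaps after ties. A small sketch on a made-up vector contrasts the three ranking functions:

```r
library(dplyr)

# Illustrative vector of group counts
counts <- c(1, 1, 1, 3, 3, 4)

# rank(): ties get the average of their positions
rank(counts)       # 2.0 2.0 2.0 4.5 4.5 6.0

# min_rank(): ties get the smallest of their positions, leaving gaps
min_rank(counts)   # 1 1 1 4 4 6

# dense_rank(): like min_rank(), but the next rank has no gap
dense_rank(counts) # 1 1 1 2 2 3
```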
There are a variety of cumulative calculations that can be used within the mutate() function. These return the cumulative statistic within each group (if we first use the group_by() function). To understand this idea, we can look at the cummin() function, which finds the cumulative minimum value. This means it goes through the list and marks down the smallest value it has seen so far within each group (3, then 2, then 1). The cumsum() function calculates the sum of every value it has seen so far in each group (3, then \(3 + 2\), then \(3 + 2 + 1\)). The cumprod() function works similarly but instead multiplies the values seen (3, then \(3 \times 2\), then \(3 \times 2 \times 1\)). An example of this idea can be seen below.
duke_forest |>
group_by(bed) |>
mutate(min_so_far = cummin(bath),
baths_sum = cumsum(bath),
baths_product = cumprod(bath)) |>
select(bed, bath, min_so_far, baths_sum, baths_product) |>
arrange(bed) |>
ungroup() |>
head(8)
# A tibble: 8 × 5
bed bath min_so_far baths_sum baths_product
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2 3 3 3 3
2 2 2 2 5 6
3 2 1 1 6 6
4 2 2 1 8 12
5 3 4 4 4 4
6 3 3 3 7 12
7 3 2 2 9 24
8 3 3 2 12 72
The last example we will look at deals with time series data, which means each element in the vector is measured at a set time interval. An example of this can be seen below as a single chick’s weight is measured at time 0, 2, 4, 6, etc. This allows us to see how the weight changes over time.
ChickWeight$Chick <- as.numeric(as.character((ChickWeight$Chick)))
head(ChickWeight, 5)
  weight Time Chick Diet
1 42 0 1 1
2 51 2 1 1
3 59 4 1 1
4 64 6 1 1
5 76 8 1 1
The lag() and lead() functions allow us to access the previous value or the next value, respectively. The value NA is returned when there is no previous or next value available. This allows us to determine how much an observation changes over time. In the example below we can see how much each chick's weight changes between readings.
ChickWeight |>
group_by(Chick) |>
arrange(Time, .by_group = TRUE) |>
mutate(lag_weight = lag(weight),
difference = weight - lag_weight) |>
slice_head(n = 5) |>
select(Chick, Time, weight, lag_weight, difference) |>
ungroup()
# A tibble: 247 × 5
Chick Time weight lag_weight difference
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 42 NA NA
2 1 2 51 42 9
3 1 4 59 51 8
4 1 6 64 59 5
5 1 8 76 64 12
6 2 0 40 NA NA
7 2 2 49 40 9
8 2 4 58 49 9
9 2 6 72 58 14
10 2 8 84 72 12
# ℹ 237 more rows
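The lead() function works the same way in the other direction. A short sketch using the same ChickWeight data: next_weight holds the following reading, so upcoming_change shows how much each chick is about to gain (the column names here are my own).

```r
library(dplyr)

# lead() looks one step ahead within each group
ChickWeight |>
  group_by(Chick) |>
  arrange(Time, .by_group = TRUE) |>
  mutate(next_weight     = lead(weight),           # the next reading (NA at the end)
         upcoming_change = next_weight - weight) |># gain before the next reading
  slice_head(n = 3) |>
  select(Chick, Time, weight, next_weight, upcoming_change) |>
  ungroup()
```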