8  Grouped and Bivariate Analysis

When manipulating data, it is often useful to select certain observations, order the data in a certain way, generate new variables, or summarize the information. We have previously seen how to filter, arrange, and mutate variables within our dataset using the tools available in the dplyr library. In this lecture, we will see how to summarize the information we are given and calculate statistics based on a variety of groupings, including summaries based on multiple variables (bivariate data), again using the tools in the dplyr library.

  • Use group_by() and summarise() to compute descriptive statistics for each category in a dataset.
  • Use n(), tally(), and count() to count observations within groups and interpret those counts.
  • Explain why ungroup() matters and predict how grouped vs. ungrouped operations change results.
  • Group by multiple variables to create bivariate summaries and use arrange()/filter() to answer comparison questions.
  • Use n_distinct(), the slice() family, ranking functions, and lag()/lead() for more advanced grouped analysis.

8.1 Grouping and Summarizing

Within base R we can break a dataset up by the group each observation belongs to and then calculate some statistic for each group (like the mean, median, standard deviation, etc.). To do this, we use the aggregate() function and list the groups that we want to partition the dataset into before calculating the statistic. We can pass multiple grouping variables into the list() argument to break the dataset up into more specific groupings. An example of this process can be seen below, where we calculate the average home price for houses with 2, 3, 4, 5, or 6 bedrooms.

library(openintro)
aggregate(duke_forest$price, by=list(num_beds=duke_forest$bed), mean)
  num_beds         x
1        2  349250.0
2        3  491650.0
3        4  570982.1
4        5  707500.0
5        6 1250000.0

A similar process can be carried out using the tools available in the dplyr library. The group_by() function will partition the dataset into multiple groups, allowing us to carry out calculations on each group separately. The summarise() function then summarizes (…who would have thought) each group based on what we tell it to do. When the two functions are used together, the output contains one row per group, which is useful when calculating descriptive statistics among groups.

The group_by() and summarise() functions working together allow us to essentially replicate the aggregate() and tapply() functions in base R. Note that the summarize() function will do the same thing, but it is sometimes masked by another library (meaning another library has a function of the same name that "overrides" the dplyr function, depending on the order the libraries are loaded), so summarise() is preferred.

library(dplyr)
duke_forest |>
  group_by(bed) |>
  summarise(mean_price = mean(price)) |>
  ungroup()
# A tibble: 5 × 2
    bed mean_price
  <dbl>      <dbl>
1     2    349250 
2     3    491650 
3     4    570982.
4     5    707500 
5     6   1250000 
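If a masked summarise()/summarize() ever causes trouble, we can always call dplyr's version explicitly with the :: operator. A minimal sketch on a made-up data frame (df, grp, and val are illustrative names, not part of duke_forest):

```r
library(dplyr)

# Toy data frame standing in for a real dataset
df <- data.frame(grp = c("a", "a", "b"), val = c(1, 3, 10))

# dplyr:: guarantees we get dplyr's summarise even if another
# loaded library defines a function with the same name
df |>
  group_by(grp) |>
  dplyr::summarise(mean_val = mean(val)) |>
  ungroup()
```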

8.2 Counting Observations

In the output above we can see that we have replicated the aggregate() function's results. The dplyr approach, however, gives us more flexibility, as we can add more information to the output. For instance, if we want to count the number of observations in each group, we can specify n() within the summarise() function. An example of this can be seen below, where we both count the number of observations in each category and calculate the mean price for each category.

duke_forest |>
  group_by(bed) |>
  summarise(n=n(),
            mean_price = mean(price)) |>
  ungroup()
# A tibble: 5 × 3
    bed     n mean_price
  <dbl> <int>      <dbl>
1     2     4    349250 
2     3    30    491650 
3     4    52    570982.
4     5    11    707500 
5     6     1   1250000 

If we just want to determine how many observations are in each category, the tally() function will do the trick, provided we first group_by() the variables we want to count over. The count() function also provides a quick count and does not require us to group by a variable first.

duke_forest |>
  group_by(bed) |>
  tally() |>
  ungroup()
# A tibble: 5 × 2
    bed     n
  <dbl> <int>
1     2     4
2     3    30
3     4    52
4     5    11
5     6     1
duke_forest |> 
  count(bed)
# A tibble: 5 × 2
    bed     n
  <dbl> <int>
1     2     4
2     3    30
3     4    52
4     5    11
5     6     1
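Both counting functions take useful optional arguments: count() accepts sort = TRUE to order the groups from most to least common, and tally() accepts wt = to sum a weighting variable instead of counting rows. A quick sketch on a made-up data frame (sales, region, and units are illustrative names):

```r
library(dplyr)

sales <- data.frame(region = c("N", "N", "S", "N"),
                    units  = c(2, 5, 1, 3))

# sort = TRUE orders the groups by descending count
sales |>
  count(region, sort = TRUE)

# wt = units sums the units column per group instead of counting rows
sales |>
  group_by(region) |>
  tally(wt = units) |>
  ungroup()
```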

8.3 The Importance of Ungrouping

You may have noticed that any time we group our data by a variable, we ungroup it shortly after. Whenever we group_by() certain variables, any summarizing and/or mutating is done within each portion of the dataset rather than on the whole. So, we should get in the habit of ungrouping the dataset so that any later code is applied to the whole dataset and not just within groups. An example of this can be seen below, where we first group the data by the number of bedrooms and calculate the mean price for each group. We then ungroup the data, run the same calculation, and get the overall mean price.

tapply(duke_forest$price, duke_forest$bed, mean)
        2         3         4         5         6 
 349250.0  491650.0  570982.1  707500.0 1250000.0 
mean(duke_forest$price)
[1] 559898.7
duke_forest |>
  group_by(bed) |>
  mutate(average_grouped = sum(price) / n()) |>
  ungroup() |>
  mutate(average_ungrouped = sum(price) / n()) |>
  distinct(bed, .keep_all = TRUE) |>
  select(bed, average_grouped, average_ungrouped) |>
  arrange(bed)
# A tibble: 5 × 3
    bed average_grouped average_ungrouped
  <dbl>           <dbl>             <dbl>
1     2         349250            559899.
2     3         491650            559899.
3     4         570982.           559899.
4     5         707500            559899.
5     6        1250000            559899.
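When in doubt about whether a pipeline is still grouped, dplyr can report the active grouping directly with group_vars(). A small sketch on a toy data frame (df, g, and x are illustrative names):

```r
library(dplyr)

df <- data.frame(g = c("a", "b"), x = c(1, 2))

grouped <- df |> group_by(g)

group_vars(grouped)            # the data is grouped by "g"
group_vars(ungroup(grouped))   # empty: the grouping has been removed
```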

8.4 Finding Distinct Observations

Before seeing how to count distinct values, we will first manipulate the dataset to create a categorical variable with 4 levels. This could be done with the tools in dplyr, but the base R cut() function is just as easy; note that the middle break points below come from the quartiles of the area variable shown in the summary() output.

summary(duke_forest$area)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1094    2133    2623    2779    3254    6178 
duke_forest$size <- cut(duke_forest$area, 
                        breaks=c(0, 2133, 2623, 3254, 7000),
                        labels=c("small", "medium", "large", "huge"))

table(duke_forest$bed, duke_forest$size)
   
    small medium large huge
  2     3      1     0    0
  3    13     13     3    1
  4     9     10    17   16
  5     0      0     4    7
  6     0      0     0    1

Notice in the table above that the 2-bedroom houses fall into only 2 distinct size categories ("small" and "medium") while the 3-bedroom houses fall into all 4 ("small", "medium", "large", and "huge"). We can use the n_distinct() function to count the distinct values within groups. An example of this can be seen below:

duke_forest |>
  group_by(bed) |>
  summarise(num_size = n_distinct(size)) |>
  ungroup()
# A tibble: 5 × 2
    bed num_size
  <dbl>    <int>
1     2        2
2     3        4
3     4        4
4     5        2
5     6        1
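The n_distinct() function can also take several vectors at once, in which case it counts distinct combinations of values rather than distinct values of a single variable. A sketch on a toy data frame (df, size, and bath here are illustrative, not the duke_forest columns):

```r
library(dplyr)

df <- data.frame(size = c("s", "s", "m", "m"),
                 bath = c(1, 1, 1, 2))

# counts distinct (size, bath) pairs: (s,1), (m,1), (m,2)
df |>
  summarise(combos = n_distinct(size, bath))
```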

8.5 Grouping by Multiple Variables

Much like the aggregate() function, we can group by multiple variables at the same time. For instance, in the code below we group by both the number of bedrooms a house has and the size of the house. This gives us every observed combination of bedrooms and size, and we then calculate the number of observations and the mean price for each combination.

duke_forest |>
  filter(bed %in% c(3, 4, 5)) |>
  group_by(bed, size) |>
  summarise(n=n(), 
            mean_price = mean(price)) |>
  arrange(size, desc(mean_price)) |>
  ungroup()
# A tibble: 10 × 4
     bed size       n mean_price
   <dbl> <fct>  <int>      <dbl>
 1     4 small      9    417056.
 2     3 small     13    360577.
 3     4 medium    10    563900 
 4     3 medium    13    536577.
 5     4 large     17    596703.
 6     5 large      4    596250 
 7     3 large      3    522167.
 8     3 huge       1   1520000 
 9     5 huge       7    771071.
10     4 huge      16    634664.

The slice() family of functions (slice(), slice_head(), slice_tail(), slice_min(), slice_max(), and slice_sample()) allows us to select specific rows of the dataframe, or \(n\) rows within each grouping. The slice_max() function will display just the top values for each category. Notice in the code below that we group by the size variable right before choosing to display just the top 2 mean prices for each size grouping.

duke_forest |>
  filter(bed %in% c(3, 4, 5)) |>
  group_by(bed, size) |>
  summarise(n=n(), 
            mean_price = mean(price)) |>
  arrange(size, desc(mean_price)) |>
  group_by(size) |>
  slice_max(n=2, order_by = mean_price) |>
  ungroup()
# A tibble: 8 × 4
    bed size       n mean_price
  <dbl> <fct>  <int>      <dbl>
1     4 small      9    417056.
2     3 small     13    360577.
3     4 medium    10    563900 
4     3 medium    13    536577.
5     4 large     17    596703.
6     5 large      4    596250 
7     3 huge       1   1520000 
8     5 huge       7    771071.
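The slice_min() function works the same way in the other direction, keeping the rows with the smallest values within each group. A sketch on a toy grouped data frame (df, grp, and val are illustrative names):

```r
library(dplyr)

df <- data.frame(grp = c("a", "a", "a", "b", "b"),
                 val = c(5, 1, 3, 9, 2))

# keep the single smallest val within each grp
df |>
  group_by(grp) |>
  slice_min(val, n = 1) |>
  ungroup()
```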

In the example below we can see that the slice() function allows us to quickly look at the second through fifth observations. We first display the first six observations using the head() function to demonstrate the idea.

duke_forest |>
  select(price, bed, bath, area, size) |> 
  head()
# A tibble: 6 × 5
    price   bed  bath  area size 
    <dbl> <dbl> <dbl> <dbl> <fct>
1 1520000     3     4  6040 huge 
2 1030000     5     4  4475 huge 
3  420000     2     3  1745 small
4  680000     4     3  2091 small
5  428500     4     3  1772 small
6  456000     3     3  1950 small
duke_forest |>
  select(price, bed, bath, area, size) |>
  slice(2:5)
# A tibble: 4 × 5
    price   bed  bath  area size 
    <dbl> <dbl> <dbl> <dbl> <fct>
1 1030000     5     4  4475 huge 
2  420000     2     3  1745 small
3  680000     4     3  2091 small
4  428500     4     3  1772 small

8.6 Creating a 5 Number Summary

The summarise() function is powerful in that it allows us to quickly calculate statistics for each indicated group. If we wanted to create a 5-number summary for each group in base R, we would need to use index-selection brackets and the summary() function to carry out the task. As a quick reminder, the 5-number summary helps show the spread and skew of a dataset by displaying the minimum value, the first quartile, the second quartile (median), the third quartile, and the maximum value. The first quartile is the value that 25% of the data falls below, the second quartile is the value that 50% of the data falls below, and the third quartile is the value that 75% of the data falls below.

summary(duke_forest$price[duke_forest$size=="small"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 105000  285000  385000  374320  456000  680000 
summary(duke_forest$price[duke_forest$size=="medium"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 290000  478750  537500  545604  603750  801000 
summary(duke_forest$price[duke_forest$size=="large"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 400000  519250  578000  587310  642500  900000 
summary(duke_forest$price[duke_forest$size=="huge"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  95000  567000  650000  732885  863000 1520000 
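As an aside, the same five numbers can be pulled from a vector in a single base R call by passing a vector of probabilities to quantile(). A sketch on a small made-up vector:

```r
x <- c(2, 4, 6, 8, 10)

# minimum, Q1, median, Q3, and maximum in one call
quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))
```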

Using group_by() and summarise() together lets us produce the same information with all of the output in one table; we just build it piece-by-piece within the summarise() call. The code below uses the min(), quantile(), median(), and max() functions to create the 5-number summary, but other functions could be used as well, such as first() which returns the first observation, last() which returns the last observation, and nth() which returns the nth observation. These last three are most useful if you order the dataset first; otherwise they simply return observations in whatever order the data happens to be stored.

duke_forest |>
  group_by(size) |>
  summarise(n = n(),
            min = min(price),
            q1 = quantile(price, 0.25),
            q2 = median(price), 
            q3 = quantile(price, 0.75),
            max = max(price)) |>
  ungroup()
# A tibble: 4 × 7
  size       n    min     q1     q2     q3     max
  <fct>  <int>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
1 small     25 105000 285000 385000 456000  680000
2 medium    24 290000 478750 537500 603750  801000
3 large     24 400000 519250 578000 642500  900000
4 huge      25  95000 567000 650000 863000 1520000
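The first(), last(), and nth() helpers mentioned above can be sketched on a toy data frame; note the arrange() with .by_group = TRUE, which orders the values within each group before the positional functions are applied (df, grp, and val are illustrative names):

```r
library(dplyr)

df <- data.frame(grp = c("a", "a", "a", "b", "b"),
                 val = c(3, 1, 2, 10, 7))

# order within each group first so first()/last() are meaningful
df |>
  group_by(grp) |>
  arrange(val, .by_group = TRUE) |>
  summarise(lowest  = first(val),
            second  = nth(val, 2),
            highest = last(val)) |>
  ungroup()
```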

8.7 Incorporating other dplyr functions

We have already somewhat seen this idea, but it is possible to combine other dplyr functions while grouping the data. For instance, in the code below we first group the data by the number of bedrooms and the size of the house. We then calculate the number of observations within each group along with the mean price. After that, we use the filter() function to display only the groupings that have at least 10 observations. Additional functions could be chained together as well (such as mutate(), select(), arrange(), etc.).

duke_forest |>
  group_by(bed, size) |>
  summarise(n=n(), 
            mean_price = mean(price)) |>
  filter(n >= 10) |>
  ungroup()
# A tibble: 5 × 4
    bed size       n mean_price
  <dbl> <fct>  <int>      <dbl>
1     3 small     13    360577.
2     3 medium    13    536577.
3     4 medium    10    563900 
4     4 large     17    596703.
5     4 huge      16    634664.

8.8 Dealing with Rankings

Sometimes it is beneficial to rank the observations based on some provided variable. The rank() function used in conjunction with the mutate() function will assign a rank to each observation with ties receiving the average position. For instance, in the output below 3 observations are tied for the lowest \(n\) value, so they receive the rank of 2 because the average of the ordered values \(1\), \(2\), and \(3\) is 2.

duke_forest |>
  group_by(bed, size) |>
  summarise(n=n()) |>
  ungroup() |>
  mutate(ordering = rank(n)) |>
  arrange(ordering)
# A tibble: 13 × 4
     bed size       n ordering
   <dbl> <fct>  <int>    <dbl>
 1     2 medium     1      2  
 2     3 huge       1      2  
 3     6 huge       1      2  
 4     2 small      3      4.5
 5     3 large      3      4.5
 6     5 large      4      6  
 7     5 huge       7      7  
 8     4 small      9      8  
 9     4 medium    10      9  
10     3 small     13     10.5
11     3 medium    13     10.5
12     4 huge      16     12  
13     4 large     17     13  

There is also a function called min_rank() which works similarly but instead assigns every tie the same (smallest) rank. We can see the difference in the code below:

duke_forest |>
  group_by(bed, size) |>
  summarise(n=n()) |>
  ungroup() |>
  mutate(ordering = min_rank(n)) |>
  arrange(ordering) |>
  head()
# A tibble: 6 × 4
    bed size       n ordering
  <dbl> <fct>  <int>    <int>
1     2 medium     1        1
2     3 huge       1        1
3     6 huge       1        1
4     2 small      3        4
5     3 large      3        4
6     5 large      4        6
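A third option, dense_rank(), also gives ties the same rank but leaves no gaps in the ranks that follow. Comparing the three on a small made-up vector:

```r
library(dplyr)

x <- c(1, 1, 1, 3, 3, 4)

rank(x)        # ties get the average position: 2 2 2 4.5 4.5 6
min_rank(x)    # ties get the smallest position, gaps remain: 1 1 1 4 4 6
dense_rank(x)  # ties get the same rank, no gaps: 1 1 1 2 2 3
```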

8.9 Cumulative Calculations

There are a variety of cumulative calculations that can be used within the mutate() function. These will return the cumulative statistic based on each group (if we first use the group_by() function). To understand this idea, we can look at the cummin() function which will find the cumulative minimum value. This means it will go through the list and mark down the smallest value it has seen so far for each group (3 then 2 then 1). The cumsum() function will calculate the sum of each value it has seen so far in each group (3, then \(3 + 2\), then \(3 + 2 + 1\)). The cumprod() will work similarly but instead multiply the values seen (3, then \(3 \times 2\), then \(3 \times 2 \times 1\)). An example of this idea can be seen below.

duke_forest |>
  group_by(bed) |>
  mutate(min_so_far = cummin(bath),
         baths_sum = cumsum(bath),
         baths_product = cumprod(bath)) |>
  select(bed, bath, min_so_far, baths_sum, baths_product) |>
  arrange(bed) |>
  ungroup() |>
  head(8)
# A tibble: 8 × 5
    bed  bath min_so_far baths_sum baths_product
  <dbl> <dbl>      <dbl>     <dbl>         <dbl>
1     2     3          3         3             3
2     2     2          2         5             6
3     2     1          1         6             6
4     2     2          1         8            12
5     3     4          4         4             4
6     3     3          3         7            12
7     3     2          2         9            24
8     3     3          2        12            72
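Two more members of this family are worth knowing: base R's cummax() tracks the largest value seen so far, and dplyr's cummean() tracks the running mean. A sketch on a small made-up vector:

```r
library(dplyr)

x <- c(3, 1, 4, 2)

cummax(x)   # largest value seen so far: 3 3 4 4
cummean(x)  # mean of all values seen so far: 3, 2, 8/3, 2.5
```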

8.10 Time-Series Manipulation

The last example we will look at deals with time-series data, meaning each observation is measured at a set time interval. An example of this can be seen below, where a single chick's weight is measured at times 0, 2, 4, 6, and so on. This allows us to see how the weight changes over time.

ChickWeight$Chick <- as.numeric(as.character((ChickWeight$Chick)))
head(ChickWeight, 5)
  weight Time Chick Diet
1     42    0     1    1
2     51    2     1    1
3     59    4     1    1
4     64    6     1    1
5     76    8     1    1

The lag() and lead() functions allow us to access the previous value or the next value, respectively. The value NA is returned when there is no previous or future value available. This lets us determine how much an observation changes over time. In the example below we can see how much each chick's weight changes between readings.

ChickWeight |>
  group_by(Chick) |>
  arrange(Time, .by_group = TRUE) |>
  mutate(lag_weight = lag(weight),
         difference = weight - lag_weight) |>
  slice_head(n = 5) |>
  select(Chick, Time, weight, lag_weight, difference) |>
  ungroup()
# A tibble: 247 × 5
   Chick  Time weight lag_weight difference
   <dbl> <dbl>  <dbl>      <dbl>      <dbl>
 1     1     0     42         NA         NA
 2     1     2     51         42          9
 3     1     4     59         51          8
 4     1     6     64         59          5
 5     1     8     76         64         12
 6     2     0     40         NA         NA
 7     2     2     49         40          9
 8     2     4     58         49          9
 9     2     6     72         58         14
10     2     8     84         72         12
# ℹ 237 more rows
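The lead() function mirrors this in the forward direction, answering "how much will this observation change by the next reading?" A sketch on a single made-up chick-style series (weights, time, and the numbers are illustrative):

```r
library(dplyr)

weights <- data.frame(time = c(0, 2, 4), weight = c(40, 49, 58))

# lead() pulls each row's *next* weight; the final row has no
# future reading, so it gets NA
weights |>
  mutate(next_weight   = lead(weight),
         upcoming_gain = next_weight - weight)
```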