7  Dataframes in R

Within R, we can form structured data with multiple columns. These structures are called dataframes and are comparable to a spreadsheet. A dataframe can hold multiple columns, and each column is a vector. The column types may vary: a numeric vector, a logical vector, and a character vector can all sit in the same dataframe. Dataframes matter because they are the predominant data structure in R; almost all of the datasets that we encounter will be formatted as a dataframe.

  • Understand what a dataframe is and how to create dataframes from vectors using the data.frame() function.
  • Interpret the structure of a dataframe using str() along with how to access elements of a dataframe using both bracket notation and the dollar sign syntax.
  • Modify dataframes by renaming columns, removing rows or columns, and appending new data.
  • Use logical indexing to filter dataframe rows that meet specific criteria.

7.1 Creating Dataframes

Within a dataframe, each row represents an observation. Additionally, each vector (column) has the same length as all of the others, resulting in a “rectangular” dataframe. We can make one ourselves by passing the vectors to the data.frame() function. An example of this can be seen below:

x <- 12:16
y <- 80:84
df <- data.frame(x,y)
df
   x  y
1 12 80
2 13 81
3 14 82
4 15 83
5 16 84
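
The equal-length requirement can be checked directly. The sketch below uses illustrative vector names (short_x and long_y, which are not part of the chapter's examples) and wraps the call in tryCatch() so that the script keeps running when data.frame() refuses vectors whose lengths do not fit together:

```r
# Illustrative vectors (hypothetical names, not from the examples above)
short_x <- 1:3
long_y  <- 1:5

# data.frame() signals an error here because 5 is not a multiple of 3,
# so the shorter vector cannot be recycled to fill the longer one
result <- tryCatch(
  data.frame(short_x, long_y),
  error = function(e) "lengths do not match"
)
result
```

Keep in mind that a shorter vector whose length divides evenly into the longer one is silently recycled rather than rejected, so it is worth checking lengths with length() before building a dataframe.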

Another example can be seen below. In creating this dataframe we will use the sample() function to randomly generate vectors of length 10. When you run the same code you might get different values, since they are generated randomly. To display it I will use the head() function, which displays just the first few observations:

names <- sample(c("John", "Paul", "George", "Ringo"), 10, replace=TRUE)
ages <- sample(18:25, 10, replace=TRUE)
major <- sample(c("Undeclared", "Math", "Cyber", "Data", "Comp-Sci"), 10, replace=TRUE)
commuter <- sample(c(TRUE, FALSE), 10, replace=TRUE)
df <- data.frame(names, ages, major, commuter)
head(df, 5) # Retrieves the first 5 rows (6 by default)
   names ages    major commuter
1  Ringo   22     Math     TRUE
2   Paul   22     Data     TRUE
3   John   18 Comp-Sci    FALSE
4 George   23    Cyber     TRUE
5  Ringo   21     Math    FALSE
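
If we want the random draws to be repeatable, one common approach is to fix the random seed with set.seed() before sampling. The sketch below is only an illustration: the seed value 123 is an arbitrary choice, and the names first_draw and second_draw are made up for this example.

```r
# Fixing the seed makes the "random" draws repeatable
set.seed(123)  # 123 is an arbitrary choice
first_draw <- sample(18:25, 10, replace = TRUE)

set.seed(123)  # resetting to the same seed replays the same draws
second_draw <- sample(18:25, 10, replace = TRUE)

# Both vectors contain exactly the same values
identical(first_draw, second_draw)
```
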
Try it Out

Emmit is organizing a small conference and wants to keep track of attendees. Make a dataframe which looks exactly like the following list:

    name age      role registered
1   Adam  30  attendee       TRUE
2   Brad  49   speaker       TRUE
3 Claire  32 organizer       TRUE
4 Donald  28  attendee      FALSE
5 Elaine  27  attendee      FALSE
6  Fiona  33 organizer       TRUE

7.2 The Structure of Dataframes

One nice thing about dataframes is that the individual columns will retain the same class type that the vector has. We can see the structure of the dataframe by passing the dataframe into the str() function.

str(df)
'data.frame':   10 obs. of  4 variables:
 $ names   : chr  "Ringo" "Paul" "John" "George" ...
 $ ages    : int  22 22 18 23 21 21 18 25 18 20
 $ major   : chr  "Math" "Data" "Comp-Sci" "Cyber" ...
 $ commuter: logi  TRUE TRUE FALSE TRUE FALSE TRUE ...
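
When we only need the class of each column rather than the full str() summary, one option is to apply class() to every column with sapply(). This is a minimal sketch on a small made-up dataframe (the name students and its values are assumptions for illustration):

```r
# Small illustrative dataframe (values are made up for this sketch)
students <- data.frame(
  names    = c("John", "Paul", "George"),
  ages     = c(18L, 22L, 23L),
  commuter = c(FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE  # keeps text as character in R versions before 4.0
)

# One class per column, returned as a named character vector
col_classes <- sapply(students, class)
col_classes
```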

We can also bind vectors together using the cbind() function, but we should be wary about this: if the vectors have different types, they are all coerced to a single common type (the “lowest” type that can hold every value, which is often character). The function name cbind() stands for column bind, and it attaches a new column on the end of another vector or dataframe. An example of this can be seen below, where all of the columns are turned into characters.

df2 <- data.frame(cbind(names, ages, major, commuter))
head(df2,5)
   names ages    major commuter
1  Ringo   22     Math     TRUE
2   Paul   22     Data     TRUE
3   John   18 Comp-Sci    FALSE
4 George   23    Cyber     TRUE
5  Ringo   21     Math    FALSE
str(df2)
'data.frame':   10 obs. of  4 variables:
 $ names   : chr  "Ringo" "Paul" "John" "George" ...
 $ ages    : chr  "22" "22" "18" "23" ...
 $ major   : chr  "Math" "Data" "Comp-Sci" "Cyber" ...
 $ commuter: chr  "TRUE" "TRUE" "FALSE" "TRUE" ...
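
The coercion above happens because cbind() is given only plain vectors. If the first argument is already a dataframe, cbind() keeps each new column's own type. The sketch below illustrates this with made-up vectors; the names student_names, student_ages, is_commuter, and roster are assumptions for this example.

```r
# Illustrative vectors (hypothetical names and values)
student_names <- c("John", "Paul", "George")
student_ages  <- c(18L, 22L, 23L)
is_commuter   <- c(FALSE, TRUE, TRUE)

# Start from a dataframe, then bind the remaining columns onto it
roster <- data.frame(names = student_names, stringsAsFactors = FALSE)
roster <- cbind(roster, ages = student_ages, commuter = is_commuter)

# Each column keeps its original type
str(roster)
```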

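We can also attach a new row on the end of a dataframe using the rbind() function. Once again though, we should be wary about this as we can potentially alter the column types. In the example below, c() stores every value as a character, so binding that vector converts the ages and commuter columns to character as well, even though the printed output looks unchanged.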

data_to_be_added <- c("Pete", 24, "Percussion", FALSE)
df_added <- rbind(df, data_to_be_added)
tail(df_added, 5) # Retrieves the last 5 observations of the dataframe
    names ages      major commuter
7  George   18 Undeclared    FALSE
8    Paul   25       Math     TRUE
9    John   18   Comp-Sci     TRUE
10   John   20      Cyber     TRUE
11   Pete   24 Percussion    FALSE
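
One way to avoid changing the column types is to build the new observation as its own one-row dataframe, with the same column names, before calling rbind(). This is a sketch on a small made-up dataframe (band, new_row, and band_added are hypothetical names used only here):

```r
# Small illustrative dataframe (values are made up for this sketch)
band <- data.frame(
  names    = c("John", "Paul"),
  ages     = c(18L, 25L),
  major    = c("Cyber", "Math"),
  commuter = c(TRUE, TRUE),
  stringsAsFactors = FALSE
)

# The new observation as a one-row dataframe with matching column names
new_row <- data.frame(
  names = "Pete", ages = 24L, major = "Percussion", commuter = FALSE,
  stringsAsFactors = FALSE
)

# Binding two dataframes keeps ages as integer and commuter as logical
band_added <- rbind(band, new_row)
str(band_added)
```
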
Try it Out

Using the dataframe you previously made for Emmit, add a new row to the list of attendees: Gracie who is 39, a speaker, and not registered.

7.3 Dataframe Properties

We can learn about the dataframe’s properties using a few different functions. One function called dim() will tell us about the number of rows and columns in the dataset (the dimension). Meanwhile, nrow() will tell us the number of rows, and ncol() will tell us the number of columns in the dataframe.

dim(df)
[1] 10  4
nrow(df)
[1] 10
ncol(df)
[1] 4
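
Since dim() returns a vector of length two, we can index into it just like any other vector. The short sketch below uses a made-up dataframe called scores to show how its pieces line up with nrow() and ncol():

```r
# Illustrative dataframe with 6 rows and 3 columns
scores <- data.frame(x = 1:6, y = 7:12, z = 13:18)

size <- dim(scores)
size[1]  # number of rows, same as nrow(scores)
size[2]  # number of columns, same as ncol(scores)

# dim() is simply the row count followed by the column count
identical(size, c(nrow(scores), ncol(scores)))
```
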
Try it Out

Using the dataframe you previously made for Emmit with Gracie now added, determine the number of rows and columns the dataframe has.

7.4 Column Names for Dataframes

Sometimes it is helpful to rename the columns of a dataframe if we do not like the current names of the vectors. We can do this when we create the dataframe, as in the example below:

x <- 0:9
y <- 10:19
z <- 20:29

df3 <- data.frame("singles"=x, "tens"=y, "twenties"=z)
head(df3)
  singles tens twenties
1       0   10       20
2       1   11       21
3       2   12       22
4       3   13       23
5       4   14       24
6       5   15       25

If we did not name the columns when creating the dataframe, or we already have a dataframe in R whose columns we want to rename, we can use the colnames() function.

df3 <- data.frame(x,y,z)
colnames(df3)
[1] "x" "y" "z"
colnames(df3) <- c("singles", "tens", "twenties")
head(df3,4)
  singles tens twenties
1       0   10       20
2       1   11       21
3       2   12       22
4       3   13       23
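
Because colnames() returns a character vector, we can also rename just one column by indexing into it, either by position or by matching the current name. A minimal sketch, using a made-up dataframe called df_digits:

```r
# Illustrative dataframe with default column names x, y, z
df_digits <- data.frame(x = 0:9, y = 10:19, z = 20:29)

# Rename only the second column, selected by position
colnames(df_digits)[2] <- "tens"

# Rename a column by matching its current name instead of its position
colnames(df_digits)[colnames(df_digits) == "z"] <- "twenties"

colnames(df_digits)
```
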
Try it Out

Emmit decides that the dataframe for the conference should have different header names. Alter the dataframe so the column names are now: “participant_name”, “participant_age”, “conference_role”, and “paid”.

7.5 Dataframe Index Selection

Accessing dataframe elements will be similar to accessing vector elements except now we are dealing with a 2-dimensional object in R. Thus, we will need to specify both dimensions (row and column). You will hear me say over and over again in class: “Dataframes index by Row comma Column”. Within our index selection brackets, we will need to have the comma present. If we put nothing before the comma it will indicate all rows, while nothing after the comma will indicate all columns.

If we wish to display the element in the second row and first column, we call our dataframe and then write ‘[2,1]’ in the index-selection brackets. If we want to display the 3rd observation (3rd row), we write ‘[3,]’ in our index-selection brackets. To display all of the elements in the 2nd column, we write ‘[,2]’ in our index-selection brackets. An example of this can be seen below:

head(df, 4)
   names ages    major commuter
1  Ringo   22     Math     TRUE
2   Paul   22     Data     TRUE
3   John   18 Comp-Sci    FALSE
4 George   23    Cyber     TRUE
df[2,1] # Element in the 2nd row and 1st column
[1] "Paul"
df[3,] # Elements in the 3rd row
  names ages    major commuter
3  John   18 Comp-Sci    FALSE
df[,2] # Elements in the 2nd column
 [1] 22 22 18 23 21 21 18 25 18 20

We can also retrieve the elements of a column by using a dollar sign ($) followed by the column name. The resulting output will be a vector, not a dataframe, and will contain all of the values in that column.

df$names
 [1] "Ringo"  "Paul"   "John"   "George" "Ringo"  "George" "George" "Paul"  
 [9] "John"   "John"  
df$ages
 [1] 22 22 18 23 21 21 18 25 18 20
df$major
 [1] "Math"       "Data"       "Comp-Sci"   "Cyber"      "Math"      
 [6] "Data"       "Undeclared" "Math"       "Comp-Sci"   "Cyber"     
df$commuter
 [1]  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
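
The difference between getting a vector back and getting a dataframe back can be seen by comparing a few ways of selecting the same column. This is a sketch on a small made-up dataframe (band and the result names are assumptions for illustration):

```r
# Small illustrative dataframe (values are made up for this sketch)
band <- data.frame(
  names = c("Ringo", "Paul", "John"),
  ages  = c(22L, 22L, 18L),
  stringsAsFactors = FALSE
)

# These two forms return a plain vector
ages_vec  <- band$ages
ages_vec2 <- band[["ages"]]

# These two forms return a one-column dataframe
ages_df  <- band["ages"]
ages_df2 <- band[, "ages", drop = FALSE]

class(ages_vec)
class(ages_df)
```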

We can retrieve multiple rows and/or columns by passing a vector into our index-selection brackets. Because a column retrieved with the dollar sign is itself a vector, we can also index into that result with its own brackets. Additionally, we can pass a logical vector into the index-selection brackets to display only the rows that meet certain criteria. Examples of these can be seen below:

df[3:5, c(1,3)] # Rows 3 through 5 and columns 1 and 3 
   names    major
3   John Comp-Sci
4 George    Cyber
5  Ringo     Math
df$ages
 [1] 22 22 18 23 21 21 18 25 18 20
df$ages[2:4]
[1] 22 18 23
df$ages <= 21
 [1] FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
df[df$ages <= 21, ] # Displays just the TRUE rows
    names ages      major commuter
3    John   18   Comp-Sci    FALSE
5   Ringo   21       Math    FALSE
6  George   21       Data     TRUE
7  George   18 Undeclared    FALSE
9    John   18   Comp-Sci     TRUE
10   John   20      Cyber     TRUE
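
Logical vectors can be combined before they are used as a filter: & keeps rows where both conditions are TRUE, | keeps rows where at least one is TRUE, and ! flips TRUE and FALSE. The sketch below uses a small made-up dataframe (band and the result names are hypothetical) so the results are easy to check by eye:

```r
# Small illustrative dataframe (values are made up for this sketch)
band <- data.frame(
  names    = c("Ringo", "Paul", "John", "George"),
  ages     = c(22L, 25L, 18L, 21L),
  commuter = c(TRUE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Rows where BOTH conditions hold: 21 or younger AND a commuter
young_commuters <- band[band$ages <= 21 & band$commuter, ]

# Rows where AT LEAST ONE condition holds
young_or_commuter <- band[band$ages <= 21 | band$commuter, ]

# Rows where the condition is FALSE
non_commuters <- band[!band$commuter, ]

young_commuters
```
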
Try it Out

Emmit would like to display some key information about the conference attendees. First he would like to display all of the people who are speaking, then all of the people who are under the age of 35, and finally all of the people who are attendees and have paid the registration fee.

7.6 Editing a Dataframe

Finally, we can remove a single row or column from our dataframe, but we want to be very careful when doing this, as we might not be able to reverse the action if we overwrite the original. To remove a row or column, we put a negative sign in front of its index in the index-selection brackets and assign the result to a name. Assigning back to the same name overwrites the dataframe, while a new name (df4 below) leaves the original intact. Both row and column removal can be seen below:

df4 <- df[-2,] # Everything but the second row
head(df4,3)
   names ages    major commuter
1  Ringo   22     Math     TRUE
3   John   18 Comp-Sci    FALSE
4 George   23    Cyber     TRUE
df4 <- df[,-3] # Everything but the third column
head(df4,3)
  names ages commuter
1 Ringo   22     TRUE
2  Paul   22     TRUE
3  John   18    FALSE
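
Negative indices depend on knowing the position of the row or column. As an alternative, we can keep everything that does not match a name or a value by using a logical vector. This is a sketch on a small made-up dataframe (band, no_major, and no_paul are hypothetical names):

```r
# Small illustrative dataframe (values are made up for this sketch)
band <- data.frame(
  names = c("Ringo", "Paul", "John"),
  ages  = c(22L, 22L, 18L),
  major = c("Math", "Data", "Comp-Sci"),
  stringsAsFactors = FALSE
)

# Keep every column whose name is not "major"
# (if only one column remained, drop = FALSE would be needed to keep a dataframe)
no_major <- band[, colnames(band) != "major"]

# Keep every row whose name is not "Paul"
no_paul <- band[band$names != "Paul", ]

colnames(no_major)
no_paul
```
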
Try it Out

Emmit decides that he does not need to record the attendees’ ages. He also found out that Adam is withdrawing from the conference. Help him remove the age column from the dataframe along with Adam’s information.

  • Create a dataframe in R pertaining to cities in New York:
    1. A vector called cities with “Albany”, “Buffalo”, “Syracuse”, “Rochester”, “Ithaca” all included
    2. A vector called population with the values 97500, 255000, 142000, 210000, 32000
    3. A logical vector called capital with Albany set to TRUE and the rest being FALSE
    4. Display the structure of the dataframe.
    5. Add a new column using cbind() called temperature with the values: 48,46,47,49,45.
    6. Use rbind() to add the row: “Binghamton”, 47000, FALSE, 44.
    7. Explain what happens to the column types after adding the row.
  • Create a dataframe in R pertaining to states on the East Coast:
    1. A vector containing the following states: “NY”, “MD”, “PA”, “NJ”, “VA”
    2. A vector containing the following median income values: 75000, 88000, 68000, 85000, 72000
    3. A logical vector indicating whether each state is coastal (only Pennsylvania is not on the coast)
    4. Display all of the coastal states
    5. Display all of the rows where the median income is greater than 80000
    6. Display all of the rows where the state ends in a vowel or has a median income less than 77000