7  Dataframes in R

Within R, we can form structured data with multiple columns. These are called dataframes and are comparable to the way a spreadsheet may look. Each dataframe can hold multiple columns, and each column is a vector. The column types may vary, so a numeric vector, a logical vector, and a character vector can all sit in the same dataframe. Dataframes are worth studying closely because they are the predominant data structure in R; almost all of the datasets that we encounter will be formatted as a dataframe.

7.1 Creating Dataframes

Within a dataframe, each row represents an observation. Every vector (column) has the same length as all of the others, resulting in a “rectangular” dataframe. We can make one ourselves by passing the vectors to the data.frame() function. An example of this can be seen below:

x <- 12:16
y <- 80:84
df <- data.frame(x,y)
df
   x  y
1 12 80
2 13 81
3 14 82
4 15 83
5 16 84
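To see why the columns need to line up, here is a minimal sketch of what happens when the lengths cannot be reconciled. The vector names a and b are made up for illustration, and tryCatch() is used only so that the script keeps running after the error.

```r
a <- 1:5
b <- 1:3   # hypothetical vector with a mismatched length

# data.frame() raises an error because 5 rows and 3 rows cannot be combined;
# tryCatch() captures the error message as a character string instead
result <- tryCatch(
  data.frame(a, b),
  error = function(e) conditionMessage(e)
)
result
```

Note that data.frame() will recycle a shorter vector when its length divides evenly into the longer one (for example, lengths 4 and 2), so a length mismatch does not always produce an error.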

Another example can be seen below. In creating this dataframe we will use the sample() function to randomly generate each vector of length 10. When you run the same code you might get different values, since they are generated randomly. To display the dataframe I will use the head() function, which shows just the first few observations:

names <- sample(c("John", "Paul", "George", "Ringo"), 10, replace=TRUE)
ages <- sample(18:25, 10, replace=TRUE)
major <- sample(c("Undeclared", "Math", "Cyber", "Data", "Comp-Sci"), 10, replace=TRUE)
commuter <- sample(c(TRUE, FALSE), 10, replace=TRUE)
df <- data.frame(names, ages, major, commuter)
head(df, 5) # Retrieves the first 5 rows (6 by default)
   names ages major commuter
1  Ringo   19 Cyber    FALSE
2 George   18 Cyber     TRUE
3   Paul   20 Cyber    FALSE
4  Ringo   23 Cyber     TRUE
5   Paul   22 Cyber     TRUE
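If we want the random values to be the same every time the code runs, one option is to call set.seed() before sampling. The sketch below assumes an arbitrary seed of 42 and the made-up names first_draw and second_draw; any integer seed would work.

```r
set.seed(42)  # 42 is an arbitrary seed chosen for illustration
first_draw <- sample(18:25, 10, replace = TRUE)

set.seed(42)  # resetting the same seed replays the same random draws
second_draw <- sample(18:25, 10, replace = TRUE)

# TRUE: both vectors contain exactly the same values
identical(first_draw, second_draw)
```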

7.2 The Structure of Dataframes

One nice thing about dataframes is that each column retains the class of the vector it was built from. We can see the structure of the dataframe by passing the dataframe into the str() function.

str(df)
'data.frame':   10 obs. of  4 variables:
 $ names   : chr  "Ringo" "George" "Paul" "Ringo" ...
 $ ages    : int  19 18 20 23 22 25 18 22 21 22
 $ major   : chr  "Cyber" "Cyber" "Cyber" "Cyber" ...
 $ commuter: logi  FALSE TRUE FALSE TRUE TRUE FALSE ...
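When we only need the class of each column rather than the full str() summary, one possible approach is to apply class() to every column with sapply(). The small dataframe people below is a made-up example, not the df used in this chapter.

```r
people <- data.frame(
  names    = c("John", "Paul", "George"),
  ages     = c(20L, 22L, 19L),
  commuter = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE  # keep text as character in older R versions too
)

# Apply class() to every column; the result is a named character vector
col_classes <- sapply(people, class)
col_classes
```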

We can also bind vectors together using the cbind() function, but we should be wary of this, as the values will be converted to the “lowest” common type if the vector types differ. The function name cbind() stands for column bind; it attaches a new column to the end of another vector or dataframe. When cbind() is given plain vectors it builds a matrix, and a matrix can hold only one type, so mixing character vectors with anything else turns everything into characters. An example of this can be seen below, where all of the columns are turned into characters:

df2 <- data.frame(cbind(names, ages, major, commuter))
head(df2,5)
   names ages major commuter
1  Ringo   19 Cyber    FALSE
2 George   18 Cyber     TRUE
3   Paul   20 Cyber    FALSE
4  Ringo   23 Cyber     TRUE
5   Paul   22 Cyber     TRUE
str(df2)
'data.frame':   10 obs. of  4 variables:
 $ names   : chr  "Ringo" "George" "Paul" "Ringo" ...
 $ ages    : chr  "19" "18" "20" "23" ...
 $ major   : chr  "Cyber" "Cyber" "Cyber" "Cyber" ...
 $ commuter: chr  "FALSE" "TRUE" "FALSE" "TRUE" ...
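The sketch below illustrates the conversion on a smaller scale and shows one way to avoid it: if the first argument to cbind() is already a dataframe, each column keeps its own type. The names first, age, m, and people are made up for this illustration.

```r
first <- c("John", "Paul")
age   <- c(20L, 22L)

# cbind() on plain vectors builds a matrix, and a matrix holds one type only
m <- cbind(first, age)
is.matrix(m)   # TRUE
typeof(m)      # "character": the integers were converted to text

# Starting from a dataframe keeps each column's own type
people <- data.frame(first, stringsAsFactors = FALSE)
people <- cbind(people, age = age)
sapply(people, class)
```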

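We can also attach a new row to the end of a dataframe using the rbind() function. Once again though, we should be wary, as this can alter the column types. In the example below, c() turns every value into a character string, so the ages and commuter columns become character columns once the row is added.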

data_to_be_added <- c("Pete", 24, "Percussion", FALSE)
df_added <- rbind(df, data_to_be_added)
tail(df_added, 5) # Retrieves the last 5 observations of the dataframe
   names ages      major commuter
7  Ringo   18       Math     TRUE
8   John   22      Cyber    FALSE
9  Ringo   21      Cyber     TRUE
10 Ringo   22       Math    FALSE
11  Pete   24 Percussion    FALSE
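One way to add a row without changing the column types is to build the new row as a one-row dataframe instead of a vector. This is a minimal sketch on a made-up dataframe called people; the names as_vector, new_row, and as_frame are also just for illustration.

```r
people <- data.frame(
  names    = c("John", "Paul"),
  ages     = c(20L, 22L),
  commuter = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

# c() with mixed values yields a character vector, so every column it
# touches is converted to character
as_vector <- rbind(people, c("Pete", 24, FALSE))
sapply(as_vector, class)

# A one-row dataframe carries a proper type for each column
new_row <- data.frame(names = "Pete", ages = 24L, commuter = FALSE,
                      stringsAsFactors = FALSE)
as_frame <- rbind(people, new_row)
sapply(as_frame, class)
```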

7.3 Dataframe Properties

We can learn about the dataframe’s properties using a few different functions. One function called dim() will tell us about the number of rows and columns in the dataset (the dimension). Meanwhile, nrow() will tell us the number of rows, and ncol() will tell us the number of columns in the dataframe.

dim(df)
[1] 10  4
nrow(df)
[1] 10
ncol(df)
[1] 4
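The three functions are consistent with one another, which the short sketch below checks on a made-up dataframe called grid. It also shows a common surprise: length() counts columns, not rows.

```r
grid <- data.frame(x = 1:6, y = 11:16, z = 21:26)

dims <- dim(grid)        # rows first, then columns
dims[1] == nrow(grid)    # TRUE
dims[2] == ncol(grid)    # TRUE

# A dataframe is a list of columns, so length() counts columns, not rows
length(grid)
```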

7.4 Column Names for Dataframes

Sometimes it is helpful to give the columns of a dataframe different names from the vectors they came from. We can do this when we create the dataframe, as in the example below:

x <- 0:9
y <- 10:19
z <- 20:29

df3 <- data.frame("singles"=x, "tens"=y, "twenties"=z)
head(df3)
  singles tens twenties
1       0   10       20
2       1   11       21
3       2   12       22
4       3   13       23
5       4   14       24
6       5   15       25

If we do not name the columns when we create the dataframe, or we already have a dataframe in R and want to rename its columns, we can do this using the colnames() function.

df3 <- data.frame(x,y,z)
colnames(df3)
[1] "x" "y" "z"
colnames(df3) <- c("singles", "tens", "twenties")
head(df3,4)
  singles tens twenties
1       0   10       20
2       1   11       21
3       2   12       22
4       3   13       23
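Because colnames() returns an ordinary character vector, we can also rename a single column by indexing into it. The sketch below uses a made-up dataframe called df_demo.

```r
df_demo <- data.frame(x = 0:2, y = 10:12, z = 20:22)

# Rename only the second column by position
colnames(df_demo)[2] <- "tens"

# Rename by matching the current name, which avoids counting positions
colnames(df_demo)[colnames(df_demo) == "z"] <- "twenties"

colnames(df_demo)
```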

7.5 Dataframe Index Selection

Accessing dataframe elements will be similar to accessing vector elements except now we are dealing with a 2-dimensional object in R. Thus, we will need to specify both dimensions (row and column). You will hear me say over and over again in class: “Dataframes index by Row comma Column”. Within our index selection brackets, we will need to have the comma present. If we put nothing before the comma it will indicate all rows, while nothing after the comma will indicate all columns.

If we wish to display the element in the second row and first column, we call our dataframe and then put ‘[2,1]’ in the index-selection brackets. If we want to display the 3rd observation (3rd row), we put ‘[3,]’ in the index-selection brackets. To display all of the elements in the 2nd column, we put ‘[,2]’ in the index-selection brackets. An example of this can be seen below:

head(df, 4)
   names ages major commuter
1  Ringo   19 Cyber    FALSE
2 George   18 Cyber     TRUE
3   Paul   20 Cyber    FALSE
4  Ringo   23 Cyber     TRUE
df[2,1] # Element in the 2nd row and 1st column
[1] "George"
df[3,] # Elements in the 3rd row
  names ages major commuter
3  Paul   20 Cyber    FALSE
df[,2] # Elements in the 2nd column
 [1] 19 18 20 23 22 25 18 22 21 22
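Notice that selecting a single row returned a dataframe, while selecting a single column returned a plain vector. If we want a single column to stay a dataframe, one option is the drop = FALSE argument. The sketch below uses a made-up dataframe called people.

```r
people <- data.frame(
  names = c("John", "Paul", "George"),
  ages  = c(20L, 22L, 19L),
  stringsAsFactors = FALSE
)

as_vec   <- people[, 2]                # plain integer vector
as_frame <- people[, 2, drop = FALSE]  # one-column dataframe

is.data.frame(as_vec)       # FALSE
is.data.frame(as_frame)     # TRUE
is.data.frame(people[3, ])  # TRUE: a single row stays a dataframe
```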

We can also retrieve the elements in a column by using a dollar sign ($) followed by the column name. The resulting output will be a vector, not a dataframe, and will contain all of the values in that column.

df$names
 [1] "Ringo"  "George" "Paul"   "Ringo"  "Paul"   "George" "Ringo"  "John"  
 [9] "Ringo"  "Ringo" 
df$ages
 [1] 19 18 20 23 22 25 18 22 21 22
df$major
 [1] "Cyber" "Cyber" "Cyber" "Cyber" "Cyber" "Cyber" "Math"  "Cyber" "Cyber"
[10] "Math" 
df$commuter
 [1] FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE
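The dollar sign needs the column name typed out literally. When the column name is stored in a variable, double brackets do the same job. The sketch below uses a made-up dataframe people and a hypothetical variable called wanted.

```r
people <- data.frame(
  names = c("John", "Paul", "George"),
  ages  = c(20L, 22L, 19L),
  stringsAsFactors = FALSE
)

wanted <- "ages"   # hypothetical variable holding a column name

people[[wanted]]   # looks up the column named by the variable
people$wanted      # NULL: $ looks for a column literally called "wanted"

# TRUE: both forms return the same vector
identical(people[[wanted]], people$ages)
```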

We can retrieve multiple rows and/or columns by passing a vector into our index-selection brackets. Since a column retrieved with the dollar sign is itself a vector, we can also apply ordinary vector indexing to that result. Additionally, we can pass a logical vector into the index-selection brackets to display only the rows that meet certain criteria. Examples of these can be seen below:

df[3:5, c(1,3)] # Rows 3 through 5 and columns 1 and 3 
  names major
3  Paul Cyber
4 Ringo Cyber
5  Paul Cyber
df$ages
 [1] 19 18 20 23 22 25 18 22 21 22
df$ages[2:4]
[1] 18 20 23
df$ages <= 21
 [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE
df[df$ages <= 21, ] # Displays just the TRUE rows
   names ages major commuter
1  Ringo   19 Cyber    FALSE
2 George   18 Cyber     TRUE
3   Paul   20 Cyber    FALSE
7  Ringo   18  Math     TRUE
9  Ringo   21 Cyber     TRUE
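Logical vectors can be combined with & (and) and | (or) to filter on more than one condition at a time. This is a minimal sketch on a made-up dataframe called people.

```r
people <- data.frame(
  names    = c("John", "Paul", "George", "Ringo"),
  ages     = c(20L, 22L, 19L, 24L),
  commuter = c(TRUE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# & keeps rows where both conditions are TRUE
young_commuters <- people[people$ages <= 21 & people$commuter, ]

# | keeps rows where at least one condition is TRUE
young_or_commuter <- people[people$ages <= 21 | people$commuter, ]

young_commuters$names     # "John"
young_or_commuter$names   # "John" "George" "Ringo"
```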

7.6 Editing a Dataframe

Finally, we can remove a single row or column from our dataframe, but we want to be very careful when doing this, as we might not be able to reverse the action if we overwrite the original. To remove a row or column, we state our dataframe and then, in the index-selection brackets, put a negative sign in front of the row or column we want to remove. Saving the result under a new name, as below, leaves the original dataframe untouched. Both methods can be seen below:

df4 <- df[-2,] # Everything but the second row
head(df4,3)
  names ages major commuter
1 Ringo   19 Cyber    FALSE
3  Paul   20 Cyber    FALSE
4 Ringo   23 Cyber     TRUE
df4 <- df[,-3] # Everything but the third column
head(df4,3)
   names ages commuter
1  Ringo   19    FALSE
2 George   18     TRUE
3   Paul   20    FALSE
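Counting column positions can be error-prone, so it may be safer to drop a column by name. The sketch below, on a made-up dataframe called people, also confirms that negative indexing returns a copy and leaves the original alone.

```r
people <- data.frame(
  names = c("John", "Paul", "George"),
  ages  = c(20L, 22L, 19L),
  major = c("Math", "Cyber", "Data"),
  stringsAsFactors = FALSE
)

# Negative indexing returns a copy; people itself is unchanged
without_major <- people[, -3]
ncol(people)   # still 3

# Drop a column by name instead of by position
keep    <- !(colnames(people) %in% "major")
by_name <- people[, keep]

# TRUE: both approaches give the same dataframe
identical(without_major, by_name)
```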