In R, we can store structured data with multiple columns in a dataframe, which is comparable to the way a spreadsheet may look. Each column of a dataframe is a vector, and the column types may vary: a numeric vector, a logical vector, and a character vector can all sit in the same dataframe. It is important to talk about dataframes because they are the predominant data structure in R. Almost all of the datasets that we encounter will be formatted as a dataframe.
Within a dataframe, each row represents an observation. Every vector (column) has the same length as all of the others, which gives the dataframe its “rectangular” shape. We can make one ourselves by passing the vectors to the data.frame() function. An example of this can be seen below:
x <- 12:16
y <- 80:84
df <- data.frame(x,y)
df
   x  y
1 12 80
2 13 81
3 14 82
4 15 83
5 16 84
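As a small illustration of the equal-length requirement, the sketch below uses two made-up vectors, a and b, which are not part of the example above. It assumes we simply want to see what data.frame() does when the lengths cannot be matched up:

```r
# Hypothetical vectors of different lengths, used only for illustration
a <- 1:3
b <- 1:2

# tryCatch() captures the error so the rest of the script keeps running
result <- tryCatch(
  data.frame(a, b),
  error = function(e) "could not build dataframe"
)
result

# Vectors of equal length combine without complaint
ok <- data.frame(a, b = 4:6)
dim(ok)
```

One caveat worth knowing: when the shorter length divides the longer one evenly (for example 2 and 4), R recycles the shorter vector instead of raising an error, so it is safest to check the lengths yourself.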
Another example can be seen below. In creating this dataframe we will use the sample() function to randomly generate vectors of length 10. When you run the same code you might get different values, since they are generated randomly. To display the result I will use the head() function, which shows just the first few observations:
names <- sample(c("John", "Paul", "George", "Ringo"), 10, replace=TRUE)
ages <- sample(18:25, 10, replace=TRUE)
major <- sample(c("Undeclared", "Math", "Cyber", "Data", "Comp-Sci"), 10, replace=TRUE)
commuter <- sample(c(TRUE, FALSE), 10, replace=TRUE)
df <- data.frame(names, ages, major, commuter)
head(df, 5) # Retrieves the first 5 rows (6 by default)
   names ages    major commuter
1   Paul   18     Math     TRUE
2 George   22     Math     TRUE
3   John   23 Comp-Sci     TRUE
4  Ringo   21    Cyber     TRUE
5  Ringo   19 Comp-Sci    FALSE
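If we want the random draws to be repeatable, one common approach is to call set.seed() before sample(). The sketch below uses an arbitrary seed of 123 (an assumption for illustration; any integer works) and checks that reseeding reproduces the same vector:

```r
# Fix the random number generator, then draw 10 ages
set.seed(123)
first_draw <- sample(18:25, 10, replace = TRUE)

# Resetting the same seed replays the same sequence of draws
set.seed(123)
second_draw <- sample(18:25, 10, replace = TRUE)

identical(first_draw, second_draw)
```

The specific values drawn can differ between R versions, so the useful guarantee is only that the same seed gives the same result on the same setup.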
Emmit is organizing a small conference and wants to keep track of attendees. Make a dataframe which looks exactly like the following list:
    name age      role registered
1   Adam  30  attendee       TRUE
2   Brad  49   speaker       TRUE
3 Claire  32 organizer       TRUE
4 Donald  28  attendee      FALSE
5 Elaine  27  attendee      FALSE
6  Fiona  33 organizer       TRUE
One nice thing about dataframes is that each column retains the class of the vector it was built from. We can see the structure of a dataframe by passing it into the str() function.
str(df)
'data.frame': 10 obs. of 4 variables:
$ names : chr "Paul" "George" "John" "Ringo" ...
$ ages : int 18 22 23 21 19 21 19 18 20 25
$ major : chr "Math" "Math" "Comp-Sci" "Cyber" ...
$ commuter: logi TRUE TRUE TRUE TRUE FALSE FALSE ...
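When we only need the column classes rather than the full str() report, a compact alternative is to apply class() to every column. The sketch below uses a small made-up dataframe called people (hypothetical, not the df above), and sets stringsAsFactors = FALSE explicitly so that it behaves the same on older versions of R:

```r
# Small illustrative dataframe (values are made up)
people <- data.frame(
  names    = c("John", "Paul", "George"),
  ages     = c(21L, 19L, 24L),
  commuter = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# class() on a single column
class(people$ages)

# sapply() runs class() on every column and returns a named vector
col_classes <- sapply(people, class)
col_classes
```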
We can also bind vectors together using the cbind() function, but we should be wary of this: if the vectors have different types, they will all be converted to the “lowest” common type. The function name cbind() stands for column bind, and it attaches a new column to the end of another vector or dataframe. An example of this can be seen below, where all of the columns are turned into characters:
df2 <- data.frame(cbind(names, ages, major, commuter))
head(df2,5)
   names ages    major commuter
1   Paul   18     Math     TRUE
2 George   22     Math     TRUE
3   John   23 Comp-Sci     TRUE
4  Ringo   21    Cyber     TRUE
5  Ringo   19 Comp-Sci    FALSE
str(df2)
'data.frame': 10 obs. of 4 variables:
$ names : chr "Paul" "George" "John" "Ringo" ...
$ ages : chr "18" "22" "23" "21" ...
$ major : chr "Math" "Math" "Comp-Sci" "Cyber" ...
$ commuter: chr "TRUE" "TRUE" "TRUE" "TRUE" ...
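The reason for the conversion is that cbind() on plain vectors builds a matrix, and a matrix can hold only one type. The sketch below, using three short made-up vectors, contrasts that with two ways of keeping the original types: passing the vectors directly to data.frame(), or using cbind() when the first object is already a dataframe. The year column is a hypothetical addition for illustration:

```r
# Short illustrative vectors (values are made up)
names    <- c("Paul", "George", "John")
ages     <- c(18L, 22L, 23L)
commuter <- c(TRUE, FALSE, TRUE)

# cbind() on plain vectors returns a matrix, which holds a single type
m <- cbind(names, ages, commuter)
is.matrix(m)
typeof(m)

# Passing the vectors straight to data.frame() keeps each column's type
df_typed <- data.frame(names, ages, commuter, stringsAsFactors = FALSE)
sapply(df_typed, class)

# cbind() with an existing dataframe adds a column and keeps the types
df_more <- cbind(df_typed, year = c(1L, 2L, 3L))
sapply(df_more, class)
```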
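We can also attach a new row to the end of a dataframe using the rbind() function. Once again though, we should be wary of this as we can potentially alter the column types. In the example below, c() turns every value into a character string, so the ages and commuter columns of the result are stored as characters even though they print the same way.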
data_to_be_added <- c("Pete", 24, "Percussion", FALSE)
df_added <- rbind(df, data_to_be_added)
tail(df_added, 5) # Retrieves the last 5 observations of the dataframe
    names ages      major commuter
7   Ringo   19      Cyber     TRUE
8  George   18   Comp-Sci    FALSE
9   Ringo   20 Undeclared     TRUE
10   John   25       Data    FALSE
11   Pete   24 Percussion    FALSE
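One way to avoid changing the column types is to build the new row as a one-row dataframe with matching column names before calling rbind(). The sketch below uses a small made-up dataframe, df_small (hypothetical, not the df above), and compares the two approaches:

```r
# Small illustrative dataframe (values are made up)
df_small <- data.frame(
  names    = c("Paul", "George"),
  ages     = c(18L, 22L),
  commuter = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

# A plain c() vector is all character, so the columns get converted
as_vector <- rbind(df_small, c("Pete", 24, FALSE))
sapply(as_vector, class)

# A one-row dataframe with matching column names keeps the types
new_row <- data.frame(names = "Pete", ages = 24L, commuter = FALSE,
                      stringsAsFactors = FALSE)
as_row <- rbind(df_small, new_row)
sapply(as_row, class)
```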
Using the dataframe you previously made for Emmit, add a new row to the list of attendees: Gracie who is 39, a speaker, and not registered.
We can learn about the dataframe’s properties using a few different functions. One function called dim() will tell us about the number of rows and columns in the dataset (the dimension). Meanwhile, nrow() will tell us the number of rows, and ncol() will tell us the number of columns in the dataframe.
dim(df)
[1] 10  4
nrow(df)
[1] 10
ncol(df)
[1] 4
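Since dim() returns an ordinary vector, we can index into it to recover the same numbers that nrow() and ncol() give. The sketch below uses a small made-up dataframe called scores (hypothetical) to show how the three functions relate:

```r
# Illustrative dataframe with 6 rows and 2 columns (values are made up)
scores <- data.frame(id = 1:6, score = c(90, 85, 77, 92, 68, 81))

# dim() is the row count followed by the column count
dims <- dim(scores)
dims[1] == nrow(scores)
dims[2] == ncol(scores)

# length() of a dataframe counts its columns, not its rows
length(scores)
```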
Using the dataframe you previously made for Emmit with Gracie now added, determine the number of rows and columns the dataframe has.
Sometimes it is helpful to rename the columns of a dataframe if we do not like the names inherited from the vectors. We can do this when we create the dataframe, as in the example below:
x <- 0:9
y <- 10:19
z <- 20:29
df3 <- data.frame("singles"=x, "tens"=y, "twenties"=z)
head(df3)
  singles tens twenties
1       0   10       20
2       1   11       21
3       2   12       22
4       3   13       23
5       4   14       24
6       5   15       25
If we did not name the columns when creating the dataframe, or we want to rename the columns of a dataframe that already exists in R, we can do so using the colnames() function.
df3 <- data.frame(x,y,z)
colnames(df3)
[1] "x" "y" "z"
colnames(df3) <- c("singles", "tens", "twenties")
head(df3,4)
  singles tens twenties
1       0   10       20
2       1   11       21
3       2   12       22
4       3   13       23
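Because colnames() returns a character vector, we can also rename a single column by indexing into it rather than retyping every name. The sketch below uses a made-up dataframe called df_nums (hypothetical) and renames one column by position and another by matching its current name:

```r
# Illustrative dataframe with default column names x, y, z
df_nums <- data.frame(x = 0:9, y = 10:19, z = 20:29)

# Rename only the second column by position
colnames(df_nums)[2] <- "tens"

# Rename a column by matching its current name
colnames(df_nums)[colnames(df_nums) == "z"] <- "twenties"

colnames(df_nums)
```

Matching by name is a little longer to type, but it keeps working if the column order changes later.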
Emmit decides that the dataframe for the conference should have different header names. Alter the dataframe so the column names are now: “participant_name”, “participant_age”, “conference_role”, and “paid”.
Accessing dataframe elements is similar to accessing vector elements, except that we are now dealing with a 2-dimensional object in R. Thus, we will need to specify both dimensions (row and column). You will hear me say over and over again in class: “Dataframes index by Row comma Column”. Within our index-selection brackets, we will need to have the comma present. Putting nothing before the comma indicates all rows, while putting nothing after the comma indicates all columns.
If we wish to display the element in the second row and first column, we call our dataframe and put ‘[2,1]’ in the index-selection brackets. If we want to display the 3rd observation (3rd row), we can just put ‘[3,]’ in the brackets. To display all of the elements in the 2nd column, we put ‘[,2]’ in the brackets. An example of this can be seen below:
head(df, 4)
   names ages    major commuter
1   Paul   18     Math     TRUE
2 George   22     Math     TRUE
3   John   23 Comp-Sci     TRUE
4  Ringo   21    Cyber     TRUE
df[2,1] # Element in the 2nd row and 1st column
[1] "George"
df[3,] # Elements in the 3rd row
  names ages    major commuter
3  John   23 Comp-Sci     TRUE
df[,2] # Elements in the 2nd column
 [1] 18 22 23 21 19 21 19 18 20 25
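Notice that selecting a single column with the Row comma Column pattern returns a plain vector rather than a dataframe. If we would rather keep the dataframe shape, one option is the drop = FALSE argument. The sketch below uses a small made-up dataframe called pets (hypothetical) to compare the two results:

```r
# Small illustrative dataframe (values are made up)
pets <- data.frame(
  name = c("Rex", "Tom", "Ivy"),
  age  = c(3L, 5L, 2L),
  stringsAsFactors = FALSE
)

# A single column comes back as a plain vector by default
as_vec <- pets[, 2]
is.data.frame(as_vec)

# drop = FALSE keeps the result as a one-column dataframe
as_df <- pets[, 2, drop = FALSE]
is.data.frame(as_df)
dim(as_df)
```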
We can also retrieve the elements in a column by typing a dollar sign ($) followed by the column name. The resulting output will be a vector, not a dataframe, and will contain all of the values in that column.
df$names
 [1] "Paul"   "George" "John"   "Ringo"  "Ringo"  "Paul"   "Ringo"  "George"
 [9] "Ringo"  "John"
df$ages
 [1] 18 22 23 21 19 21 19 18 20 25
df$major
 [1] "Math"       "Math"       "Comp-Sci"   "Cyber"      "Comp-Sci"
 [6] "Data"       "Cyber"      "Comp-Sci"   "Undeclared" "Data"
df$commuter
 [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE
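The dollar sign is not the only way to pull out a column as a vector. The sketch below, again using a made-up dataframe called pets (hypothetical), shows that double brackets and the Row comma Column form with a column name give the same result, and that double brackets also work when the column name is stored in a variable:

```r
# Small illustrative dataframe (values are made up)
pets <- data.frame(
  name = c("Rex", "Tom", "Ivy"),
  age  = c(3L, 5L, 2L),
  stringsAsFactors = FALSE
)

# Three ways to retrieve the same column as a vector
by_dollar  <- pets$age
by_bracket <- pets[["age"]]
by_comma   <- pets[, "age"]

identical(by_dollar, by_bracket)
identical(by_dollar, by_comma)

# [[ ]] accepts a column name held in a variable; $ does not
col <- "name"
picked <- pets[[col]]
picked
```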
We can retrieve multiple rows and/or columns by passing a vector into our index-selection brackets. Because a column retrieved with $ is itself a vector, we can also index into that result. Additionally, we can pass a logical vector into the index-selection brackets to display only the rows that meet certain criteria. Examples of these can be seen below:
df[3:5, c(1,3)] # Rows 3 through 5 and columns 1 and 3
  names    major
3  John Comp-Sci
4 Ringo    Cyber
5 Ringo Comp-Sci
df$ages
 [1] 18 22 23 21 19 21 19 18 20 25
df$ages[2:4]
[1] 22 23 21
df$ages <= 21
 [1]  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
df[df$ages <= 21, ] # Displays just the TRUE rows
   names ages      major commuter
1   Paul   18       Math     TRUE
4  Ringo   21      Cyber     TRUE
5  Ringo   19   Comp-Sci    FALSE
6   Paul   21       Data    FALSE
7  Ringo   19      Cyber     TRUE
8 George   18   Comp-Sci    FALSE
9  Ringo   20 Undeclared     TRUE
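Logical vectors can also be combined before they go into the brackets, which lets us filter on more than one condition at a time. The sketch below uses a four-row made-up dataframe called students (hypothetical, not the df above) with the & (and), | (or), and ! (not) operators:

```r
# Small illustrative dataframe (values are made up)
students <- data.frame(
  names    = c("Paul", "George", "John", "Ringo"),
  ages     = c(18L, 22L, 23L, 21L),
  commuter = c(TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# & keeps rows where both conditions are TRUE
young_commuters <- students[students$ages <= 21 & students$commuter, ]

# | keeps rows where at least one condition is TRUE
older_or_resident <- students[students$ages > 22 | !students$commuter, ]

young_commuters$names
older_or_resident$names
```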
Emmit would like to display some key information about the conference attendees. First he would like to display all of the people who are speaking, then all of the people who are under the age of 35, and finally all of the people who are attendees and have paid the registration fee.
Finally, we can remove a row or column from our dataframe, but we want to be very careful when doing this, as we might not be able to reverse the action. To remove a row or column, we put a negative sign in front of its number in the index-selection brackets and assign the result, either to a new name or over the original dataframe. Removing a row and removing a column can both be seen below:
df4 <- df[-2,] # Everything but the second row
head(df4,3)
  names ages    major commuter
1  Paul   18     Math     TRUE
3  John   23 Comp-Sci     TRUE
4 Ringo   21    Cyber     TRUE
df4 <- df[,-3] # Everything but the third column
head(df4,3)
   names ages commuter
1   Paul   18     TRUE
2 George   22     TRUE
3   John   23     TRUE
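Negative positions work well when we know exactly where a row or column sits, but we can also remove by name or by condition using the logical indexing from earlier. The sketch below uses a made-up dataframe called students (hypothetical, not the df above) and stores each result under a new name so that the original stays intact:

```r
# Small illustrative dataframe (values are made up)
students <- data.frame(
  names = c("Paul", "George", "John", "Ringo"),
  ages  = c(18L, 22L, 23L, 21L),
  major = c("Math", "Math", "Comp-Sci", "Cyber"),
  stringsAsFactors = FALSE
)

# Drop a column by name instead of by position
no_major <- students[, names(students) != "major"]

# Drop the rows that match a condition
no_george <- students[students$names != "George", ]

# Row and column removal can be combined in one step
trimmed <- students[-1, -2]

names(no_major)
no_george$names
dim(trimmed)
```

Saving to a new name, as df4 does in the example above, is a simple way to guard against removing something by mistake.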
Emmit decides that he does not need to record the attendees’ ages. He also found out that Adam is withdrawing from the conference. Help him remove the age column from the dataframe along with Adam’s information.