Within R, we can form structured data with multiple columns. These are called dataframes and are comparable to the way a spreadsheet looks. A dataframe can hold multiple columns, with each column being a vector. The column types may also vary: we can have a numeric vector, a logical vector, and a character vector all in the same dataframe. Dataframes deserve special attention because they are the predominant data structure in R. Almost all of the datasets that we encounter will be formatted as a dataframe.
Within a dataframe, each row represents an observation. Each vector (column) has the same length as all of the others, which gives the dataframe its "rectangular" shape. We can make one ourselves by passing the vectors to the data.frame() function. An example of this can be seen below:
x <- 12:16
y <- 80:84
df <- data.frame(x,y)
df
   x  y
1 12 80
2 13 81
3 14 82
4 15 83
5 16 84
Another example can be seen below. In creating this dataframe we will use the sample() function to randomly generate each vector of length 10. Because the values are generated randomly, you might see different values when you run the same code. To display the result I will use the head() function, which shows just the first few observations:
names <- sample(c("John", "Paul", "George", "Ringo"), 10, replace=TRUE)
ages <- sample(18:25, 10, replace=TRUE)
major <- sample(c("Undeclared", "Math", "Cyber", "Data", "Comp-Sci"), 10, replace=TRUE)
commuter <- sample(c(TRUE, FALSE), 10, replace=TRUE)
df <- data.frame(names, ages, major, commuter)
head(df, 5) # Retrieves the first 5 rows (6 by default)
   names ages    major commuter
1  Ringo   22     Math     TRUE
2   Paul   22     Data     TRUE
3   John   18 Comp-Sci    FALSE
4 George   23    Cyber     TRUE
5  Ringo   21     Math    FALSE
Emmit is organizing a small conference and wants to keep track of attendees. Make a dataframe that looks exactly like the following table:
    name age      role registered
1   Adam  30  attendee       TRUE
2   Brad  49   speaker       TRUE
3 Claire  32 organizer       TRUE
4 Donald  28  attendee      FALSE
5 Elaine  27  attendee      FALSE
6  Fiona  33 organizer       TRUE
One nice thing about dataframes is that each column keeps the class of the vector it was built from. We can see the structure of the dataframe by passing it to the str() function.
str(df)
'data.frame': 10 obs. of 4 variables:
$ names : chr "Ringo" "Paul" "John" "George" ...
$ ages : int 22 22 18 23 21 21 18 25 18 20
$ major : chr "Math" "Data" "Comp-Sci" "Cyber" ...
$ commuter: logi TRUE TRUE FALSE TRUE FALSE TRUE ...
We can also bind vectors together using the cbind() function, but we should be wary of this. When cbind() is given plain vectors it builds a matrix, and a matrix can hold only one type, so vectors of different types are all converted to the most general type among them (here, character). The function name cbind() stands for column bind, and it attaches a new column to the end of another vector or dataframe. An example can be seen below, where all of the columns are turned into characters:
df2 <- data.frame(cbind(names, ages, major, commuter))
head(df2,5)
   names ages    major commuter
1  Ringo   22     Math     TRUE
2   Paul   22     Data     TRUE
3   John   18 Comp-Sci    FALSE
4 George   23    Cyber     TRUE
5  Ringo   21     Math    FALSE
str(df2)
'data.frame': 10 obs. of 4 variables:
$ names : chr "Ringo" "Paul" "John" "George" ...
$ ages : chr "22" "22" "18" "23" ...
$ major : chr "Math" "Data" "Comp-Sci" "Cyber" ...
$ commuter: chr "TRUE" "TRUE" "FALSE" "TRUE" ...
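To make the difference concrete, here is a minimal sketch that builds the same dataframe both ways and compares the column classes. The vectors student, age, and commutes are made up for illustration, and the sketch assumes R 4.0 or later, where character columns are not converted to factors by default.

```r
# Hypothetical example vectors (not the randomly generated ones above)
student  <- c("Ann", "Bo", "Cy")
age      <- c(19L, 22L, 20L)
commutes <- c(TRUE, FALSE, TRUE)

# cbind() first builds a single-type (character) matrix
via_cbind <- data.frame(cbind(student, age, commutes))

# data.frame() keeps each vector's own class
via_data_frame <- data.frame(student, age, commutes)

sapply(via_cbind, class)       # every column is "character"
sapply(via_data_frame, class)  # "character", "integer", "logical"
```

When the columns have different types, passing the vectors straight to data.frame() is usually the safer choice.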
We can also attach a new row on the end of a dataframe using the rbind() function. Once again though, we should be wary about this as we can potentially alter the column types.
data_to_be_added <- c("Pete", 24, "Percussion", FALSE)
df_added <- rbind(df, data_to_be_added)
tail(df_added, 5) # Retrieves the last 5 observations of the dataframe
    names ages      major commuter
7  George   18 Undeclared    FALSE
8    Paul   25       Math     TRUE
9    John   18   Comp-Sci     TRUE
10   John   20      Cyber     TRUE
11   Pete   24 Percussion    FALSE
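The printed output hides the type change, so it can help to check the classes directly. The sketch below uses a small made-up dataframe called band (a stand-in, not the df from the text) and compares adding a character vector with adding a one-row dataframe, which is one possible way to keep the column types intact.

```r
# Hypothetical small dataframe standing in for df
band <- data.frame(
  names    = c("John", "Paul"),
  ages     = c(20L, 22L),
  major    = c("Math", "Data"),
  commuter = c(TRUE, FALSE)
)

# c() with mixed values yields a character vector, so the columns get coerced
band_vec <- rbind(band, c("Pete", 24, "Percussion", FALSE))
class(band_vec$ages)      # "character"

# A one-row dataframe carries a proper type for each column
new_row <- data.frame(names = "Pete", ages = 24L,
                      major = "Percussion", commuter = FALSE)
band_df <- rbind(band, new_row)
class(band_df$ages)       # "integer"
class(band_df$commuter)   # "logical"
```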
Using the dataframe you previously made for Emmit, add a new row to the list of attendees: Gracie who is 39, a speaker, and not registered.
We can learn about the dataframe’s properties using a few different functions. One function called dim() will tell us about the number of rows and columns in the dataset (the dimension). Meanwhile, nrow() will tell us the number of rows, and ncol() will tell us the number of columns in the dataframe.
dim(df)
[1] 10 4
nrow(df)
[1] 10
ncol(df)
[1] 4
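Since dim() returns a vector of length two, we can also index into it. The short sketch below, using a made-up dataframe called grid, shows that its first element matches nrow() and its second matches ncol().

```r
# Hypothetical dataframe with 10 rows and 4 columns
grid <- data.frame(a = 1:10, b = letters[1:10], c = rep(TRUE, 10), d = 10:1)

dims <- dim(grid)        # rows first, then columns
dims[1] == nrow(grid)    # TRUE
dims[2] == ncol(grid)    # TRUE
```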
Using the dataframe you previously made for Emmit with Gracie now added, determine the number of rows and columns the dataframe has.
Sometimes it is helpful to rename the columns of a dataframe if we do not like the names of the vectors they came from. We can do this when we create the dataframe, as in the example below:
x <- 0:9
y <- 10:19
z <- 20:29

df3 <- data.frame("singles"=x, "tens"=y, "twenties"=z)
head(df3)
  singles tens twenties
1       0   10       20
2       1   11       21
3       2   12       22
4       3   13       23
5       4   14       24
6       5   15       25
If we did not name the columns when creating the dataframe, or we want to rename the columns of a dataframe that already exists in R, we can use the colnames() function.
df3 <- data.frame(x,y,z)
colnames(df3)
[1] "x" "y" "z"
colnames(df3) <- c("singles", "tens", "twenties")
head(df3,4)
  singles tens twenties
1       0   10       20
2       1   11       21
3       2   12       22
4       3   13       23
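Because colnames() returns an ordinary character vector, we can also replace just one name instead of all of them. The following is a small sketch on a made-up dataframe called scores, renaming one column by position and another by matching its current name.

```r
# Hypothetical dataframe with default-style names
scores <- data.frame(x = 0:2, y = 10:12, z = 20:22)

# Rename only the second column, selected by position
colnames(scores)[2] <- "tens"

# Rename a column by matching its current name
colnames(scores)[colnames(scores) == "z"] <- "twenties"

colnames(scores)   # "x" "tens" "twenties"
```

Matching on the current name is a little more robust than using a position, since it still works if the column order changes.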
Emmit decides that the dataframe for the conference should have different header names. Alter the dataframe so the column names are now: “participant_name”, “participant_age”, “conference_role”, and “paid”.
Accessing dataframe elements will be similar to accessing vector elements except now we are dealing with a 2-dimensional object in R. Thus, we will need to specify both dimensions (row and column). You will hear me say over and over again in class: “Dataframes index by Row comma Column”. Within our index selection brackets, we will need to have the comma present. If we put nothing before the comma it will indicate all rows, while nothing after the comma will indicate all columns.
If we wish to display the element in the second row and first column, we call our dataframe and then write '[2,1]' in the index-selection brackets. If we want to display the 3rd observation (3rd row), we write '[3,]' in the index-selection brackets. To display all of the elements in the 2nd column, we write '[,2]'. An example of this can be seen below:
head(df, 4)
   names ages    major commuter
1  Ringo   22     Math     TRUE
2   Paul   22     Data     TRUE
3   John   18 Comp-Sci    FALSE
4 George   23    Cyber     TRUE
df[2,1] # Element in the 2nd row and 1st column
[1] "Paul"
df[3,] # Elements in the 3rd row
  names ages    major commuter
3  John   18 Comp-Sci    FALSE
df[,2] # Elements in the 2nd column
[1] 22 22 18 23 21 21 18 25 18 20
We can also retrieve the elements in a column by typing the dataframe name, a dollar sign ($), and then the column name. The resulting output will be a vector, not a dataframe, and will contain all of the values in that column.
df$names
[1] "Ringo" "Paul" "John" "George" "Ringo" "George" "George" "Paul"
[9] "John" "John"
df$ages
[1] 22 22 18 23 21 21 18 25 18 20
df$major
[1] "Math" "Data" "Comp-Sci" "Cyber" "Math"
[6] "Data" "Undeclared" "Math" "Comp-Sci" "Cyber"
df$commuter
[1] TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE TRUE
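The dollar sign and the bracket notation reach the same data. As a quick check, the sketch below uses a made-up dataframe called people and pulls the same column three ways. The drop = FALSE argument at the end is an extra option of the bracket notation that keeps the result as a one-column dataframe instead of a vector.

```r
# Hypothetical dataframe for illustration
people <- data.frame(names = c("Ringo", "Paul", "John"),
                     ages  = c(22L, 22L, 18L))

by_dollar   <- people$ages        # column by name with $
by_position <- people[, 2]        # column by position
by_name     <- people[, "ages"]   # column by name inside brackets

identical(by_dollar, by_position)  # TRUE
identical(by_dollar, by_name)      # TRUE
is.vector(by_dollar)               # TRUE: a vector, not a dataframe

# drop = FALSE keeps the dataframe structure for a single column
is.data.frame(people[, "ages", drop = FALSE])  # TRUE
```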
We can retrieve multiple rows and/or columns by passing a vector into our index-selection brackets. Because a column retrieved with the dollar sign is itself a vector, we can also index that result directly. Additionally, we can pass a logical vector into the index-selection brackets to display only the rows that meet certain criteria. Examples of these can be seen below:
df[3:5, c(1,3)] # Rows 3 through 5 and columns 1 and 3
   names    major
3   John Comp-Sci
4 George    Cyber
5  Ringo     Math
df$ages
[1] 22 22 18 23 21 21 18 25 18 20
df$ages[2:4]
[1] 22 18 23
df$ages <= 21
[1] FALSE FALSE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE
df[df$ages <= 21, ] # Displays just the TRUE rows
    names ages      major commuter
3    John   18   Comp-Sci    FALSE
5   Ringo   21       Math    FALSE
6  George   21       Data     TRUE
7  George   18 Undeclared    FALSE
9    John   18   Comp-Sci     TRUE
10   John   20      Cyber     TRUE
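Logical vectors can also be combined before they go into the brackets, using & for "and", | for "or", and ! for "not". The sketch below uses a made-up dataframe called pets to show one way of filtering on two conditions at once.

```r
# Hypothetical dataframe for illustration
pets <- data.frame(
  name       = c("Rex", "Tom", "Ivy", "Max"),
  age        = c(3L, 7L, 2L, 9L),
  vaccinated = c(TRUE, FALSE, TRUE, TRUE)
)

# Rows where BOTH conditions hold
young_and_vax <- pets[pets$age < 5 & pets$vaccinated, ]

# Rows where AT LEAST ONE condition holds
old_or_unvax <- pets[pets$age > 8 | !pets$vaccinated, ]

young_and_vax$name   # "Rex" "Ivy"
old_or_unvax$name    # "Tom" "Max"
```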
Emmit would like to display some key information about the conference attendees. First he would like to display all of the people who are speaking, then all of the people who are under the age of 35, and finally all of the people who are attendees and have paid the registration fee.
Finally, we can remove a single row or column from our dataframe, but we want to be very careful when doing this, as we might not be able to reverse the action if we overwrite the original. To remove a row or column, we state our dataframe and then, in the index-selection brackets, put a negative sign in front of the row or column we want to remove, assigning the result to a name. Both cases can be seen below:
df4 <- df[-2,] # Everything but the second row
head(df4,3)
   names ages    major commuter
1  Ringo   22     Math     TRUE
3   John   18 Comp-Sci    FALSE
4 George   23    Cyber     TRUE
df4 <- df[,-3] # Everything but the third column
head(df4,3)
  names ages commuter
1 Ringo   22     TRUE
2  Paul   22     TRUE
3  John   18    FALSE
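Negative positions work well when we know exactly where a row or column sits. When we only know its name or value, a logical condition is another option. The sketch below, on a made-up dataframe called people, drops a column by name and a row by value, and stores each result under a new name so that the original stays untouched.

```r
# Hypothetical dataframe for illustration
people <- data.frame(
  names = c("Ringo", "Paul", "John"),
  ages  = c(22L, 22L, 18L),
  major = c("Math", "Data", "Comp-Sci")
)

# Keep every column whose name is not "major"
no_major <- people[, colnames(people) != "major"]

# Keep every row whose names value is not "Paul"
no_paul <- people[people$names != "Paul", ]

colnames(no_major)   # "names" "ages"
no_paul$names        # "Ringo" "John"
dim(people)          # 3 3: the original is unchanged
```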
Emmit decides that he does not need to record the attendees' ages. He also found out that Adam is withdrawing from the conference. Help him remove the age column from the dataframe along with Adam's information.