Within R, we can form structured data with multiple columns. These are called dataframes and are comparable to the way a spreadsheet looks. A dataframe can contain multiple columns, with each column being a vector. The column types may vary: we can have a numeric vector, a logical vector, and a character vector all in the same dataframe. Dataframes are worth discussing in detail because they are the predominant data structure in R; almost all of the datasets that we encounter will be formatted as a dataframe.
Within a dataframe, each row represents an observation. Every column (vector) must have the same length as all of the others, which gives the dataframe its “rectangular” shape. We can make one ourselves by passing the vectors to the data.frame() function. An example of this can be seen below:
x <- 12:16
y <- 80:84
df <- data.frame(x,y)
df
x y
1 12 80
2 13 81
3 14 82
4 15 83
5 16 84
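As a small sketch of the two points above (mixed column types and equal column lengths), the example below builds a dataframe from three made-up vectors. The names id, label, passed, and mixed are hypothetical and chosen only for illustration.

```r
# Hypothetical example vectors of three different types
id     <- 1:3                     # integer vector
label  <- c("a", "b", "c")        # character vector
passed <- c(TRUE, FALSE, TRUE)    # logical vector

mixed <- data.frame(id, label, passed)
is_df       <- is.data.frame(mixed)  # confirms we built a dataframe
col_lengths <- lengths(mixed)        # length of each column (all 3)

# Lengths 3 and 2 cannot be lined up into rows, so data.frame() stops
# with an error; tryCatch() captures it instead of halting the script
bad <- tryCatch(data.frame(a = 1:3, b = 1:2),
                error = function(e) "error")
```

Here bad ends up holding the string "error", which shows that R refuses to build a dataframe whose columns do not line up into rows.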
Another example can be seen below. In creating this dataframe we will use the sample() function to randomly generate vectors of length 10. When you run the same code you might see different values, since they are generated randomly. To display the result I will use the head() function, which shows just the first few observations:
names <- sample(c("John", "Paul", "George", "Ringo"), 10, replace=TRUE)
ages <- sample(18:25, 10, replace=TRUE)
major <- sample(c("Undeclared", "Math", "Cyber", "Data", "Comp-Sci"), 10, replace=TRUE)
commuter <- sample(c(TRUE, FALSE), 10, replace=TRUE)
df <- data.frame(names, ages, major, commuter)
head(df, 5) # Retrieves the first 5 rows (6 by default)
names ages major commuter
1 Ringo 19 Cyber FALSE
2 George 18 Cyber TRUE
3 Paul 20 Cyber FALSE
4 Ringo 23 Cyber TRUE
5 Paul 22 Cyber TRUE
One nice thing about dataframes is that each column retains the class of the vector it was built from. We can see the structure of a dataframe by passing it into the str() function.
str(df)
'data.frame': 10 obs. of 4 variables:
$ names : chr "Ringo" "George" "Paul" "Ringo" ...
$ ages : int 19 18 20 23 22 25 18 22 21 22
$ major : chr "Cyber" "Cyber" "Cyber" "Cyber" ...
$ commuter: logi FALSE TRUE FALSE TRUE TRUE FALSE ...
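To check the claim about retained classes directly, a minimal sketch is to compare the class of each original vector with the class of the matching column. The vectors scores, codes, and flags and the dataframe small are hypothetical names used only here.

```r
# Hypothetical vectors with three different classes
scores <- c(90.5, 82, 77.25)     # numeric (double)
codes  <- c(1L, 2L, 3L)          # integer
flags  <- c(TRUE, FALSE, TRUE)   # logical

small <- data.frame(scores, codes, flags)

# Compare each vector's class with the class of its dataframe column
same_class <- c(
  scores = identical(class(scores), class(small$scores)),
  codes  = identical(class(codes),  class(small$codes)),
  flags  = identical(class(flags),  class(small$flags))
)
same_class
```

Every entry of same_class is TRUE, so data.frame() left the column classes alone.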
We can also bind vectors together using the cbind() function, but we should be wary of this: when the vectors have different types, cbind() builds a matrix, and every value is coerced to a single common type. The function name cbind() stands for column bind; it attaches a new column to the end of another vector/dataframe. An example can be seen below, where all of the columns are turned into characters:
df2 <- data.frame(cbind(names, ages, major, commuter))
head(df2,5)
names ages major commuter
1 Ringo 19 Cyber FALSE
2 George 18 Cyber TRUE
3 Paul 20 Cyber FALSE
4 Ringo 23 Cyber TRUE
5 Paul 22 Cyber TRUE
str(df2)
'data.frame': 10 obs. of 4 variables:
$ names : chr "Ringo" "George" "Paul" "Ringo" ...
$ ages : chr "19" "18" "20" "23" ...
$ major : chr "Cyber" "Cyber" "Cyber" "Cyber" ...
$ commuter: chr "FALSE" "TRUE" "FALSE" "TRUE" ...
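The coercion happens because cbind() on plain vectors produces a matrix, and a matrix can hold only one type. A rough sketch of the difference, using the hypothetical vectors nm and age: binding a vector onto an existing dataframe keeps the column's own type.

```r
# Hypothetical vectors for illustration
nm  <- c("Ann", "Bob")   # character
age <- c(30L, 41L)       # integer

# Vectors only: the result is a matrix, so everything becomes character
m <- cbind(nm, age)
m_is_matrix <- is.matrix(m)
m_type      <- typeof(m)

# Dataframe first: cbind() adds age as a new column and keeps it integer
base     <- data.frame(nm)
with_age <- cbind(base, age)
age_is_integer <- is.integer(with_age$age)
```

If the goal is a dataframe, calling data.frame() on the vectors directly, as in the earlier examples, avoids the matrix step altogether.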
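We can also attach a new row on the end of a dataframe using the rbind() function. Once again though, we should be wary about this as we can potentially alter the column types: a row built with c() from mixed values is stored as a single character vector, so binding it can turn the other columns into characters as well.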
data_to_be_added <- c("Pete", 24, "Percussion", FALSE)
df_added <- rbind(df, data_to_be_added)
tail(df_added, 5) # Retrieves the last 5 observations of the dataframe
names ages major commuter
7 Ringo 18 Math TRUE
8 John 22 Cyber FALSE
9 Ringo 21 Cyber TRUE
10 Ringo 22 Math FALSE
11 Pete 24 Percussion FALSE
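One way to sidestep the type change, sketched here with a hypothetical dataframe called people, is to build the new row as a one-row dataframe with matching column names rather than as a c() vector.

```r
# Hypothetical dataframe for illustration
people <- data.frame(name     = c("Ann", "Bob"),
                     age      = c(30L, 41L),
                     commuter = c(TRUE, FALSE))

# Row as a vector: c() coerces all three values to character first
new_vec    <- c("Cy", 25, TRUE)
via_vector <- rbind(people, new_vec)
classes_vector <- sapply(via_vector, class)

# Row as a one-row dataframe: each value keeps its own type
new_row   <- data.frame(name = "Cy", age = 25L, commuter = TRUE)
via_frame <- rbind(people, new_row)
classes_frame <- sapply(via_frame, class)
```

Comparing classes_vector with classes_frame shows that only the second approach leaves age as an integer and commuter as a logical.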
We can learn about the dataframe’s properties using a few different functions. One function called dim() will tell us about the number of rows and columns in the dataset (the dimension). Meanwhile, nrow() will tell us the number of rows, and ncol() will tell us the number of columns in the dataframe.
dim(df)
[1] 10 4
nrow(df)
[1] 10
ncol(df)
[1] 4
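As a quick sanity check on how these three functions relate, the sketch below uses a hypothetical dataframe named grid. The output of dim() is itself a vector, so its pieces can be pulled out by position.

```r
# Hypothetical dataframe with 6 rows and 2 columns
grid <- data.frame(a = 1:6, b = letters[1:6])

d <- dim(grid)                 # rows first, then columns

# dim() is just nrow() and ncol() side by side
matches <- identical(d, c(nrow(grid), ncol(grid)))

n_rows <- d[1]                 # same value as nrow(grid)
n_cols <- d[2]                 # same value as ncol(grid)
```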
Sometimes it is helpful to name the columns of a dataframe if we do not like the current names of the vectors. We can do this when we create the dataframe, as in the example below:
x <- 0:9
y <- 10:19
z <- 20:29

df3 <- data.frame("singles"=x, "tens"=y, "twenties"=z)
head(df3)
singles tens twenties
1 0 10 20
2 1 11 21
3 2 12 22
4 3 13 23
5 4 14 24
6 5 15 25
If we did not name the columns when creating the dataframe, or we already have a dataframe in R and want to rename its columns, we can do so using the colnames() function.
df3 <- data.frame(x,y,z)
colnames(df3)
[1] "x" "y" "z"
colnames(df3) <- c("singles", "tens", "twenties")
head(df3,4)
singles tens twenties
1 0 10 20
2 1 11 21
3 2 12 22
4 3 13 23
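Because colnames() returns an ordinary character vector, a single name can also be replaced without retyping the others. The sketch below assumes a small hypothetical dataframe called tbl.

```r
# Hypothetical dataframe for illustration
tbl <- data.frame(x = 0:2, y = 10:12)

# Rename only the second column, by position
colnames(tbl)[2] <- "tens"

# Rename a column by matching its current name
colnames(tbl)[colnames(tbl) == "x"] <- "singles"

new_names <- colnames(tbl)
```

Matching on the current name is a bit safer than using a position, since it still works if the column order changes.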
Accessing dataframe elements will be similar to accessing vector elements except now we are dealing with a 2-dimensional object in R. Thus, we will need to specify both dimensions (row and column). You will hear me say over and over again in class: “Dataframes index by Row comma Column”. Within our index selection brackets, we will need to have the comma present. If we put nothing before the comma it will indicate all rows, while nothing after the comma will indicate all columns.
If we wish to display the element in the second row and first column, we call our dataframe and put ‘[2,1]’ in the index-selection brackets. If we want to display the 3rd observation (3rd row), we can just say ‘[3,]’. To display all of the elements in the 2nd column we would say ‘[,2]’. An example of this can be seen below:
head(df, 4)
names ages major commuter
1 Ringo 19 Cyber FALSE
2 George 18 Cyber TRUE
3 Paul 20 Cyber FALSE
4 Ringo 23 Cyber TRUE
df[2,1] # Element in the 2nd row and 1st column
[1] "George"
df[3,] # Elements in the 3rd row
names ages major commuter
3 Paul 20 Cyber FALSE
df[,2] # Elements in the 2nd column
[1] 19 18 20 23 22 25 18 22 21 22
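Notice that selecting a single row returned a dataframe while selecting a single column returned a plain vector. The sketch below, using a hypothetical dataframe called pets, shows this along with two variations: choosing a column by name instead of position, and adding drop = FALSE to keep a one-column result as a dataframe.

```r
# Hypothetical dataframe for illustration
pets <- data.frame(name = c("Rex", "Tom", "Kit"),
                   age  = c(3L, 5L, 1L))

one_cell <- pets[2, 1]                   # row 2, column 1
one_row  <- pets[3, ]                    # a one-row dataframe
age_vec  <- pets[, "age"]                # column chosen by name: a vector
age_df   <- pets[, "age", drop = FALSE]  # same column, kept as a dataframe
```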
We can also retrieve elements in a column by using a dollar sign ($) and then typing the column name. This resulting output will be a vector, not a dataframe, and will contain all of the values in that column.
df$names
[1] "Ringo" "George" "Paul" "Ringo" "Paul" "George" "Ringo" "John"
[9] "Ringo" "Ringo"
df$ages
[1] 19 18 20 23 22 25 18 22 21 22
df$major
[1] "Cyber" "Cyber" "Cyber" "Cyber" "Cyber" "Cyber" "Math" "Cyber" "Cyber"
[10] "Math"
df$commuter
[1] FALSE TRUE FALSE TRUE TRUE FALSE TRUE FALSE TRUE FALSE
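Since the result of $ is an ordinary vector, it can be handed straight to any function that expects a vector. A brief sketch with a hypothetical dataframe called pets follows; the double-bracket form shown is an equivalent way to pull out a column by name.

```r
# Hypothetical dataframe for illustration
pets <- data.frame(name = c("Rex", "Tom", "Kit"),
                   age  = c(3L, 5L, 1L))

ages_dollar  <- pets$age        # column as a vector
ages_bracket <- pets[["age"]]   # same column via double brackets

same_result <- identical(ages_dollar, ages_bracket)
average_age <- mean(pets$age)   # vector functions work directly on the column
```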
We can retrieve multiple rows and/or columns by passing a vector into our index-selection brackets. Because a column retrieved with $ is itself a vector, we can also index into that result with its own brackets. Additionally, we can pass a logical vector into the index-selection brackets to display only the rows that meet certain criteria. Examples of these can be seen below:
df[3:5, c(1,3)] # Rows 3 through 5 and columns 1 and 3
names major
3 Paul Cyber
4 Ringo Cyber
5 Paul Cyber
df$ages
[1] 19 18 20 23 22 25 18 22 21 22
df$ages[2:4]
[1] 18 20 23
df$ages <= 21
[1] TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
df[df$ages <= 21, ] # Displays just the TRUE rows
names ages major commuter
1 Ringo 19 Cyber FALSE
2 George 18 Cyber TRUE
3 Paul 20 Cyber FALSE
7 Ringo 18 Math TRUE
9 Ringo 21 Cyber TRUE
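Logical vectors can be combined before they go into the brackets, and a column selection can be added after the comma in the same step. The following is a minimal sketch on a hypothetical dataframe called staff.

```r
# Hypothetical dataframe for illustration
staff <- data.frame(name     = c("Ann", "Bob", "Cy", "Di"),
                    age      = c(19L, 24L, 21L, 30L),
                    commuter = c(TRUE, TRUE, FALSE, TRUE))

young <- staff$age <= 21                 # TRUE for Ann and Cy only
young_rows <- staff[young, ]             # rows where young is TRUE

# & keeps only rows where both conditions are TRUE
young_commuters <- staff[staff$age <= 21 & staff$commuter, ]

# Filter rows and pick one column at the same time
older_names <- staff[staff$age > 21, "name"]
```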
Finally, we can remove a row or column from our dataframe, but we want to be very careful when doing this, as we might not be able to reverse the action if we overwrite the original. To remove a row or column, we place a negative sign in front of its index in the index-selection brackets and assign the result to a dataframe (here a new one, df4, so the original df is kept). Both methods can be seen below:
df4 <- df[-2,] # Everything but the second row
head(df4,3)
names ages major commuter
1 Ringo 19 Cyber FALSE
3 Paul 20 Cyber FALSE
4 Ringo 23 Cyber TRUE
df4 <- df[,-3] # Everything but the third column
head(df4,3)
names ages commuter
1 Ringo 19 FALSE
2 George 18 TRUE
3 Paul 20 FALSE
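The same negative-index idea extends to several rows or columns at once by negating a vector. The sketch below uses a hypothetical dataframe called staff and always saves the result under a new name, so the original stays available.

```r
# Hypothetical dataframe for illustration
staff <- data.frame(name     = c("Ann", "Bob", "Cy", "Di"),
                    age      = c(19L, 24L, 21L, 30L),
                    commuter = c(TRUE, TRUE, FALSE, TRUE))

trimmed <- staff[-c(1, 3), ]    # everything but rows 1 and 3
no_age  <- staff[, -2]          # everything but the second column

# Dropping a column by name instead of position, using a logical vector
by_name <- staff[, colnames(staff) != "age"]

kept_row_labels <- rownames(trimmed)   # surviving rows keep their old labels
```

Note that the remaining rows keep their original row labels ("2" and "4" here), just as the output above skips from row 1 to row 3.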