4 Vectors and Factors in R

4.1 Data Types in R

So far we have seen how we can create variables in R. It is important to note that all objects in R have a type (the big three are Doubles, Characters, and Logical). A benefit to R is that we do not have to declare each variable and type, as it can figure all of that information out without us needing to do anything. Doubles consist of our numbers, which can be integers or decimal values. This is not the same as floating point values in other languages though. We can coerce the value to be an integer by placing an “L” after it (but there is no real benefit right now for integers over doubles so don’t worry about it). Characters will be anything in quotes, whether it is a single letter, a word, a paragraph, a symbol, or a number. R will treat all characters the same. The last type we should discuss is the Logical type, which consists of TRUE and FALSE (yes it has to be in all caps) or the shortened T and F. If we are ever unsure of the type we are dealing with, we can use the typeof() function and R will tell us.

typeof(17.2)

[1] "double"

typeof(17L)

[1] "integer"

typeof(17)

[1] "double"

typeof("17")

[1] "character"

typeof(TRUE)

[1] "logical"

It is critical to know what type our value is as it will determine what functions we can use, how we compose expressions, and how we will interpret the results. For instance, we cannot do “math” on characters but we can do “math” with doubles and logicals (though it might not always make sense to do this).

"a" + "b"

Error in "a" + "b" : non-numeric argument to binary operator

3 + TRUE # TRUE is equal to 1 and FALSE is equal to 0

[1] 4

"abc" + 2

Error in "abc" + 2 : non-numeric argument to binary operator

One thing that makes R different then some other languages is that R is “vectorized” (meaning everything is a vector). This characteristic will allow us to quickly work with entire sets of data and avoids the need for iterations or loops (but the for and while loop still exist). So remember, Everything is a Vector!!!

4.2 Vectors

To create a vector containing multiple elements we will use the combine function \(c()\). This will allow us to combine multiple single-element vectors into a multi-element vector. An example of this process can be seen below:

x <- c(1,2,3,4)
x

[1] 1 2 3 4

y <- c("Hello", "my name is", "Dr. McCurdy")
y

[1] "Hello"       "my name is"  "Dr. McCurdy"

If we create a vector with multiple types of data in it then it reverts to the “lowest” one present (character \(<\) double \(<\) logical). So, if at least one character is present then all of the elements are turned into characters, and if no characters are present but a double is then all elements turn into doubles. We can see the type a vector is using the class() function.

x <- c(1,2,"3",4,5) # One character is present
class(x)

[1] "character"

[1] "1" "2" "3" "4" "5"

y <- c(1, 2, TRUE, 4, 5) # TRUE is turned into a 1
class(y)

[1] "numeric"

[1] 1 2 1 4 5

If we need to explicitly coerce a vector to be a certain type we can (for the most part). It is the nature of vectors to apply functions on every element, so using a function like as.numeric() or as.character() will try and force all elements to become a certain type.

x <- as.numeric(c(1,2,"3",4,5)) # 3 is coerced into a number
class(x)

[1] "numeric"

[1] 1 2 3 4 5

y <- as.character(c(1,2,3,4,5)) # All items are coerced into characters
class(y)

[1] "character"

[1] "1" "2" "3" "4" "5"

Whenever we are dealing with vectors it is important to resist the urge to fight the vector’s nature. That is, don’t try and overthink it or make it more difficult for yourself, R was meant to handle vectors. All arithmetic operators can be used on vectors. It will apply the operation between comparable elements and will “recycle” the shorter vector if needed.

x <- c(1,2,3,4,5)
x

[1] 1 2 3 4 5

x + 13

[1] 14 15 16 17 18

x * -2

[1]  -2  -4  -6  -8 -10

x <- c(1,2,3)
y <- c(5,6,7)

x + y

[1]  6  8 10

x^y

[1]    1   64 2187

x <- c(1,2,3,4,5)
y <- c(1,10,100)
x + y

Warning in x + y: longer object length is not a multiple of shorter object
length

[1]   2  12 103   5  15

We can also make vectors that are not numeric. These can be both character strings or logical values.

x <- c("this", "is", "a", "character vector!")
x

[1] "this"              "is"                "a"                
[4] "character vector!"

y <- "this is also a character vector!"
y

[1] "this is also a character vector!"

z <- c(T, T, F, F, T, F, T)
z

[1]  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE

4.3 Logical Operators

One of the most important things we do in this class (and we will repeatedly do it throughout the semester) is using logical operators on vectors. If we compare 2 vectors using logical operators, the result will be a logical vector. There are a few big ones that we will need to know, and they are pretty well known. For instance, \(<\) means less than, \(>\) means greater than, \(<=\) means less than or equal, \(>=\) means greater than or equal, \(==\) means equal (notice it is two equal signs), and \(!=\) means not equal.

x <- c(0, 4, 2, 5, 3, 6)
x

[1] 0 4 2 5 3 6

x == 3 # Checks each element to see if it is equal to 3

[1] FALSE FALSE FALSE FALSE  TRUE FALSE

x > 4 # Checks each element to see if it is greater than 4

[1] FALSE FALSE FALSE  TRUE FALSE  TRUE

4.4 Built-in statistical functions

Within R, there are also built-in functions that will make our lives easier. All of the functions in this section will require us to pass a vector into it and it will output a vector of the same size or of size 1 or 2. The names for these functions resemble the function name that we would say out loud.

sqrt(182)

[1] 13.49074

x <- c(2,49,381)
sqrt(x)

[1]  1.414214  7.000000 19.519221

log(x)

[1] 0.6931472 3.8918203 5.9427994

There are also functions in R that will do all of the calculations we did in the earlier lectures. Unfortunately, there is no function to calculate the mode of a dataset. When working with these functions, it is important to make sure we pass a vector into the functions. This is a common error students make when they start off.

x <- c(12, 10, 18, 11, 11, 15, 21)

sum(x)

[1] 98

mean(x)

[1] 14

median(x)

[1] 12

sd(x)

[1] 4.163332

We can determine how long a vector is using the function. We could then use this function along with the sum function to calculate the mean of the dataset:

length(4)

[1] 1

length(c(4,8,2))

[1] 3

length(x)

[1] 7

sum(x)/length(x) # Notice this is the mean we found above

[1] 14

In addition to this, we can calculate the minimum, maximum, and range difference of a vector. We say range difference since the range() function provides both the minimum and maximum value, while the diff() function gives us the difference between the two values:

[1] 12 10 18 11 11 15 21

sort(x)

[1] 10 11 11 12 15 18 21

min(x)

[1] 10

max(x)

[1] 21

range(x)

[1] 10 21

diff(range(x))

[1] 11

4.5 Indexing in R

The last big thing we want to mention about vectors is how the elements are indexed. R starts indexing at 1 (this means the first element is index 1, the second element is index 2, and so on) which is different then other languages which start counting at 0. To access specific values we can place the indices in brackets. I will often refer to these as our “Index-Selection” brackets. We must pass a vector into our index-selection brackets using the combine function c() if we have multiple elements that we want to display.

[1] 12 10 18 11 11 15 21

x[2]

[1] 10

x[length(x)]

[1] 21

x[c(1,3,6)]

[1] 12 18 15

4.6 Creating Factors in R

So far we have seen a couple of different types of data in R, including numeric, logical, and character types. Another type that we will use quite a bit are factors. Factors will allow us to represent categorical data that fit in only a finite number of distinct categories. These factors could either be words (like Small, Medium, or Large) or numbers (like a house having 1, 2, or 3 bedrooms). R does some things behind the scenes to make it efficient to store this qualitative data, but the only thing we really need to know is that if we have categorical data then we should probably save it as a factor using the factor() function.

Let’s create a vector containing a random sample of “Males” and “Females” using the sample() function. First, we should realize that this is categorical data, so we will want to convert it to a factor. We can then use the factor() function to convert it to categories. Notice how the levels are in alphabetical order.

x <- sample(c("Male", "Female"), 7, replace=TRUE)
x

[1] "Male"   "Male"   "Female" "Male"   "Female" "Male"   "Female"

fx <- factor(x)
fx

[1] Male   Male   Female Male   Female Male   Female
Levels: Female Male

We can also notice how the vector class and the element type are also altered. We don’t have to understand the “behind-the-scenes” happenings of why the element type is now an integer. But we should emphasize though that the new vector is no longer a character vector, rather it is a factor since it is categorical data.

class(x)

[1] "character"

class(fx)

[1] "factor"

typeof(x)

[1] "character"

typeof(fx)

[1] "integer"

4.7 Dealing with Ordinal Data

If we remember from our previous lectures, categorical data can fall into two types: Nominal and Ordinal. Right now our factor() function is returning the values in a nominal form (meaning the factors have no specific ordering) and thus we cannot compare values.

fx[1] > fx[2]

Warning in Ops.factor(fx[1], fx[2]) : '>' not meaningful for factors

[1] NA

If we want to create ordinal data in R then all we have to do is specify the ‘ordered’ argument to be TRUE. An example of this can be seen below, but note that the levels are in alphabetical order by default.

y <- sample(c("Small", "Medium", "Large"), 7, replace=TRUE)
y

[1] "Large"  "Medium" "Medium" "Large"  "Large"  "Medium" "Medium"

fy <- factor(y, ordered=TRUE)
fy

[1] Large  Medium Medium Large  Large  Medium Medium
Levels: Large < Medium

This is obviously not what we want. When we go to a restaurant and order a side of fries, we expect the Large to be bigger than the Medium which is bigger than the Small. To have the levels end up in the correct order, we will want to use the ‘levels’ argument and pass it a vector of the correct ordering. When you do this, the spelling has to be the same (even the capitalization) for R to recognize it. Now that we have an ordered factor, we can compare values using logical operators.

fy <- factor(y, ordered=TRUE, levels=c("Small", "Medium", "Large"))
fy

[1] Large  Medium Medium Large  Large  Medium Medium
Levels: Small < Medium < Large

class(fy)

[1] "ordered" "factor"

fy[1] > fy[2]

[1] TRUE

Additionally, we can always see the possible categories an element can have in the factor by using the levels() function. Finally, if we wish to change the levels then we can do so by assigning a new character vector to the current levels of factor. Be careful with this, as we do not want to pass in the order of the vector elements, rather we want to pass in the order of the levels.

fy

[1] Large  Medium Medium Large  Large  Medium Medium
Levels: Small < Medium < Large

levels(fy)

[1] "Small"  "Medium" "Large"

levels(fy) <- c("mini", "regular", "huge")
fy

[1] huge    regular regular huge    huge    regular regular
Levels: mini < regular < huge

4.8 RStudio

So far, we have been using the R Console (RGui) to run code. While this is sufficient to run all of our R code, we are going to introduce a program called RStudio. This is the most commonly used IDE (Integrated Development Environment) for R which will allow us to be more organized and better for trouble-shooting our code. It is maintained by Posit and can be downloaded . Since you already have R installed, you will only need to download the RStudio Desktop. Once this is done it should look something like this:

We will run our code in the bottom left panel with the option to save our code in the top left panel using a script or text file (more to come on this in future lectures). In the top right panel we will be able to see the variables saved in our R environment along with a history of our code. In the bottom right panel we will be able to view plots, read documentation, and browse files on our computer.