5  More R

So far we have seen the basics of R, including ways to use R as a calculator as well as how to create variables and vectors. As a quick reminder, we can create a vector containing multiple elements by using our c() function (which stands for combine). After creating a vector we could then save it to a variable name by having the value point towards the name of our variable. To do this, we will use the assignment operator which looks like this: \(<\)–. An example of this can be seen below:

x <- c(1,2,3,4,5) # Combining multiple elements into a single vector 
x
[1] 1 2 3 4 5
c("Hello", "world!") -> y # This also works but is not recommended
y
[1] "Hello"  "world!"

5.1 Creating a Sequence of Numbers

There are additional ways we could make a vector as well. If we wanted all of our numbers in a sequence then we could use the seq() function to do this. With this function, we can specify what our starting value is and what we want the sequence to do. The function also allows us to pass an argument into it which specifies what value we should increment the sequence by. If we do not specify what we should increment by then it will automatically default to 1. We can also count down if we would like. Finally, it should be mentioned that most functions do not require us to write the argument name as long as we pass them in the correct order (but we should probably keep doing it).

seq(from=1, to=10)
 [1]  1  2  3  4  5  6  7  8  9 10
seq(from=1, to=10, by=2)
[1] 1 3 5 7 9
seq(7, 1)
[1] 7 6 5 4 3 2 1
seq(from=30, to=1, by=-4)
[1] 30 26 22 18 14 10  6  2
seq(-1,2,by=0.3)
 [1] -1.0 -0.7 -0.4 -0.1  0.2  0.5  0.8  1.1  1.4  1.7  2.0

If we do not need to increment the sequence by a certain value then we could do something similar with a colon. This will create a sequence by just adding 1 to a value until it reaches the end number. An example of this can be seen below:

1:10
 [1]  1  2  3  4  5  6  7  8  9 10
7:-2
 [1]  7  6  5  4  3  2  1  0 -1 -2
0.25:7.75 # Notice how it does not go past the ending value
[1] 0.25 1.25 2.25 3.25 4.25 5.25 6.25 7.25

5.2 Replicating Values

We can replicate values (of any type) using the rep() function. This will allow us to pass a vector into the function and have it be replicated a certain number of times (and the number of times could also be a vector!). If we pass a vector in for the number of times then it will match it element by element and replicate it the specified number of times before going on to the next element.

rep(2, times=6)
[1] 2 2 2 2 2 2
rep("abc", times=3)
[1] "abc" "abc" "abc"
rep(1:4, times=4:1) # 1st element 4 times, 2nd element 3 times, etc.
 [1] 1 1 1 1 2 2 2 3 3 4
rep(c(7,2,1), times=c(1,4,8))
 [1] 7 2 2 2 2 1 1 1 1 1 1 1 1

The function allows for other arguments as well, such as the length of the outputted sequence and if all of the elements should be replicated a certain amount of times. The ‘each’ argument will replicate each element a certain number of times while the ‘length.out’ argument will keep repeating a sequence until it is of a certain length.

rep(c(7,2,1), times=2)
[1] 7 2 1 7 2 1
rep(c(7,2,1), each=2)
[1] 7 7 2 2 1 1
rep(c(7,2,1), length.out = 8)
[1] 7 2 1 7 2 1 7 2
rep(c(7,2,1), each=3, length.out=10)
 [1] 7 7 7 2 2 2 1 1 1 7

5.3 Creating a Vector of Letters

As a quick reminder (since it is very important), R starts indexing at 1. This means that the first element in a vector is in the ‘1’ index. Other languages, like Python, start counting at 0 but we will start counting at 1 in R. In R there is a vector called ‘letters’ that contains all of the lower-case letters. If we want to find out what the 4th letter is then we can use our “index-selection” brackets. We can also pass a vector into the index selection brackets as well, which includes sequences and replicated vectors (as long as they are numeric). There is also a vector in R called `LETTERS’ which acts the same way but contains all capitalized letters.

x <- letters # Saving the vector to 'x' for simplicity
x
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
x[c(13, 15, 21, 14, 20)]
[1] "m" "o" "u" "n" "t"
x[17:22]
[1] "q" "r" "s" "t" "u" "v"
LETTERS[c(8, 5, 12, 12, 15)]
[1] "H" "E" "L" "L" "O"

If we ever wish to have all of the values displayed except for certain ones then we can put a negative in front of the value/vector and it will display everything except those values. This is helpful when it is easier to exclude certain indices instead of having to specify all of the desired indices.

x[seq(1,26,by=2)] # Every other letter
 [1] "a" "c" "e" "g" "i" "k" "m" "o" "q" "s" "u" "w" "y"
x[-seq(1,26,by=2)] # Everything but these indices
 [1] "b" "d" "f" "h" "j" "l" "n" "p" "r" "t" "v" "x" "z"
y <- 1:10
y
 [1]  1  2  3  4  5  6  7  8  9 10
y[c(1,5,8)]
[1] 1 5 8
y[-c(1,5,8)]
[1]  2  3  4  6  7  9 10

5.4 Named Vectors

Since we have been discussing vectors and their elements, we should go ahead and mention that we can name the individual elements as well (this might make it easier for us to refer back to a specific element as we might not remember which index it is). There are a few different ways we can do this, the first is by naming them when we create the vector itself. To do this we will just put the name of the element to the left of the element with an equals sign in between. It may look something like this:

x <- c(M="Monday", W="Wednesday", F="Friday")
x
          M           W           F 
   "Monday" "Wednesday"    "Friday" 
names(x)
[1] "M" "W" "F"

Another way that might be beneficial is to name them after the vector has been created using the names() function. We will reference the names function with the vector inside and we will assign a character vector to it. This will update the names of the vector elements as whatever we passed into it. An example of this can also be seen below:

x <- c("Monday", "Wednesday", "Friday")
names(x)
NULL
names(x) <- c("M", "W", "F")
x
          M           W           F 
   "Monday" "Wednesday"    "Friday" 
names(x)
[1] "M" "W" "F"

Because the elements are named, we can pass the names into our index-selection brackets and R will output the element associated with that particular name. We will also still be able to access them with the index value as well.

x["M"]
       M 
"Monday" 
x[c("M", "F")]
       M        F 
"Monday" "Friday" 
x[c(1,3)]
       M        F 
"Monday" "Friday" 

5.5 Index Selection using GREP

One very powerful tool in R that allows us to search a string for a specific pattern is the grep() function. This stands for Global/Regular Expression/Print and is important to us as it will allow us to identify all of the elements containing a specific pattern. To see this function in action we will utilize the “euro” vector which is a named vector available to us in R.

euro
        ATS         BEF         DEM         ESP         FIM         FRF 
  13.760300   40.339900    1.955830  166.386000    5.945730    6.559570 
        IEP         ITL         LUF         NLG         PTE 
   0.787564 1936.270000   40.339900    2.203710  200.482000 

Before we jump into using the function though, we will want to discuss some of the syntax that the grep() uses. The first is that it expects us to pass a pattern into it using quotation marks. If we type a \(\wedge\) at the beginning of the pattern then it will search for strings starting with the pattern. If we type a $ at the end of the pattern then it will search for strings ending with the pattern. A single period will stand for any character, and brackets characters in brackets will mean “any of these characters”. After we specify the pattern we will need to also specify where we are looking for the pattern, and in our case, it will be the names of the euro vector. The output for this function will be the indices of the elements which contain the specified pattern.

names(euro)
 [1] "ATS" "BEF" "DEM" "ESP" "FIM" "FRF" "IEP" "ITL" "LUF" "NLG" "PTE"
grep("E", names(euro)) # Indices of elements containing an E anywhere
[1]  2  3  4  7 11
euro[grep("E", names(euro))] # Names containing an E anywhere
       BEF        DEM        ESP        IEP        PTE 
 40.339900   1.955830 166.386000   0.787564 200.482000 
grep("^I", names(euro)) # Indices of elements starting with I
[1] 7 8
euro[grep("^I", names(euro))] # Names starting with I
        IEP         ITL 
   0.787564 1936.270000 
grep("F$", names(euro)) # Indices of elements ending with F
[1] 2 6 9
euro[grep("F$", names(euro))] # Names ending with F
     BEF      FRF      LUF 
40.33990  6.55957 40.33990 
grep(".E.", names(euro)) # Indices of elements containing _E_
[1] 2 3 7
euro[grep(".E.", names(euro))] # Names containing with _E_
      BEF       DEM       IEP 
40.339900  1.955830  0.787564 
grep(".[EI].", names(euro)) # Indices of elements containing _E_ or _I_
[1] 2 3 5 7
euro[grep(".[EI].", names(euro))] # Names containing _E_ or _I_
      BEF       DEM       FIM       IEP 
40.339900  1.955830  5.945730  0.787564 

While this function may seem a little complicated at first, it is a very powerful tool that will allow us to filter out observations that meet certain criteria. One example might be searching a vector for all observations with the last name “Smith” or for people whose first name starts with the letters “Ca”.

5.6 Logical Vectors and Index Selection

We briefly saw it in the previous lecture, but it is important for us to practice accessing vector elements using logical vectors. If any logical operator (\(<,<=, >, >=, ==, !=\)) is used to compare vectors, the resulting output will be a logical vector. This will always be the case! It is good practice to think about what the outputted results will be before we even run the code instead of just “hoping for the best”.

We can access the vector elements using logical vectors in two ways; explicitly and implicitly. Doing it explicitly would mean we save the logical vector to a new variable and then use the new variable in our index-selection brackets while doing it implicitly would mean we place the logical comparison into the index-selection brackets directly. I prefer the implicit method as we do not have to re-run the code if the original vector changes. Both examples can be seen below:

x <- c(4,8,2,6,7,6,3,8,6)
y <- x > 6

y
[1] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE
x[y] # Explicit creation
[1] 8 7 8
x[x>6] # Implicit creation
[1] 8 7 8
x
[1] 4 8 2 6 7 6 3 8 6
x >= 7
[1] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE
length(x >= 7) # Length of the vector is 11 elements
[1] 9
sum(x >= 7) # Adding the logical vector: TRUE = 1, FALSE = 0
[1] 3
x[x >= 7] # Displaying just the values whose index is TRUE
[1] 8 7 8

We can expand our capabilities with logical operators and introduce a few new operators which will allow us to evaluate multiple conditions at the same time. These include & (which represents and), \(\vert\) (which represents or and is the pipe symbol above the enter key), and ! (which represents not). The and will require both conditions to be true for the output to be true, the or only requires one condition to be true for the output to be true, and the not “flips” the final output to be the opposite.

x <- 1:11
x
 [1]  1  2  3  4  5  6  7  8  9 10 11
(x >5) & (x < 10) # Is the element greater than 5 and less than 10
 [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
x[(x >5) & (x < 10)] # Displaying the elements that are TRUE
[1] 6 7 8 9
(x < 5) | (x > 9) # Is the elements less than 5 or greater than 9
 [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
x[(x < 5) | (x > 9)] # Displaying the elements that are TRUE
[1]  1  2  3  4 10 11
!(x > 6) # Is the element not greater than 6
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
x[!(x > 6)] # Displaying the elements that are TRUE
[1] 1 2 3 4 5 6

It is important to play around with the logical operators to get comfortable with filtering out elements that meet certain criteria. If you have multiple conditions you may need to use parenthesis to have it do what you wish, as it does the and operation before the or operation.

x[(x > 5 | x < 3) & x < 4] 
[1] 1 2
x[x > 5 | x < 3 & x < 4]
[1]  1  2  6  7  8  9 10 11

5.7 Sample Function

There are a few special functions in R that we should discuss that will be used throughout the course. The first is the sample() function which will allow us to randomly sample values from a vector that we pass into it. We will be able to choose how many values we want to be outputted and whether we want to allow the repetition of the values (this will need to be true if we want to output more values than we passed in). The exact values that are returned are not predictable as they rely on a random number generator behind the scenes. If we wish to get the same values over and over again then we need to use the set.seed() function to achieve this goal. An example of using the sample() function can be seen below:

sample(1:10, 5, replace=TRUE)
[1]  1  6  7  9 10
sample(1:10, 5, replace=TRUE)
[1] 2 2 9 6 2
sample(1:10, 5, replace=TRUE)
[1] 7 5 1 8 8
sample(1:10, 15, replace=FALSE)

Error in sample.int(length(x), size, replace, prob) : cannot take a sample larger than the population when 'replace = FALSE'

sample(1:10, 15, replace=TRUE)
 [1]  4 10  9  3  9  9  3  2  1  1  2  6  4  1  8
set.seed(123)
sample(c("A", "B", "C"), 10, replace=TRUE)
 [1] "C" "C" "C" "B" "C" "B" "B" "B" "C" "A"
sample(c("A", "B", "C"), 10, replace=TRUE)
 [1] "B" "B" "A" "B" "C" "A" "C" "C" "A" "A"
set.seed(123) # This will result in the same thing as above
sample(c("A", "B", "C"), 10, replace=TRUE)
 [1] "C" "C" "C" "B" "C" "B" "B" "B" "C" "A"

5.8 Special Functions in R

Another special function that may come in handy is the which() function. What this will do is tell us the index values of the elements which meet certain criteria. Note in the sample below that it is telling us the 1st, 6th, 7th, and 10th elements in the vector are greater than 40 (it is not telling us the values greater than 40, just the indices of the elements):

x <- sample(1:50, 20, replace=TRUE)
x
 [1] 14 25 26 27  5 27 28  9 29 35  8 26  7 42  9 19 36 14 17 43
which(x > 40)
[1] 14 20

Other functions which may be of use to us are the duplicated() function and the unique() function. The first function, duplicated(), will return a logical vector with TRUE after the first occurrence of duplicated values. So, the second time (and additional times) a number appears it will output TRUE. The unique() function will return just the unique values of the vector, meaning it will remove all of the duplicated values.

x
 [1] 14 25 26 27  5 27 28  9 29 35  8 26  7 42  9 19 36 14 17 43
duplicated(x)
 [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE
[13] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
unique(x)
 [1] 14 25 26 27  5 28  9 29 35  8  7 42 19 36 17 43

Two other functions that will result in a single logical output are the any() and the all() function. These will do what their names sound like, that is they will see if any values in the vector meet certain criteria and if all values in the vector meet certain criteria.

x
 [1] 14 25 26 27  5 27 28  9 29 35  8 26  7 42  9 19 36 14 17 43
any(x > 45)
[1] FALSE
any(x < 10)
[1] TRUE
all(x <= 45)
[1] TRUE
all(x < 10)
[1] FALSE

5.9 Getting Help

While I have thrown a lot of information at you in this lecture, know that R provides help and resources for all functions. So, if we are ever confused about a specific function or do not know what parameters we can pass into the function, then we can always use a question mark to search for the documentation. That is if we are curious about the mean() function then we can type ?mean.

Using two question marks will search the database for the phrase if we are unsure what the function is called. For instance, we could type ?? “Standard Deviation” if we are unsure of the name of the function. Do not worry if you cannot remember everything, as you use it more and more it will become second nature. Even I have to regularly look at the R Documentation to see examples and to see what the options are for each function.