5  More R

So far we have seen the basics of R, including how to use R as a calculator and how to create variables and vectors. As a quick reminder, we can create a vector containing multiple elements with the c() function (which stands for combine). After creating a vector, we can save it under a variable name with the assignment operator, <-, which points from the value toward the name of the variable. An example of this can be seen below:

x <- c(1,2,3,4,5) # Combining multiple elements into a single vector 
x
[1] 1 2 3 4 5
c("Hello", "world!") -> y # This also works but is not recommended
y
[1] "Hello"  "world!"

Now that we’ve covered the basics of R, including simple calculations, variables, and vectors, it’s time to dive a little deeper. In this section, we will explore more of R’s built-in functionality that makes data manipulation efficient and expressive. You’ll learn how to generate sequences, replicate values, name vector elements, and search using patterns. We’ll also explore special functions that help us handle data more effectively. These tools will become essential as we begin working with larger and more complex datasets.

  • Generate numeric and character vectors using functions like seq(), rep(), and sample().
  • Use indexing techniques to select specific elements from a vector.
  • Apply pattern matching with grep() and understand how regular expressions work within R.
  • Use special functions like which(), duplicated(), unique(), any(), and all() to summarize or filter vector data.

5.1 Creating a Sequence of Numbers

There are additional ways to make a vector as well. If we want our numbers to follow a sequence, we can use the seq() function. With this function, we specify the starting value (from) and the ending value (to). We can also pass an argument (by) that specifies the increment; if we leave it out, the sequence moves in steps of 1 (counting down when the starting value is larger than the ending value, as in seq(7, 1) below). A negative increment lets us count down by other amounts. Finally, most functions do not require us to write the argument names as long as we pass the arguments in the correct order (but we should probably keep writing them until we are more comfortable).

seq(from=1, to=10)
 [1]  1  2  3  4  5  6  7  8  9 10
seq(from=1, to=10, by=2)
[1] 1 3 5 7 9
seq(7, 1)
[1] 7 6 5 4 3 2 1
seq(from=30, to=1, by=-4)
[1] 30 26 22 18 14 10  6  2
seq(-1, 2, by=0.3)
 [1] -1.0 -0.7 -0.4 -0.1  0.2  0.5  0.8  1.1  1.4  1.7  2.0
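To make the point about argument names concrete, here is a small sketch (the variable names named, reordered, and positional are ours, chosen only for illustration). It shows that named arguments may be given in any order, while unnamed arguments are matched by position as from, to, by.

```r
# Named arguments: the order does not matter
named     <- seq(from = 1, to = 10, by = 2)
reordered <- seq(by = 2, to = 10, from = 1)

# Unnamed arguments: matched by position (from, to, by)
positional <- seq(1, 10, 2)

named
# [1] 1 3 5 7 9
identical(named, reordered)   # TRUE
identical(named, positional)  # TRUE
```

Writing the names costs a few extra keystrokes but protects us from accidentally swapping two values.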

If we only need to step by 1, we can do something similar with a colon. This creates a sequence by adding 1 to the starting value (or subtracting 1, when the starting value is larger) until it reaches the ending value without passing it. An example of this can be seen below:

1:10
 [1]  1  2  3  4  5  6  7  8  9 10
7:-2
 [1]  7  6  5  4  3  2  1  0 -1 -2
0.25:7.75 # Notice how it does not go past the ending value
[1] 0.25 1.25 2.25 3.25 4.25 5.25 6.25 7.25

Try it Out

Emmit is interested in displaying all possible GPA values rounded to 1 decimal place. How can he do this if the GPA must be between 0 and 4.0?

5.2 Replicating Values

We can replicate values (of any type) using the rep() function. This will allow us to pass a vector into the function and have it be replicated a certain number of times (and the number of times could also be a vector!). If we pass a vector in for the number of times then it will match it element by element and replicate it the specified number of times before going on to the next element.

rep(2, times=6)
[1] 2 2 2 2 2 2
rep("abc", times=3)
[1] "abc" "abc" "abc"
rep(1:4, times=4:1) # 1st element 4 times, 2nd element 3 times, etc.
 [1] 1 1 1 1 2 2 2 3 3 4
rep(c(7,2,1), times=c(1,4,8))
 [1] 7 2 2 2 2 1 1 1 1 1 1 1 1

The function accepts other arguments as well. The each argument replicates each element a certain number of times before moving on to the next element, while the length.out argument keeps repeating the sequence until the output reaches a certain length. When both are given, each is applied first and the result is then extended or cut to length.out.

rep(c(7,2,1), times=2)
[1] 7 2 1 7 2 1
rep(c(7,2,1), each=2)
[1] 7 7 2 2 1 1
rep(c(7,2,1), length.out = 8)
[1] 7 2 1 7 2 1 7 2
rep(c(7,2,1), each=3, length.out=10)
 [1] 7 7 7 2 2 2 1 1 1 7
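The each and times arguments can also be combined. The short sketch below (pattern is just an illustrative variable name) shows that each is applied first and the resulting block is then repeated times times.

```r
# Each element twice ("a" "a" "b" "b"), then that whole block three times
pattern <- rep(c("a", "b"), each = 2, times = 3)
pattern
#  [1] "a" "a" "b" "b" "a" "a" "b" "b" "a" "a" "b" "b"

length(pattern)  # 2 elements * 2 each * 3 times = 12
```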

Try it Out

Emmit surveys his friends regarding their favorite fruit, and determines that 7 of them like apples, 4 like bananas, 3 like oranges, and 1 likes pineapple. How can he display this information using the rep() function?

5.3 Creating a Vector of Letters

As a quick reminder (since it is very important), R starts indexing at 1. This means that the first element in a vector is at index 1. Other languages, like Python, start counting at 0, but we will start counting at 1 in R. In R there is a vector called ‘letters’ that contains all of the lower-case letters. If we want to find out what the 4th letter is then we can use our “index-selection” brackets. We can also pass a vector into the index-selection brackets, which includes sequences and replicated vectors (as long as they are numeric). There is also a vector in R called ‘LETTERS’ which acts the same way but contains all of the upper-case letters.

x <- letters # Saving the vector to 'x' for simplicity
x
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
x[c(13, 15, 21, 14, 20)]
[1] "m" "o" "u" "n" "t"
x[17:22]
[1] "q" "r" "s" "t" "u" "v"
LETTERS[c(8, 5, 12, 12, 15)]
[1] "H" "E" "L" "L" "O"

If we ever wish to have all of the values displayed except for certain ones then we can put a negative in front of the value/vector and it will display everything except those values. This is helpful when it is easier to exclude certain indices instead of having to specify all of the desired indices.

x[seq(1,26,by=2)] # Every other letter
 [1] "a" "c" "e" "g" "i" "k" "m" "o" "q" "s" "u" "w" "y"
x[-seq(1,26,by=2)] # Everything but these indices
 [1] "b" "d" "f" "h" "j" "l" "n" "p" "r" "t" "v" "x" "z"
y <- 1:10
y
 [1]  1  2  3  4  5  6  7  8  9 10
y[c(1,5,8)]
[1] 1 5 8
y[-c(1,5,8)]
[1]  2  3  4  6  7  9 10
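As a small sketch of negative indexing in practice (trimmed and mixed are illustrative names of our own), the code below drops the first and last elements of a vector. It also shows one thing to watch for: R does not allow positive and negative indices in the same set of brackets, so the second call produces an error, which we capture here with try() so the script keeps running.

```r
y <- 1:10

# Drop the first and last elements; length(y) gives the last index
trimmed <- y[-c(1, length(y))]
trimmed
# [1] 2 3 4 5 6 7 8 9

# Mixing negative and positive indices is an error
mixed <- try(y[c(-1, 2)], silent = TRUE)
inherits(mixed, "try-error")  # TRUE
```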

Try it Out

Emmit is making a display for his smoothie stand and wants it to say “EMMIT smoothies”. Help him do this in R.

5.4 Named Vectors

Since we have been discussing vectors and their elements, we should mention that we can name the individual elements as well (this might make it easier to refer back to a specific element, as we might not remember which index it is). There are a few different ways we can do this. The first is to name the elements when we create the vector itself: we put the name of the element to the left of the element with an equals sign in between. It may look something like this:

x <- c(M="Monday", W="Wednesday", F="Friday")
x
          M           W           F 
   "Monday" "Wednesday"    "Friday" 
names(x)
[1] "M" "W" "F"

Another way that might be beneficial is to name them after the vector has been created using the names() function. We will reference the names function with the vector inside and we will assign a character vector to it. This will update the names of the vector elements as whatever we passed into it. An example of this can also be seen below:

x <- c("Monday", "Wednesday", "Friday")
names(x)
NULL
names(x) <- c("M", "W", "F")
x
          M           W           F 
   "Monday" "Wednesday"    "Friday" 
names(x)
[1] "M" "W" "F"
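Because names(x) is itself a character vector, we can also change a single name by combining it with index selection. The sketch below uses a separate vector called days (our own name, so it does not overwrite x) to show the idea.

```r
days <- c("Monday", "Wednesday", "Friday")
names(days) <- c("M", "W", "F")

# Replace only the second name; the values are untouched
names(days)[2] <- "Wed"
names(days)
# [1] "M"   "Wed" "F"
```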

Because the elements are named, we can pass the names into our index-selection brackets and R will output the element associated with that particular name. We will also still be able to access them with the index value as well.

x["M"]
       M 
"Monday" 
x[c("M", "F")]
       M        F 
"Monday" "Friday" 
x[c(1,3)]
       M        F 
"Monday" "Friday" 

Try it Out

Emmit has been operating the smoothie stand for a few weeks now and wants to calculate how many smoothies of each flavor he has sold. Help him create a named vector if he has sold 42 Strawberry, 37 Mango, and 28 Pineapple smoothies.

5.5 Index Selection using GREP

One very powerful tool in R for searching text for a specific pattern is the grep() function. The name comes from “globally search for a regular expression and print”, and the function is important to us because it identifies all of the elements of a vector that contain a specific pattern. To see this function in action we will use the “euro” vector, which is a named vector available to us in R.

euro
        ATS         BEF         DEM         ESP         FIM         FRF 
  13.760300   40.339900    1.955830  166.386000    5.945730    6.559570 
        IEP         ITL         LUF         NLG         PTE 
   0.787564 1936.270000   40.339900    2.203710  200.482000 

Before we jump into using the function, we will want to discuss some of the syntax that grep() uses. The first is that it expects us to pass a pattern into it using quotation marks. If we type a ^ at the beginning of the pattern then it will search for strings starting with the pattern. If we type a $ at the end of the pattern then it will search for strings ending with the pattern. A single period will stand for any character, and characters in brackets will mean “any of these characters”. After we specify the pattern we also need to specify where we are looking for the pattern, and in our case, it will be the names of the euro vector. The output of this function will be the indices of the elements which contain the specified pattern.

names(euro)
 [1] "ATS" "BEF" "DEM" "ESP" "FIM" "FRF" "IEP" "ITL" "LUF" "NLG" "PTE"
grep("E", names(euro)) # Indices of elements containing an E anywhere
[1]  2  3  4  7 11
euro[grep("E", names(euro))] # Names containing an E anywhere
       BEF        DEM        ESP        IEP        PTE 
 40.339900   1.955830 166.386000   0.787564 200.482000 
grep("^I", names(euro)) # Indices of elements starting with I
[1] 7 8
euro[grep("^I", names(euro))] # Names starting with I
        IEP         ITL 
   0.787564 1936.270000 
grep("F$", names(euro)) # Indices of elements ending with F
[1] 2 6 9
euro[grep("F$", names(euro))] # Names ending with F
     BEF      FRF      LUF 
40.33990  6.55957 40.33990 
grep(".E.", names(euro)) # Indices of elements containing _E_
[1] 2 3 7
euro[grep(".E.", names(euro))] # Names containing _E_
      BEF       DEM       IEP 
40.339900  1.955830  0.787564 
grep(".[EI].", names(euro)) # Indices of elements containing _E_ or _I_
[1] 2 3 5 7
euro[grep(".[EI].", names(euro))] # Names containing _E_ or _I_
      BEF       DEM       FIM       IEP 
40.339900  1.955830  5.945730  0.787564 
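Two related tools are worth a quick sketch here, as they save a step in the examples above: the value = TRUE argument of grep(), which returns the matching strings themselves instead of their indices, and the grepl() function, which returns a logical vector with one TRUE or FALSE per element. The variable name currency is ours, used only for illustration.

```r
currency <- names(euro)

# value = TRUE returns the matching names rather than their positions
grep("^I", currency, value = TRUE)
# [1] "IEP" "ITL"

# grepl() returns TRUE/FALSE for every element
starts_with_i <- grepl("^I", currency)
sum(starts_with_i)  # 2 names start with I

# Indexing with the logical vector gives the same result as indexing with grep()
identical(euro[starts_with_i], euro[grep("^I", currency)])  # TRUE
```

Which form to use is mostly a matter of taste; the logical version is handy when we want to count matches or combine several conditions.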

While this function may seem a little complicated at first, it is a very powerful tool that will allow us to filter out observations that meet certain criteria. One example might be searching a vector for all observations with the last name “Smith” or for people whose first name starts with the letters “Ca”.

Try it Out

Emmit’s smoothie stand has the following flavors. How can he display the flavors that start with Berry, the flavors that end with Crush, the flavors that have two consecutive vowels, and the flavors containing a “t” followed by another letter?

flavors <- c("Mango Madness", "Berry Blast", "Peach Punch", "Pineapple Punch", 
             "Acai Antioxidant", "Tropical Berry Twist", "Citrus Crush", 
             "Berry Goodness", "Chocolate Crush", "Vanilla Velvet")

5.6 Logical Vectors and Index Selection

We briefly saw it in the previous lecture, but it is important for us to practice accessing vector elements using logical vectors. Whenever a comparison operator (<, <=, >, >=, ==, !=) is used to compare vectors, the resulting output will be a logical vector. This will always be the case! It is good practice to think about what the outputted results will be before we even run the code instead of just “hoping for the best”.

We can access the vector elements using logical vectors in two ways: explicitly and implicitly. Doing it explicitly means we save the logical vector to a new variable and then use the new variable in our index-selection brackets, while doing it implicitly means we place the logical comparison into the index-selection brackets directly. I prefer the implicit method, as we do not have to re-create the logical vector if the original vector changes. Both examples can be seen below:

x <- c(4,8,2,6,7,6,3,8,6)
y <- x > 6

y
[1] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE
x[y] # Explicit creation
[1] 8 7 8
x[x>6] # Implicit creation
[1] 8 7 8
x
[1] 4 8 2 6 7 6 3 8 6
x >= 7
[1] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE
length(x >= 7) # The logical vector has one entry per element of x, so 9
[1] 9
sum(x >= 7) # Adding the logical vector: TRUE = 1, FALSE = 0
[1] 3
x[x >= 7] # Displaying just the values whose index is TRUE
[1] 8 7 8
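Since TRUE counts as 1 and FALSE as 0, the same trick works with mean(), which then gives the proportion of elements that meet the condition. The following is a small sketch using the same x as above; count_big and share_big are illustrative names.

```r
x <- c(4, 8, 2, 6, 7, 6, 3, 8, 6)

count_big <- sum(x >= 7)   # how many elements are 7 or more
share_big <- mean(x >= 7)  # what fraction of the elements are 7 or more

count_big
# [1] 3
all.equal(share_big, count_big / length(x))  # TRUE: 3 out of 9
```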

We can expand our capabilities with logical operators and introduce a few new operators which will allow us to evaluate multiple conditions at the same time. These include & (which represents and), | (which represents or and is the pipe symbol, found above the enter key), and ! (which represents not). The and requires both conditions to be true for the output to be true, the or only requires one condition to be true for the output to be true, and the not “flips” the final output to be the opposite.

x <- 1:11
x
 [1]  1  2  3  4  5  6  7  8  9 10 11
(x >5) & (x < 10) # Is the element greater than 5 and less than 10
 [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
x[(x >5) & (x < 10)] # Displaying the elements that are TRUE
[1] 6 7 8 9
(x < 5) | (x > 9) # Is the element less than 5 or greater than 9
 [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
x[(x < 5) | (x > 9)] # Displaying the elements that are TRUE
[1]  1  2  3  4 10 11
!(x > 6) # Is the element not greater than 6
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
x[!(x > 6)] # Displaying the elements that are TRUE
[1] 1 2 3 4 5 6

It is important to play around with the logical operators to get comfortable with filtering out elements that meet certain criteria. If you have multiple conditions you may need to use parentheses to make R do what you wish, as it performs the and operation before the or operation.

x[(x > 5 | x < 3) & x < 4] 
[1] 1 2
x[x > 5 | x < 3 & x < 4]
[1]  1  2  6  7  8  9 10 11
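To see exactly how R reads the second expression above, the sketch below adds the parentheses that R applies implicitly and checks that the two versions agree. The names no_parens and explicit are ours.

```r
x <- 1:11

# Without parentheses, & is evaluated before |
no_parens <- x > 5 | x < 3 & x < 4

# The same expression with the implied grouping written out
explicit <- (x > 5) | ((x < 3) & (x < 4))

identical(no_parens, explicit)  # TRUE
x[explicit]
# [1]  1  2  6  7  8  9 10 11
```

When in doubt, writing the parentheses out is harmless and makes the intent clear to anyone reading the code.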

Try it Out

Emmit has calculated the number of smoothies he has sold for each of the past 10 days. On how many days did he sell fewer than 7.5 smoothies? On how many days did he sell either fewer than 5 or more than 15 smoothies?

sales <- c(12, 5, 9, 15, 7, 3, 10, 20, 7, 6)

5.7 Sample Function

There are a few special functions in R that will be used throughout the course. The first is the sample() function, which randomly samples values from a vector that we pass into it. We choose how many values we want returned and, through the replace argument, whether a value may be drawn more than once (replace must be TRUE if we want more values than we passed in). The exact values that are returned are not predictable, as they rely on a random number generator behind the scenes. If we wish to get the same values every time we run the code, we need to call the set.seed() function first. An example of using the sample() function can be seen below:

sample(1:10, 5, replace=TRUE)
[1] 2 8 2 9 6
sample(1:10, 5, replace=TRUE)
[1]  6 10  8  4  9
sample(1:10, 5, replace=TRUE)
[1] 2 9 4 1 7
sample(1:10, 15, replace=FALSE)

Error in sample.int(length(x), size, replace, prob) : cannot take a sample larger than the population when 'replace = FALSE'

sample(1:10, 15, replace=TRUE)
 [1]  6 10  9  6  4  8 10  5  4  7  2  7  8  7  8
set.seed(123)
sample(c("A", "B", "C"), 10, replace=TRUE)
 [1] "C" "C" "C" "B" "C" "B" "B" "B" "C" "A"
sample(c("A", "B", "C"), 10, replace=TRUE)
 [1] "B" "B" "A" "B" "C" "A" "C" "C" "A" "A"
set.seed(123) # This will result in the same thing as above
sample(c("A", "B", "C"), 10, replace=TRUE)
 [1] "C" "C" "C" "B" "C" "B" "B" "B" "C" "A"
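The sketch below checks the reproducibility claim directly, and also shows what happens when we leave out the number of values: sample() then returns all of the elements in a random order. The seed 42 is arbitrary, and first, second, and shuffled are illustrative names; we compare results with identical() rather than printing them, since the exact draws depend on the random number generator.

```r
set.seed(42)                 # arbitrary seed chosen for illustration
first <- sample(1:10, 5)     # replace defaults to FALSE

set.seed(42)                 # resetting the seed replays the same draws
second <- sample(1:10, 5)

identical(first, second)     # TRUE

# With no size given, sample() shuffles the whole vector
shuffled <- sample(1:10)
sort(shuffled)
#  [1]  1  2  3  4  5  6  7  8  9 10
```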

Try it Out

Emmit wants to run a promotion where he randomly selects smoothie flavors to feature each day. Help him write R code which randomly picks 3 flavors to promote each day.

flavors <- c("Mango Madness", "Berry Blast", "Peach Punch", "Pineapple Punch", 
             "Acai Antioxidant", "Tropical Berry Twist", "Citrus Crush", 
             "Berry Goodness", "Chocolate Crush", "Vanilla Velvet")

5.8 Special Functions in R

Another special function that may come in handy is the which() function. What this will do is tell us the index values of the elements which meet certain criteria. Note in the sample below that it is telling us the 6th, 12th, 13th, and 20th elements in the vector are greater than 40 (it is not telling us the values greater than 40, just the indices of the elements):

x <- sample(1:50, 20, replace=TRUE)
x
 [1]  4 39  1 34 23 43 14 18 33 21 21 42 46 10  7  9 15 21 37 41
which(x > 40) # Indices with values greater than 40
[1]  6 12 13 20
x[which(x > 40)] # Values greater than 40
[1] 43 42 46 41
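As a quick sketch of how which() relates to the logical indexing from Section 5.6, the code below re-creates the same values of x by hand (so the result does not depend on the random draw) and compares the two approaches. It also shows what which() returns when nothing matches.

```r
x <- c(4, 39, 1, 34, 23, 43, 14, 18, 33, 21,
       21, 42, 46, 10, 7, 9, 15, 21, 37, 41)

# Indexing by position and indexing by logical vector select the same values
identical(x[which(x > 40)], x[x > 40])  # TRUE

# When no element meets the condition, which() returns an empty vector
none <- which(x > 100)
length(none)
# [1] 0
```

The indices themselves are most useful when the position carries meaning, such as the day on which a value was recorded.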

Other functions which may be of use to us are the duplicated() function and the unique() function. The first function, duplicated(), returns a logical vector that is FALSE the first time a value appears and TRUE every time that value appears again. The unique() function returns just the distinct values of the vector, keeping the first occurrence of each value and dropping the repeats.

x
 [1]  4 39  1 34 23 43 14 18 33 21 21 42 46 10  7  9 15 21 37 41
duplicated(x)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
[13] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
unique(x)
 [1]  4 39  1 34 23 43 14 18 33 21 42 46 10  7  9 15 37 41
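The two functions are closely connected, which the following sketch illustrates with the same values of x typed in by hand: keeping only the elements that are not flagged by duplicated() gives the same result as unique().

```r
x <- c(4, 39, 1, 34, 23, 43, 14, 18, 33, 21,
       21, 42, 46, 10, 7, 9, 15, 21, 37, 41)

# Dropping the flagged repeats reproduces unique(x)
identical(x[!duplicated(x)], unique(x))  # TRUE

# The number of repeats is the difference in lengths
sum(duplicated(x))
# [1] 2
length(x) - length(unique(x))
# [1] 2
```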

Two other functions that return a single logical value are the any() and all() functions. They do what their names suggest: any() checks whether at least one value in the vector meets a certain criterion, and all() checks whether every value in the vector meets it.

x
 [1]  4 39  1 34 23 43 14 18 33 21 21 42 46 10  7  9 15 21 37 41
any(x > 45)
[1] TRUE
any(x < 10)
[1] TRUE
all(x <= 45)
[1] FALSE
all(x < 10)
[1] FALSE
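One way to build intuition for any() and all() is to relate them to counting with sum(), as in the sketch below (again with x typed in by hand; the variable names are illustrative). It also shows that "some value is above 45" is exactly the opposite of "all values are 45 or less".

```r
x <- c(4, 39, 1, 34, 23, 43, 14, 18, 33, 21,
       21, 42, 46, 10, 7, 9, 15, 21, 37, 41)

some_above <- any(x > 45)
all_at_most <- all(x <= 45)

# any() is TRUE when at least one comparison is TRUE
some_above == (sum(x > 45) > 0)             # TRUE

# all() is TRUE only when every comparison is TRUE
all_at_most == (sum(x <= 45) == length(x))  # TRUE

# The two statements are opposites of each other
some_above == (!all_at_most)                # TRUE
```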

Try it Out

Emmit has calculated the number of smoothies he has sold for each of the past 10 days. Using the functions in this section determine which days had sales greater than or equal to 15 and if the sales on a day matched a previous day’s total.

sales <- c(12, 5, 9, 15, 7, 3, 10, 20, 7, 6)

5.9 Getting Help

While I have thrown a lot of information at you in this lecture, know that R provides help and resources for all functions. So, if we are ever confused about a specific function or do not know what arguments we can pass into it, we can always use a question mark to search for the documentation. That is, if we are curious about the mean() function then we can type ?mean.

Using two question marks will search the help system for a phrase when we are unsure what the function is called. For instance, we could type ??"standard deviation" if we do not know the name of the function. Do not worry if you cannot remember everything; as you use R more and more it will become second nature. Even I have to regularly look at the R Documentation to see examples and to see what the options are for each function.
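The question-mark shortcuts have function equivalents, help() and help.search(), and we can also inspect a function's arguments directly without opening the help page. The sketch below wraps the help calls in interactive() so that they only open a help window when we are working at the console; arg_names is an illustrative variable name.

```r
# ?mean is shorthand for help("mean");
# ??"standard deviation" is shorthand for help.search("standard deviation")
if (interactive()) {
  help("mean")
  help.search("standard deviation")
}

# List the argument names a function accepts
arg_names <- names(formals(sample))
arg_names
# [1] "x"       "size"    "replace" "prob"
```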

  • Create sequences in R to do the following:

    1. Create a sequence of numbers from 1 to 20 incrementing by 3 using the seq() function.
    2. Using the rep() function, create a vector that contains the elements 5, 10, and 15, where 5 is repeated 3 times, 10 is repeated 2 times, and 15 is repeated 4 times.
    3. Using the built-in letters vector, extract and display the letters at the 3rd, 6th, 9th, and 12th positions.
    4. Name the vector c(50, 100, 200) with the names "Small", "Medium", and "Large", then access the element named "Medium".
  • Create a vector states with values: “California”, “Colorado”, “Connecticut”, “Delaware”, “Florida”, “Georgia”, “Hawaii”, “Idaho”, “Illinois”, “Indiana”

    1. Use grep() to find all states that start with the letter "C" and display those states.
    2. Use grep() to find all states that end with the letter "a" and display those states.
    3. Use grep() to find all states that contain the letters "_n" (with _ being any letter) or "or".
    4. Use logical operators to display all states that contain either "Florida" or "Georgia".
  • Use the sample() function to randomly select 7 values from the numbers 1 through 20 without replacement.

  • Create a vector temperature with the values 72, 75, 78, 80, 77, 74, 91, 84, 85, 93, 80.

    1. Use the which() function to find the indices where the daily temperature is equal to 80.
    2. Use the duplicated() function to identify which days’ temperatures are duplicates.
    3. Use the unique() function to get the distinct daily temperature values.
    4. Use the any() function to check if any days have temperatures below 80.
    5. Use the all() function to check if all days have temperatures 80 or less.