As we go further in our journey to becoming Data Scientists, we will come across some special values which may occasionally cause a headache or two. In this lecture, we aim to give you a crash course on some of the special values you may encounter. This includes values you may see after running functions, along with working with dates and times. Finally, we will conclude this lecture by looking at factors and how we can use them to store and work with categorical data more efficiently.
Identify special values (NA, NaN, and \(\pm\) Inf) using is.na(), is.nan(), and is.infinite(), and explain how they affect calculations.
Convert strings to dates and times using strptime(), as.Date(), as.POSIXct(), and as.POSIXlt(), and recognize the differences between these storage formats.
Use difftime() to measure differences in different units.
As we work in R, we may occasionally come across a few different special values. These include missing values (NA), values which are Not a Number (NaN), and infinite values (\(\pm\) Inf). It is important to understand why each value occurs, because these are often the source of confusing results in our code.
The first special value we will look at is NA. This represents a missing value, and it can be identified using the is.na() function. Note that is.na() will also identify NaN values. If we try to do math on a vector which contains a missing value, then the output will also be missing. We can correct for this by using the argument na.rm=TRUE to specify that we want NA values removed before running the calculations.
x <- c(1,2,3,NA)
x
[1] 1 2 3 NA
is.na(x)
[1] FALSE FALSE FALSE TRUE
sum(is.na(x))
[1] 1
mean(x)
[1] NA
mean(x, na.rm=TRUE)
[1] 2
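One consequence of the behavior above is that comparing against NA with == does not locate missing values, because any comparison involving NA is itself NA. The following is a minimal sketch using a hypothetical vector called scores (it is not part of the lecture data), and it also shows that na.rm=TRUE works with summary functions other than mean().

```r
# Hypothetical example vector (not from the lecture data)
scores <- c(90, NA, 75, 60)

# Comparing with == gives NA for every element, so it cannot find missing values
eq_result <- scores == NA        # NA NA NA NA

# is.na() is the reliable way to locate missing values
na_flags <- is.na(scores)        # FALSE TRUE FALSE FALSE

# na.rm = TRUE is also accepted by other summary functions
total   <- sum(scores, na.rm = TRUE)   # 225
highest <- max(scores, na.rm = TRUE)   # 90
```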
The next special values that we might encounter are NaN and Inf. NaN is produced when the math is not possible with real numbers (such as taking the square root of a negative number), and Inf is produced when the result is too large to store in the computer. If we divide by 0, we get NaN (for 0/0) or \(\pm\) Inf (for a nonzero number divided by 0) because R follows the IEEE floating-point standard, even though mathematically we know the expression is undefined. So, do not worry about harming the USS Yorktown if you accidentally divide by zero. We can use the is.nan() and is.infinite() functions to identify these values.
0/0
[1] NaN
(-1)^0.5
[1] NaN
2^478385
[1] Inf
5/0
[1] Inf
-5/0
[1] -Inf
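To see the identification functions side by side, here is a small sketch on a hypothetical vector named vals. It also uses is.finite(), which is not discussed above but is part of base R and is TRUE only for ordinary numbers.

```r
# Hypothetical vector mixing each kind of special value with one ordinary number
vals <- c(0/0, 5/0, -5/0, 1, NA)

nan_flags <- is.nan(vals)       # TRUE  FALSE FALSE FALSE FALSE
inf_flags <- is.infinite(vals)  # FALSE TRUE  TRUE  FALSE FALSE
fin_flags <- is.finite(vals)    # FALSE FALSE FALSE TRUE  FALSE
```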
We can now put all of our skills together and look at an example containing all of the special values we have discussed so far. To begin, we use the sample() function to procure a random sample of values. You will get a different sample than me if you do not set the seed to the same value. Notice how we can incorporate the which() function to determine the indices of the elements that meet each condition. Also, notice that the is.na() function flags both NA and NaN values, while the is.nan() function only identifies NaN values.
set.seed(8675309)
y <- sample(c(5, NA, NaN, -Inf, Inf), 13, replace=TRUE)
y
 [1] -Inf NaN NA NaN Inf NaN 5 Inf NaN 5 -Inf Inf NA
table(y, useNA = "ifany")
y
-Inf    5  Inf  NaN <NA>
   2    2    3    4    2
sum(is.na(y))
[1] 6
which(is.na(y))
[1] 2 3 4 6 9 13
sum(is.nan(y))
[1] 4
which(is.nan(y))
[1] 2 4 6 9
sum(is.infinite(y))
[1] 5
which(is.infinite(y))
[1] 1 5 8 11 12
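One thing worth checking after an example like this is what happens to a summary when infinite values remain. The sketch below uses a hypothetical vector v (not the sampled y above) and assumes we want an average of only the ordinary numbers; na.rm=TRUE drops NA and NaN, but it leaves Inf and -Inf in place.

```r
# Hypothetical vector (an assumption for illustration, not the sampled y above)
v <- c(5, NA, NaN, -Inf, Inf, 7)

# na.rm = TRUE removes NA and NaN, but Inf and -Inf remain and cancel into NaN
mean_na_rm <- mean(v, na.rm = TRUE)     # NaN

# Keeping only finite values gives the mean of the ordinary numbers
mean_finite <- mean(v[is.finite(v)])    # 6
```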
Emmit is cleaning up some data from a sensor that tracks how many steps he takes each day. When he prints the vector, he notices that some values are missing, some values are not real numbers, and some values look infinite. Using the vector below, teach Emmit how he can use R to count how many missing and infinite values are present and report their indices. Then show him how he can compute his mean number of steps with the missing values removed.
steps <- c(6200, 7500, NA, 5800, NaN, 8100, Inf, -Inf, 6600)
Dealing with dates and times in R can sometimes be a tricky task because there are multiple different representations they can take. For instance, some formats store dates as the number of days since January 1st, 1970, while others store date/time as the number of seconds since January 1st, 1970. Still others store the date/time as a list describing the seconds, minutes, hours, month, year, etc. Therefore, this section aims to give you insight into how we can manage dates and times in R.
Let’s first take a look at the following example. Here, I have a date saved as a character string. I can then use the strptime() function (which stands for string parse time) to convert the string into a time object. We pass the string to the function and then specify the format that the string follows. For example, you must specify that the month comes first, then the day, and then the year. To do this, you need to use formatting abbreviations (which can be found in the table directly beneath this code), and the format must match the string exactly. If it does not match, an NA value will be returned.
tuesday_class <- "January 27th, 2026"
strptime(tuesday_class, "%B %dth, %Y", tz = "EST")
[1] "2026-01-27 EST"
tuesday_class <- "Jan. 27, 26"
strptime(tuesday_class, "%B %dth, %Y", tz = "EST") # Does not match
[1] NA
strptime(tuesday_class, "%b. %d, %y", tz = "EST")
[1] "2026-01-27 EST"
| Symbol | Meaning | Example |
|---|---|---|
| %d | day as a number | 1-31 |
| %a | abbreviated weekday | Mon |
| %A | unabbreviated weekday | Monday |
| %m | month as a number | 1-12 |
| %b | abbreviated month | Jan |
| %B | unabbreviated month | January |
| %y | 2-digit year | 26 |
| %Y | 4-digit year | 2026 |
| %S | seconds as a number | 0-59 |
| %M | minutes as a number | 0-59 |
| %I | hours from 12-hour clock | 1-12 |
| %H | hours from 24-hour clock | 0-23 |
| %p | am/pm indicator | AM/PM |
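The time-related symbols in the table can be combined with the date symbols in the same format string. Below is a minimal sketch with a hypothetical string; it assumes an English or C locale (so that "PM" is recognized) and uses format(), a base R function not covered above, to print the parsed value on a 24-hour clock.

```r
# Hypothetical date/time string written with a 12-hour clock
meeting <- strptime("01/27/26 02:15 PM", "%m/%d/%y %I:%M %p", tz = "UTC")

# Print the same moment using the 4-digit year and 24-hour symbols
meeting_24h <- format(meeting, "%Y-%m-%d %H:%M")   # "2026-01-27 14:15"

# A format that does not match the string returns NA
bad <- strptime("27-01-2026", "%m/%d/%y", tz = "UTC")
```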
The first date/time type that we will discuss is the Date type. This type does not include any time component, and instead only stores the calendar date. It stores the date as the number of days since January 1st, 1970. This type is helpful when the difference between dates should be measured in days, since it does not include increments smaller than a day. To create a Date object, we can use the as.Date() function as shown below. Using the unclass() function, we can see how the data is stored. For example, January 27th, 2026 is stored as 20480 days since January 1st, 1970.
tuesday_class <- "January 27th, 2026"
tuesday_class_date <- as.Date(tuesday_class, "%B %dth, %Y")
unclass(tuesday_class_date)
[1] 20480
Notice how no aspect of the time is stored using this method.
superbowl <- "February 8th, 2026 at 6:30 PM"
strptime(superbowl, "%B %dth, %Y at %I:%M %p", tz = "EST")
[1] "2026-02-08 18:30:00 EST"
sb_date <- as.Date(superbowl, "%B %dth, %Y at %I:%M %p", tz = "EST")
sb_date
[1] "2026-02-08"
class(sb_date)
[1] "Date"
unclass(sb_date)
[1] 20492
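Because a Date is just a count of days, adding a number to it moves it forward by that many days. A small sketch, using ISO-style strings (which as.Date() reads without a format argument) and hypothetical variable names:

```r
# ISO-style strings ("YYYY-MM-DD") can be parsed without a format string
class_day <- as.Date("2026-01-27")
day_count <- as.numeric(class_day)      # 20480 days since January 1st, 1970

# Adding a number shifts the date by that many days
next_week <- class_day + 7
next_week_text <- format(next_week)     # "2026-02-03"

# Day 0 is the reference date itself
epoch_text <- format(as.Date(0, origin = "1970-01-01"))   # "1970-01-01"
```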
Another way we may wish to store the date/time is as the number of seconds since January 1st, 1970. This is beneficial if we are interested in calculating precise differences between dates and times. To do this, we use the as.POSIXct() function (the "ct" can be thought of as calendar time or continuous time). Using the class() function, we can see that the variable is no longer stored as a Date, but instead as a POSIX time object (note: POSIX stands for the Portable Operating System Interface, which is a standard set by the IEEE). Notice how the time component is now stored, along with the specified time zone. When we use the unclass() function, we can see that the value is stored as a number of seconds rather than a number of days.
superbowl <- "February 8th, 2026 at 6:30 PM"
sb_ct <- as.POSIXct(superbowl, "%B %dth, %Y at %I:%M %p", tz = "EST")
sb_ct
[1] "2026-02-08 18:30:00 EST"
class(sb_ct)
[1] "POSIXct" "POSIXt"
unclass(sb_ct)
[1] 1770593400
attr(,"tzone")
[1] "EST"
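Since a POSIXct value is a single count of seconds, the time zone only affects how it is displayed, and adding a number shifts it by that many seconds. The sketch below assumes the "EST" zone name is available on your system and uses hypothetical variable names.

```r
# Same kickoff moment as above, written as an ISO-style string
kickoff <- as.POSIXct("2026-02-08 18:30:00", tz = "EST")
kickoff_secs <- as.numeric(kickoff)     # 1770593400

# Displaying the same moment in another time zone does not change the stored seconds
kickoff_utc <- format(kickoff, "%Y-%m-%d %H:%M", tz = "UTC")   # "2026-02-08 23:30"

# Adding 90 minutes, expressed in seconds
later <- kickoff + 90 * 60
later_text <- format(later, "%H:%M")    # "20:00"
```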
The final format we will discuss stores date/time as a list of components corresponding with the given date/time. To do this we will use the as.POSIXlt() function (the "lt" can be thought of as local time or list time). It stores the seconds, minutes, and hours (on a 24-hour clock) along with the day of the month, the month (with January being 0), the year since 1900, and additional useful information. This can all be seen below using the unclass() function. (Note: the output format may vary depending on your system.)
superbowl <- "February 8th, 2026 at 6:30 PM"
sb_lt <- as.POSIXlt(superbowl, "%B %dth, %Y at %I:%M %p", tz = "EST")
sb_lt
[1] "2026-02-08 18:30:00 EST"
class(sb_lt)
[1] "POSIXlt" "POSIXt"
unclass(sb_lt)
$sec
[1] 0
$min
[1] 30
$hour
[1] 18
$mday
[1] 8
$mon
[1] 1
$year
[1] 126
$wday
[1] 0
$yday
[1] 38
$isdst
[1] 0
$zone
[1] "EST"
$gmtoff
[1] NA
attr(,"tzone")
[1] "EST"
attr(,"balanced")
[1] TRUE
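The components listed above can be pulled out one at a time with the $ operator. A short sketch with a hypothetical variable name, showing the adjustments needed for the zero-based month and the years-since-1900 value:

```r
# Same moment as above, parsed from an ISO-style string in UTC
game_lt <- as.POSIXlt("2026-02-08 18:30:00", tz = "UTC")

game_hour  <- game_lt$hour          # 18
game_month <- game_lt$mon + 1       # 2 (months are stored starting at 0)
game_year  <- game_lt$year + 1900   # 2026 (years are stored since 1900)
game_wday  <- game_lt$wday          # 0 (Sunday is 0)
game_yday  <- game_lt$yday          # 38 (days since January 1st, starting at 0)
```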
Emmit is scheduling study sessions for his Data Science class, but he wrote the dates in different formats in his notes. He has the following four entries: “January 27th, 2026”, “Feb. 3, 26”, “Tuesday the 10th of February 2026”, and “02-17-26”. Teach Emmit how he can convert each entry into a date object.
While the three formats above are important, it can be hard to remember them all in practice. A package that makes working with dates and times easier is lubridate. It provides functions like mdy_hm() for strings in Month-Day-Year Hour-Minute form, with similar functions for other common formats (like ymd(), ymd_hms(), dmy_h(), etc.). The functions that parse a time component typically return a POSIXct object (which stores the number of seconds since January 1st, 1970), while the date-only functions such as ymd() return a Date unless a time zone is supplied.
# if it is your first time using it you may need to install it first
# install.packages("lubridate")
library(lubridate)
superbowl <- "February 8th, 2026 at 6:30 PM"
sb_lubridate <- mdy_hm(superbowl, tz="EST")
sb_lubridate
[1] "2026-02-08 18:30:00 EST"
class(sb_lubridate)
[1] "POSIXct" "POSIXt"
unclass(sb_lubridate)
[1] 1770593400
attr(,"tzone")
[1] "EST"
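Once a value has been parsed, lubridate also offers accessor functions for pulling out individual pieces, which can be easier to read than the POSIXlt list. This is a minimal sketch assuming lubridate is installed and its default week start (Sunday) has not been changed; the variable name game is hypothetical.

```r
library(lubridate)

# Parse an ISO-style date/time; ymd_hm() expects year-month-day hour-minute
game <- ymd_hm("2026-02-08 18:30", tz = "UTC")

game_parts <- c(year(game), month(game), day(game), hour(game), minute(game))
# 2026 2 8 18 30

# Day of the week as a number, where Sunday is 1 by default
game_weekday <- wday(game)   # 1
```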
The lubridate library is very powerful, because it often does not require us to specify the exact structure of the string like we needed to do earlier. Below, we show four different ways of writing February 4th, 2026, and as long as we specify that they are in Day-Month-Year form, lubridate will correctly convert each one. We can even pass in strings with different formats into the parse_date_time() function as long as we list the possible formats.
However, we do need to be careful, because some date formats are ambiguous. For example, one of the entries below is interpreted incorrectly because 26-02-04 could represent multiple formats depending on the convention. The last example clears this up by writing the year as 2026 instead of just 26.
dates <- c("4th of February 2026", "4/Feb/26", "04-02-2026", "040226")
dmy(dates)
[1] "2026-02-04" "2026-02-04" "2026-02-04" "2026-02-04"
dates <- c("4th of February 2026", "26-02-04", "04-02-2026")
parse_date_time(dates, c("dmy", "ymd"))
[1] "2026-02-04 UTC" "2004-02-26 UTC" "2026-02-04 UTC"
dates <- c("4th of February 2026", "2026-02-04", "04-02-2026")
parse_date_time(dates, c("dmy", "ymd"))
[1] "2026-02-04 UTC" "2026-02-04 UTC" "2026-02-04 UTC"
Emmit decided that manually typing formatting strings is too annoying, so he wants to use lubridate instead. He has the following entries: “2/04/2026 6:15 PM”, “2026 2-11”, “18th of February 2026 5 PM”, and “Wed. 2/25/2026”. Teach Emmit how he can use the lubridate package to convert these values into date/times.
One of the reasons it is important to talk about the different ways dates and times are stored in R is because we often find ourselves wanting to do “math” with them. For instance, maybe we are interested in seeing how many seconds it took for a function to run, how many days old we are, or the time differences between purchases or incidents at a large company. To do those things, we first need to convert values into an actual date/time type, and then we need to compute differences between those date/time values. Both are shown below.
If we are dealing with dates/times saved as POSIXct (the number of seconds since January 1st, 1970), then we can subtract the two variables and R will return a time difference. R chooses the display units automatically based on the size of the difference; in the example here the two times are more than a day apart, so the result is shown in days (with decimals). The variables need to be compatible date/time types in order to subtract them. Storing time in seconds allows us to be fairly precise in determining time differences. We will also see how to request results in specific units a little later on.
mlk_day <- "January 19th, 2026"
as.POSIXct(mlk_day, "%B %dth, %Y", tz = "EST")
[1] "2026-01-19 EST"
mlk_ct <- as.POSIXct(mlk_day, "%B %dth, %Y", tz = "EST")
unclass(mlk_ct)
[1] 1768798800
attr(,"tzone")
[1] "EST"
valentines_day_dinner <- "February 14th, 2026 at 6:47 PM"
valentines_ct <- as.POSIXct(valentines_day_dinner, "%B %dth, %Y at %I:%M %p", tz = "EST")
unclass(valentines_ct)
[1] 1771112820
attr(,"tzone")
[1] "EST"
valentines_ct - mlk_ct
Time difference of 26.78264 days
(1771112820 - 1768798800)/(60*60*24)
[1] 26.78264
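When two POSIXct values are subtracted, the units of the result depend on how far apart they are, so it is worth checking them with units() before using the number. A small sketch with hypothetical times:

```r
# Hypothetical start and stop times
start      <- as.POSIXct("2026-02-02 09:00", tz = "UTC")
stop_short <- as.POSIXct("2026-02-02 09:45", tz = "UTC")
stop_long  <- as.POSIXct("2026-02-05 09:00", tz = "UTC")

short_gap <- stop_short - start   # Time difference of 45 mins
long_gap  <- stop_long - start    # Time difference of 3 days

short_units <- units(short_gap)   # "mins"
long_units  <- units(long_gap)    # "days"
```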
We can do something similar if our data is saved as a Date (the number of days since January 1st, 1970). This will return differences in whole days only, and it does not allow for any measurement smaller than a day.
mlk_date <- as.Date(mlk_day, "%B %dth, %Y", tz = "EST")
unclass(mlk_date)
[1] 20472
valentines_date <- as.Date(valentines_day_dinner, "%B %dth, %Y at %I:%M %p", tz = "EST")
unclass(valentines_date)
[1] 20498
valentines_date - mlk_date
Time difference of 26 days
(20498 - 20472)
[1] 26
Before discussing another method for calculating differences, it helps to know how to get “right now” and “today” in R. In lubridate, now() gives the current date/time and today() gives the current date. (These results will depend on when and where you run the code.)
now()
[1] "2026-02-01 15:40:34 EST"
today()
[1] "2026-02-01"
If you want the time difference in a specific unit, you can use the difftime() function. This function allows you to pass in two date/times along with the units you want the difference in. If you leave the units out, difftime() picks a suitable unit automatically (days in this example).
difftime(valentines_ct, mlk_ct)
Time difference of 26.78264 days
difftime(valentines_ct, mlk_ct, units="sec")
Time difference of 2314020 secs
difftime(valentines_ct, mlk_ct, units="mins")
Time difference of 38567 mins
difftime(valentines_ct, mlk_ct, units="hours")
Time difference of 642.7833 hours
difftime(valentines_ct, mlk_ct, units="days")
Time difference of 26.78264 days
difftime(valentines_ct, mlk_ct, units="weeks")
Time difference of 3.826091 weeks
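The result of difftime() prints with a label, but it can be turned into a plain number with as.numeric(), optionally converting units along the way. The sketch below recreates the same two moments with hypothetical names (begin and end) in UTC so that it runs on its own.

```r
# Same gap as in the example above, rebuilt from ISO-style strings
begin <- as.POSIXct("2026-01-19 00:00", tz = "UTC")
end   <- as.POSIXct("2026-02-14 18:47", tz = "UTC")

gap <- difftime(end, begin, units = "hours")

# Plain numbers, in the stored units or converted to another unit
gap_hours <- as.numeric(gap)                   # about 642.7833
gap_mins  <- as.numeric(gap, units = "mins")   # about 38567

# The units of an existing difftime can also be changed in place
units(gap) <- "days"
gap_days <- as.numeric(gap)                    # about 26.78264
```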
Emmit is trying to see how long he actually spends working on homework (he says “2-3 hours”, but the data might disagree). He recorded the start time as “February 13th, 2026 at 6:47 PM” and the end time as “February 13th, 2026 at 8:05 PM”. Teach him how he can determine the time he spent using both the POSIXct format and using the difftime() function, while reporting the length of time in both hours and minutes.
If we are dealing with categorical values, then it will often make sense to store the data as a factor. This is because categorical data usually has a relatively small number of possible values, and factors allow R to store those repeated categories efficiently “behind the scenes” by keeping a set of levels and referencing them. We do not need to be too concerned about exactly how the storage works, but we should know how to convert a vector to a factor when we have categorical data with finitely many repeated values.
To convert a vector to a factor, we can use the factor() function. An example is shown below. Notice how the elements are converted from a character vector to a factor. Also notice that it does not make sense to do a logical comparison like < for an unordered factor, since one category is not inherently greater than another (so we get a warning message).
medals <- c("gold", "silver", "bronze", "none")
set.seed(123)
medals_won <- sample(medals, 20, replace=TRUE, prob=c(0.1, 0.2, 0.3, 0.4))
medals_won
 [1] "none" "silver" "bronze" "silver" "gold" "none" "bronze" "silver"
 [9] "bronze" "bronze" "gold" "bronze" "bronze" "bronze" "none" "silver"
[17] "none" "none" "none" "gold"
medals_factor <- factor(medals_won)
medals_factor
 [1] none silver bronze silver gold none bronze silver bronze bronze
[11] gold bronze bronze bronze none silver none none none gold
Levels: bronze gold none silver
medals_factor[1] < medals_factor[2]
Warning in Ops.factor(medals_factor[1], medals_factor[2]): '<' not meaningful
for factors
[1] NA
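To get a feel for the "levels plus references" idea, we can look at the levels and the integer codes that point to them. A minimal sketch on a hypothetical vector of shirt sizes, which also uses table() to count each category:

```r
# Hypothetical categorical data
sizes <- factor(c("small", "large", "small", "medium", "small"))

size_levels <- levels(sizes)       # "large" "medium" "small" (alphabetical)
size_codes  <- as.integer(sizes)   # 3 1 3 2 3 (position of each value in the levels)

# Count how many times each level appears
size_counts <- table(sizes)
```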
In the example above we created a nominal (unordered) factor. We can create an ordinal (ordered) factor by using the argument ordered=TRUE. We should be careful with this, though, because while it creates an ordered factor, the levels may not be in the correct order. By default, R places the levels in alphabetical order, which is usually not the order we want for something like medals.
To fix this, we can specify the order of the levels using the levels argument. The levels must be spelled exactly the same as the values in the vector, or else missing values will be inserted into the factor.
medals_ordered <- factor(medals_won, ordered=TRUE)
medals_ordered
 [1] none silver bronze silver gold none bronze silver bronze bronze
[11] gold bronze bronze bronze none silver none none none gold
Levels: bronze < gold < none < silver
# "GOLD" does not match "gold", so every gold entry becomes NA
medals_ordered <- factor(medals_won, ordered=TRUE,
                         levels = c("none", "bronze", "silver", "GOLD"))
# Corrected spelling
medals_ordered <- factor(medals_won, ordered=TRUE,
                         levels = c("none", "bronze", "silver", "gold"))
medals_ordered
 [1] none silver bronze silver gold none bronze silver bronze bronze
[11] gold bronze bronze bronze none silver none none none gold
Levels: none < bronze < silver < gold
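A quick way to catch a misspelled level is to count the missing values after building the factor and compare the data against the levels. The sketch below uses a hypothetical ratings vector rather than the medal data:

```r
# Hypothetical data with a deliberately misspelled level ("HIGH" instead of "high")
ratings <- c("low", "high", "medium", "high")
bad <- factor(ratings, ordered = TRUE, levels = c("low", "medium", "HIGH"))

# Values that do not match any level are turned into NA
n_missing <- sum(is.na(bad))                          # 2

# Values present in the data but absent from the levels
unmatched <- setdiff(unique(ratings), levels(bad))    # "high"
```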
Looking at the example above, we can see that a mislabeled level (like “GOLD” instead of “gold”) results in missing values. Once the ordering is set correctly, logical comparisons are now meaningful. For example, it makes sense to say that receiving no medal is “less than” receiving a silver medal.
medals_ordered[1]
[1] none
Levels: none < bronze < silver < gold
medals_ordered[2]
[1] silver
Levels: none < bronze < silver < gold
medals_ordered[1] < medals_ordered[2]
[1] TRUE
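Comparisons on an ordered factor also work across a whole vector, and functions such as max() and sort() respect the level order. A small sketch with hypothetical shirt sizes:

```r
# Hypothetical ordered factor: S < M < L
shirts <- factor(c("M", "S", "L", "M"), ordered = TRUE,
                 levels = c("S", "M", "L"))

# Compare every element against a level name
at_least_m <- shirts >= "M"               # TRUE FALSE TRUE TRUE

# Largest value and sorted values follow the level order, not the alphabet
largest <- as.character(max(shirts))      # "L"
in_order <- as.character(sort(shirts))    # "S" "M" "M" "L"
```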
One more useful skill with factors is renaming the level labels. This can be done by using levels() to view the current levels and then assigning a new character vector of level names (in the same order). This will rename all factor values according to the updated level labels.
levels(medals_ordered)
[1] "none" "bronze" "silver" "gold"
levels(medals_ordered) <- c("other", "3rd", "2nd", "1st")
medals_ordered
 [1] other 2nd 3rd 2nd 1st other 3rd 2nd 3rd 3rd 1st 3rd
[13] 3rd 3rd other 2nd other other other 1st
Levels: other < 3rd < 2nd < 1st
levels(medals_ordered)
[1] "other" "3rd" "2nd" "1st"
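If only one label needs to change, it can be replaced by position instead of retyping every level. A minimal sketch on a hypothetical factor called status:

```r
# Hypothetical factor
status <- factor(c("yes", "no", "yes", "maybe"))
old_levels <- levels(status)     # "maybe" "no" "yes"

# Rename just the "maybe" level; every matching value is updated
levels(status)[levels(status) == "maybe"] <- "unsure"

new_values <- as.character(status)   # "yes" "no" "yes" "unsure"
```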
Emmit is tracking the outcome of his weekly quiz grades, but instead of using numbers he labels each attempt as “Proficient”, “Developing”, or “Not-Yet”. Teach Emmit how he can create a vector called quiz_status containing 24 randomly sampled values from those three categories. Explain how he can then convert the vector into a factor and determine how many times each category appears. Finally, show him how to convert it into an ordered factor.