class: ur-title, center, middle, title-slide

.title[
# BST430 Lecture 7
]
.subtitle[
## Factors, Dates and Times
]
.author[
### Tanzy Love, based on the course by Andrew McDavid
]
.institute[
### U of Rochester
]
.date[
### 2021-09-19 (updated: 2024-09-23 by TL)
]

---
class: middle

# Data classes

---

## Data classes

Recall: we talked about *types* (elements) and *classes* (compounds)

- Vectors are like Lego building blocks

--

- We stick them together to build more complicated constructs, e.g. *representations of data*

--

- The **class** attribute relates to the S3 class of an object, which determines its behaviour
  - You don't need to worry about what S3 classes really mean, but you can read more about it [here](https://adv-r.hadley.nz/s3.html#s3-classes) if you're curious

--

- Examples: factors, dates, and data frames

Here's the [R code in this lecture](l07/l07-factors-dates.R)

---

## Factors

R uses factors to handle categorical variables: variables that have a fixed and known set of possible values

``` r
x = factor(c("BS", "MS", "PhD", "MS"))
x
```

```
## [1] BS  MS  PhD MS 
## Levels: BS MS PhD
```

--

.pull-left[
``` r
typeof(x)
```

```
## [1] "integer"
```
]

.pull-right[
``` r
class(x)
```

```
## [1] "factor"
```
]

---

## More on factors

We can think of a factor as a character vector (the level labels) and an integer vector (the level numbers) glued together

``` r
glimpse(x)
```

```
##  Factor w/ 3 levels "BS","MS","PhD": 1 2 3 2
```

``` r
as.integer(x)
```

```
## [1] 1 2 3 2
```

---

## Dates

``` r
library(lubridate)
y = ymd("2020-01-01")
y
```

```
## [1] "2020-01-01"
```

``` r
typeof(y)
```

```
## [1] "double"
```

``` r
class(y)
```

```
## [1] "Date"
```

---

## More on dates

We can think of a date as a number (the days elapsed since the origin, 1 Jan 1970) and the origin itself glued together

``` r
as.integer(y)
```

```
## [1] 18262
```

``` r
as.integer(y) / 365 # roughly 50 yrs
```

```
## [1] 50.03288
```

---
class: middle

# Working with factors

---

## Store and read data as character strings

``` r
# a subset of the NYC flights with 15 biggest destinations
glimpse(flights10)
```

```
## Rows: 182,162
## Columns: 8
## $ origin   <chr> "JFK", "LGA", "EWR", "EWR", "JFK", "LGA", "JFK…
## $ dest     <chr> "MIA", "ATL", "ORD", "FLL", "MCO", "ORD", "TPA…
## $ year     <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013…
## $ month    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time <int> 542, 554, 554, 555, 557, 558, 558, 558, 558, 5…
## $ hour     <dbl> 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6, 6, 6…
## $ minute   <dbl> 40, 0, 58, 0, 0, 0, 0, 0, 0, 0, 59, 0, 0, 5, 1…
```

---

## But coerce when plotting

``` r
ggplot(flights10, mapping = aes(x = dest)) +
  geom_bar() +
  theme_minimal() +
  coord_flip()
```

<img src="l07-factors-dates_files/figure-html/unnamed-chunk-10-1.png" width="60%" style="display: block; margin: auto;" />

---

## Use forcats to manipulate factors

``` r
library(forcats)
*plt = flights10 %>% # assign plot to object
*  mutate(dest_fct = fct_infreq(dest)) %>%
  ggplot(mapping = aes(x = dest_fct)) +
  geom_bar() +
  coord_flip() +
  theme_minimal()
*plt # print it
```

<img src="l07-factors-dates_files/figure-html/unnamed-chunk-11-1.png" width="55%" style="display: block; margin: auto;" />

---

## Useful forcats functions

* Set factor levels by order of appearance `fct_inorder()`, numeric value `fct_inseq()`, number of observations `fct_infreq()`
* Reverse order of factor levels `fct_rev()`, make an arbitrary change `fct_relevel()`
* Combine rare factor levels, keeping the top levels by rank `fct_lump_n()`, or lumping by minimum number of occurrences `fct_lump_min()` or by relative frequency `fct_lump_prop()`
* Recode/combine level names `fct_recode()`, combine and drop levels `fct_collapse()`
* Set level order by covariate `fct_reorder()`
* Convert `NA` to an explicit level `fct_explicit_na()` (deprecated in forcats 1.0.0 in favour of `fct_na_value_to_level()`)

---

## Collapse rare categories

``` r
*plt %+% # replace plot data
  (flights_sml %>%
*   mutate(dest_fct = fct_lump_n(dest, 15) %>% fct_infreq))
```

<img
src="l07-factors-dates_files/figure-html/unnamed-chunk-12-1.png" width="60%" style="display: block; margin: auto;" />

---

## Recode categories

``` r
plt %+%
  (flights_sml %>%
*   mutate(dest_fct = fct_recode(dest, 'Western NY' = 'ROC',
*                                'Western NY' = 'BUF',
*                                'Western NY' = 'SYR') %>%
      fct_lump_n(15) %>% fct_infreq))
```

<img src="l07-factors-dates_files/figure-html/unnamed-chunk-13-1.png" width="60%" style="display: block; margin: auto;" />

For more complicated recoding, consider storing the old-to-new mapping in a data frame and applying it with a join.

---
class: middle

# Working with dates

---

## Make a date

.pull-left[
<img src="l07/img/lubridate-not-part-of-tidyverse.png" width="65%" style="display: block; margin: auto;" />
]

.pull-right[
- **lubridate** is the tidyverse-friendly package that makes dealing with dates a little easier
- It's installed with `install.packages("tidyverse")`; load it explicitly with `library(lubridate)` (from tidyverse 2.0.0 on, `library(tidyverse)` loads it too)
]

---
class: middle

.hand[.light-blue[
we're just going to scratch the surface of working with dates in R here...
]]

---

.question[
Calculate and visualize the number of departures on any given date.
]

---

## Step 1. Construct dates

.midi[
``` r
library(glue)
flights %>%
  mutate(
*   date = glue("{year} {month} {day}")
    ) %>%
  relocate(date)
```

```
## # A tibble: 336,776 × 20
##   date      year month   day dep_time sched_dep_time dep_delay
##   <glue>   <int> <int> <int>    <int>          <int>     <dbl>
## 1 2013 1 1  2013     1     1      517            515         2
## 2 2013 1 1  2013     1     1      533            529         4
## 3 2013 1 1  2013     1     1      542            540         2
## 4 2013 1 1  2013     1     1      544            545        -1
...
```
]

---

## Step 2. Count flights per date

.midi[
``` r
flights %>%
  mutate(date = glue("{year} {month} {day}")) %>%
  count(date)
```

```
## # A tibble: 365 × 2
##   date          n
##   <glue>    <int>
## 1 2013 1 1    842
## 2 2013 1 10   932
## 3 2013 1 11   930
## 4 2013 1 12   690
## # ℹ 361 more rows
```
]

---

## Step 3. Visualize flights per date

.midi[
``` r
flights %>%
  mutate(date = glue("{year} {month} {day}")) %>%
  count(date) %>%
  ggplot(aes(x = date, y = n, group = 1)) +
  geom_line()
```

<img src="l07-factors-dates_files/figure-html/unnamed-chunk-17-1.png" width="80%" style="display: block; margin: auto;" />
]

---

.hand[zooming in a bit... first 7 days `slice(1:7)`]

.question[
Why does 10 Jan come after 1 Jan?
]

.midi[
<img src="l07-factors-dates_files/figure-html/unnamed-chunk-18-1.png" width="80%" style="display: block; margin: auto;" />
]

---

## Step 1. *REVISED* Construct dates "as dates"

.midi[
``` r
library(lubridate)
flights %>%
  mutate(
*   date = ymd(glue("{year} {month} {day}"))
    ) %>%
  relocate(date)
```

```
## # A tibble: 336,776 × 20
##   date        year month   day dep_time sched_dep_time dep_delay
##   <date>     <int> <int> <int>    <int>          <int>     <dbl>
## 1 2013-01-01  2013     1     1      517            515         2
## 2 2013-01-01  2013     1     1      533            529         4
## 3 2013-01-01  2013     1     1      542            540         2
## 4 2013-01-01  2013     1     1      544            545        -1
...
```
]

---

## Step 2. Count flights per date

.midi[
``` r
flights %>%
  mutate(date = ymd(glue("{year} {month} {day}"))) %>%
  count(date)
```

```
## # A tibble: 365 × 2
##   date           n
##   <date>     <int>
## 1 2013-01-01   842
## 2 2013-01-02   943
## 3 2013-01-03   914
## 4 2013-01-04   915
## # ℹ 361 more rows
```
]

---

## Step 3. Visualize flights per date

.midi[
``` r
flights %>%
  mutate(date = ymd(glue("{year} {month} {day}"))) %>%
  count(date) %>%
  ggplot(aes(x = date, y = n, group = 1)) +
  geom_line()
```

<img src="l07-factors-dates_files/figure-html/unnamed-chunk-21-1.png" width="80%" style="display: block; margin: auto;" />
]

---

## Flights by day of week

[code](l07/flights_day_of_week.R)

---

## Other lubridate date functions

* Construct from month day year `mdy()`, or day month year `dmy()`
* Extract or set components `month()`, `day()`, `year()`
* Extract `quarter()` or day of week `weekdays()` (the latter is base R)
* Extract or set day-of-year `yday()`, day-of-quarter `qday()`, day-of-week `wday()`
* Days elapsed between two dates: `date1 - date2`
* Lead or lag: `date + days_to_lead`

---
class: middle

# Working with times

---

# Date + Time

When you have a date + a time, everything works as before; it just takes extra oomph to construct them:

``` r
dmy_hms('22-Sep-2021 11:00:00')
```

```
## [1] "2021-09-22 11:00:00 UTC"
```

--

... and you need to worry about the timezone 🙀

``` r
dmy_hms('22-Sep-2021 11:00:00', tz = 'America/New_York')
```

```
## [1] "2021-09-22 11:00:00 EDT"
```

---

## When you only have / want the time?

``` r
flights_sml = flights_sml %>%
* mutate(time = hm(glue("{hour} {minute}"))) %>%
  relocate(time)
flights_sml
```

```
## # A tibble: 336,776 × 9
##   time       origin dest   year month   day dep_time  hour minute
##   <Period>   <chr>  <chr> <int> <int> <int>    <int> <dbl>  <dbl>
## 1 5H 15M 0S  EWR    IAH    2013     1     1      517     5     15
## 2 5H 29M 0S  LGA    IAH    2013     1     1      533     5     29
## 3 5H 40M 0S  JFK    MIA    2013     1     1      542     5     40
## 4 5H 45M 0S  JFK    BQN    2013     1     1      544     5     45
## # ℹ 336,772 more rows
```

---

## When do flights depart?
``` r
ggplot(flights_sml, aes(x = time, fill = origin))
```

<img src="l07-factors-dates_files/figure-html/unnamed-chunk-25-1.png" width="60%" style="display: block; margin: auto;" />

---

## `Period` needs special treatment in ggplot2

``` r
ggplot(flights_sml, aes(x = time, fill = origin)) +
  geom_density(alpha = .5) +
* scale_x_time()
```

<img src="l07-factors-dates_files/figure-html/unnamed-chunk-26-1.png" width="60%" style="display: block; margin: auto;" />

---

### Probably under-smoothed, weird units

.panelset[
.panel[.panel-name[Code]

``` r
plt = ggplot(flights_sml, aes(x = time, after_stat(count), fill = origin)) +
* geom_density(alpha = .5, bw = 1800) + # 30*60 seconds
  scale_x_time() +
  scale_y_continuous(sec.axis =
    # `transform` replaces the deprecated `trans` argument (ggplot2 >= 3.5.0)
*   sec_axis(transform = ~ .x/365*3600, name = 'Departures/day/hour')) +
  theme_minimal() +
  labs(y = "Departures/year/second")
```

``` r
plt
```
]

.panel[.panel-name[Plot]

<img src="l07-factors-dates_files/figure-html/unnamed-chunk-27-1.png" width="75%" style="display: block; margin: auto;" />
]
]

---
class: code70

## Why bimodal?
.panelset[
.panel[.panel-name[Code]

``` r
flights_jn = flights_sml %>%
  left_join(airports, c('dest' = 'faa')) %>%
  filter(!is.na(lon)) %>% # missing Puerto Rico
* mutate(region = cut(lon,
*   breaks = c(-158, -124, -104, -83, -66), # 5 breakpoints
*   labels = c('HI/AK', 'West', 'Central', 'East'))) # 4 groups
plt %+% flights_jn + facet_wrap(~region, scales = 'free_y') +
  theme(axis.text.x = element_text(angle = 90))
```
]

.panel[.panel-name[Plot]

<img src="l07-factors-dates_files/figure-html/unnamed-chunk-28-1.png" width="75%" style="display: block; margin: auto;" />
]
]

---

## Acknowledgments and Resources

Adapted from [Data science in a box](https://rstudio-education.github.io/datascience-box/course-materials/slides/u2-d11-data-classes/u2-d11-data-classes.html#1)

[R4DS on Factors](https://r4ds.had.co.nz/factors.html)
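
---

## Appendix: why character dates sort out of order

A minimal base-R sketch of the "Why does 10 Jan come after 1 Jan?" question from the dates section. The three strings are illustrative values (not taken from `flights`), and `as.Date()` with an explicit format is used as a stand-in for `lubridate::ymd()` so the example runs without extra packages.

``` r
# Illustrative strings in the same "{year} {month} {day}" shape that glue() produced
date_chr = c("2013 1 1", "2013 1 2", "2013 1 10")

# Text is compared character by character, so "1 10" lands before "1 2".
# method = "radix" sorts in the C locale, keeping the result locale-independent.
sorted_chr = sort(date_chr, method = "radix")
sorted_chr
# "2013 1 1"  "2013 1 10" "2013 1 2"

# Parsing to the Date class first gives chronological order
sorted_date = sort(as.Date(date_chr, format = "%Y %m %d"))
format(sorted_date)
# "2013-01-01" "2013-01-02" "2013-01-10"
```

The same reasoning explains the jagged line plot: ggplot2 orders a character x-axis as text, while a `Date` x-axis is ordered in time.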
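
---

## Appendix: factors are integer codes plus labels

A small base-R sketch of the "glued together" picture of factors from the data classes section. The object names `codes`, `labs`, `rebuilt`, and `f` are illustrative, not from the lecture code. It also shows a common trap that follows from that structure: converting a factor of digits with `as.integer()` returns the level codes, not the numbers.

``` r
x = factor(c("BS", "MS", "PhD", "MS"))

codes = as.integer(x)   # level numbers: 1 2 3 2
labs  = levels(x)       # level labels: "BS" "MS" "PhD"

# Indexing the labels by the codes recovers the original strings
rebuilt = labs[codes]
identical(rebuilt, as.character(x))
# TRUE

# Trap: digits stored as a factor. Levels sort as text, so "10" comes before "5".
f = factor(c("10", "5", "10"))
as.integer(f)                 # 1 2 1 (the codes)
as.integer(as.character(f))   # 10 5 10 (the values)
```

Going through `as.character()` first is the usual way to get the underlying values back.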