class: ur-title, center, middle, title-slide .title[ # BST430 Lecture 5b ] .subtitle[ ## dplyr ii ] .author[ ### Tanzy Love, based on the course by Andrew McDavid ] .institute[ ### U of Rochester ] .date[ ### 2021-09-12 (updated: 2024-09-18 by TL) ] --- class: middle .hand[# A lil' bit more dplyr] --- ## Agenda * grouping by multiple variables * empty groups and summarizing * ungrouping * grouped mutates and filters * special selection operators * repeating operations on multiple columns with `across` <!-- * row names and row numbers--> Here's the [R code in this lecture](l05a/l05b-advanced-dplyr.R) --- class: code70 ## grouping As we saw in Lecture 3, by adding `group_by()` to a pipeline we stratify by a categorical variable when we `summarize()`: .pull-left[ **Grand margin** ``` r library(nycflights13) ontime = flights %>% mutate(ontime = arr_delay <= 0) ontime %>% summarize(ontime_pct = mean(ontime, na.rm = TRUE)*100) ``` ``` ## # A tibble: 1 × 1 ## ontime_pct ## <dbl> ## 1 59.4 ``` ] .pull-right[ **Stratified by carrier** ``` r ontime %>% group_by(carrier) %>% * summarize(ontime_pct = mean(ontime, na.rm = TRUE)*100) ``` ``` ## # A tibble: 16 × 2 ## carrier ontime_pct ## <chr> <dbl> ## 1 9E 61.6 ## 2 AA 66.5 ## 3 AS 73.3 ## 4 B6 56.3 ## # ℹ 12 more rows ``` ] --- ## `group_by()` multiple variables It's easy to group by multiple variables: ```r flights %>% group_by(<var_1>, <var_2>, ..., <var_n>) %>% summarize() ``` --- ## `group_by()` caveat Be aware that this only returns the combinations of `var_1`, `var_2`, ... that exist in the data. This can lead to erroneous calculations if you are joining tables downstream. To get the full Cartesian product: 1. Convert `var_1`, `var_2`, ... to factors with `factor()`.<sup>1</sup> 2. Set `group_by(..., .drop = FALSE)`. .footnote[[1] We'll discuss the pain and pleasure of factors in greater detail next week.] <!-- [2] If you are a masochist. --> --- ## Nonsense from dropping .panelset[ .panel[.panel-name[Code] ``` r departures = flights %>% group_by(carrier, hour) %>% summarize(n_departures = n()) ggplot(departures, aes(x = hour, y = n_departures)) + geom_line() + * facet_wrap(~carrier, scales = 'free_y') ``` ] .panel[.panel-name[Plot] <img src="l05b-advanced-dplyr_files/figure-html/unnamed-chunk-5-1.png" width="75%" style="display: block; margin: auto;" /> ] ] --- ## Keep (wacky factors) .panelset[ .panel[.panel-name[Code] ``` r departures2 = flights %>% * mutate(carrier = factor(carrier), hour = factor(hour)) %>% * group_by(carrier, hour, .drop = FALSE) %>% summarize(n_departures = n()) plt = ggplot(departures2, aes(x = hour, y = n_departures, group = carrier)) + geom_line() + facet_wrap(~carrier, scales = 'free_y') plt ``` ] .panel[.panel-name[Plot] <img src="l05b-advanced-dplyr_files/figure-html/unnamed-chunk-6-1.png" width="75%" style="display: block; margin: auto;" /> ] ] --- ## Keep (preserve original) .panelset[ .panel[.panel-name[Code] ``` r departures3 = flights %>% mutate(carrier = factor(carrier), hourf = factor(hour)) %>% group_by(carrier, hourf, .drop = FALSE) %>% summarize(n_departures = n()) %>% * mutate(hour = as.numeric(as.character(hourf))) ``` ``` ## `summarise()` has grouped output by 'carrier'. You can override ## using the `.groups` argument. 
``` ``` r *(plt %+% departures3) + aes(x = hour) ``` ] .panel[.panel-name[Plot] <img src="l05b-advanced-dplyr_files/figure-html/unnamed-chunk-7-1.png" width="75%" style="display: block; margin: auto;" /> ] ] --- ## grouping is contagious Note the message: > `summarise()` has grouped output by 'carrier'. You can override using the `.groups` argument. This is telling us that, by default, the output will still be grouped (here by `carrier`) if we use it in another pipeline. To save screen real estate, I'm going to turn off this message going forward: ``` r options(dplyr.summarise.inform = FALSE) ``` In general you should set `.groups` explicitly. --- ## `ungroup()` to remove grouping variables. .pull-left[ Default ``` r ontime_drop = ontime %>% group_by(carrier, dest) %>% summarize(ontime_pct = mean(ontime, na.rm = TRUE)*100) ontime_drop %>% summarize(n_dest = n_distinct(dest)) ``` ``` ## # A tibble: 16 × 2 ## carrier n_dest ## <chr> <int> ## 1 9E 49 ## 2 AA 19 ## 3 AS 1 ## 4 B6 42 ## # ℹ 12 more rows ``` ] .pull-right[ Ungrouped ``` r *ontime_drop %>% ungroup() %>% summarize(n_dest = n_distinct(dest)) ``` ``` ## # A tibble: 1 × 1 ## n_dest ## <int> ## 1 105 ``` ] --- ## `group_by()` + `filter()` We can combine `group_by()` with `filter()`, `mutate()`, or `arrange()` to evaluate these verbs within each group. ``` r ontime_1500 = ontime %>% group_by(dest) %>% filter(n() > 1500) %>% mutate(n_flights_dest = n(), .after = 1) ontime_1500 %>% arrange(n_flights_dest) ``` ``` ## # A tibble: 314,599 × 21 ## # Groups: dest [53] ## year n_flights_dest month day dep_time sched_dep_time ## <int> <int> <int> <int> <int> <int> ## 1 2013 1525 1 1 1304 1227 ## 2 2013 1525 1 1 1732 1630 ## 3 2013 1525 1 2 834 823 ## 4 2013 1525 1 2 1228 1229 ## # ℹ 314,595 more rows ## # ℹ 15 more variables: dep_delay <dbl>, arr_time <int>, ## # sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, … ``` --- ## Carrier performance by destination ``` r ontime_drop = ontime_1500 %>% group_by(dest, carrier) %>% summarize(ontime_pct = mean(ontime, na.rm = TRUE)*100) ggplot(ontime_drop, aes(y = ontime_pct, x = dest)) + geom_point() + coord_flip() + theme_minimal(base_size = 8) ``` <img src="l05b-advanced-dplyr_files/figure-html/unnamed-chunk-12-1.png" width="60%" style="display: block; margin: auto;" /> .question[What's with the carriers with 0% or 100% ontime percentages for a destination?] --- ## The "law" of small numbers ``` r ontime_drop = ontime_1500 %>% group_by(dest, carrier) %>% summarize(ontime_pct = mean(ontime, na.rm = TRUE)*100, n_departed = sum(!is.na(ontime))) ggplot(ontime_drop, aes(y = ontime_pct, x = dest, color = n_departed<10)) + geom_point() + coord_flip() + theme_minimal(base_size = 8) ``` <img src="l05b-advanced-dplyr_files/figure-html/unnamed-chunk-13-1.png" width="60%" style="display: block; margin: auto;" /> .question[What are some statistical fixes to this issue?] --- ## Without small numbers ``` r ontime_drop_10 = filter(ontime_drop, n_departed >= 10) ggplot(ontime_drop_10, aes(y = ontime_pct, x = dest)) + geom_boxplot() + coord_flip() + theme_minimal(base_size = 8) ``` <img src="l05b-advanced-dplyr_files/figure-html/unnamed-chunk-14-1.png" width="60%" style="display: block; margin: auto;" /> Better...but it's hard to see the rank order of routes. 
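---

## Aside: one possible fix

One answer to the earlier question about statistical fixes (a sketch, not the only option): rather than dropping small routes, shrink each destination-carrier estimate toward the overall on-time rate by mixing in pseudo-counts, so routes with only a handful of departures are pulled the most. The prior weight of 20 pseudo-departures below is an arbitrary choice for illustration, not a tuned value.

```r
overall_rate = mean(ontime$ontime, na.rm = TRUE) * 100  # grand margin, about 59.4

ontime_smoothed = ontime_drop %>%
  ungroup() %>%
  # weighted average of each route's observed rate and the overall rate,
  # where the overall rate counts as 20 extra "departures"
  mutate(ontime_shrunk = (n_departed * ontime_pct + 20 * overall_rate) /
           (n_departed + 20))
```

Routes with hundreds of departures barely move, while a route with two departures gets pulled most of the way to the grand mean.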
--- ## A sensible plot order ``` r ontime_drop_10 = ontime_drop_10 %>% ungroup() %>% * mutate(dest = forcats::fct_reorder(dest, ontime_pct)) ggplot(ontime_drop_10, aes(x = dest, y = ontime_pct)) + geom_boxplot() + coord_flip() + theme_minimal(base_size = 8) ``` <img src="l05b-advanced-dplyr_files/figure-html/unnamed-chunk-15-1.png" width="60%" style="display: block; margin: auto;" /> We'll discuss the `fct_reorder` business in greater detail next week. --- ## `select()` <!-- 201 --> `select()`, `pivot_wider()`, `pivot_longer()`, and friends use [tidy-select](https://dplyr.tidyverse.org/reference/dplyr_tidy_select.html), which is a mini-language for selecting columns from a data frame. There are several helper functions you can use within `select()`: - `starts_with("abc")`: matches names that begin with "abc", - `ends_with("xyz")`: matches names that end with "xyz", - `contains("ijk")`: matches names that contain "ijk", - `num_range("x", 1:3)`: matches `x1`, `x2` and `x3`, - `last_col(offset = 0)`: last column, minus offset, - `where(<function>)`: columns where `<function>` returns `TRUE`, - `:` range, `c` concatenate, `!` negate, `&` intersect - ...and a few others. --- ## select delay columns ``` r flights %>% select(ends_with('delay')) ``` ``` ## # A tibble: 336,776 × 2 ## dep_delay arr_delay ## <dbl> <dbl> ## 1 2 11 ## 2 4 20 ## 3 2 33 ## 4 -1 -18 ## # ℹ 336,772 more rows ``` --- ## everything except columns containing "time" ``` r flights %>% select(!contains('time')) ``` ``` ## # A tibble: 336,776 × 13 ## year month day dep_delay arr_delay carrier flight tailnum ## <int> <int> <int> <dbl> <dbl> <chr> <int> <chr> ## 1 2013 1 1 2 11 UA 1545 N14228 ## 2 2013 1 1 4 20 UA 1714 N24211 ## 3 2013 1 1 2 33 AA 1141 N619AA ## 4 2013 1 1 -1 -18 B6 725 N804JB ## # ℹ 336,772 more rows ## # ℹ 5 more variables: origin <chr>, dest <chr>, distance <dbl>, ## # hour <dbl>, minute <dbl> ``` --- ## character columns ``` r flights %>% select(where(is.character)) ``` ``` ## # A tibble: 336,776 × 4 ## carrier tailnum origin dest ## <chr> <chr> <chr> <chr> ## 1 UA N14228 EWR IAH ## 2 UA N24211 LGA IAH ## 3 AA N619AA JFK MIA ## 4 B6 N804JB JFK BQN ## # ℹ 336,772 more rows ``` --- ## Columns with no missing values ``` r flights %>% select(where(~!any(is.na(.x)))) ``` ``` ## # A tibble: 336,776 × 13 ## year month day sched_dep_time sched_arr_time carrier flight ## <int> <int> <int> <int> <int> <chr> <int> ## 1 2013 1 1 515 819 UA 1545 ## 2 2013 1 1 529 830 UA 1714 ## 3 2013 1 1 540 850 AA 1141 ## 4 2013 1 1 545 1022 B6 725 ## # ℹ 336,772 more rows ## # ℹ 6 more variables: origin <chr>, dest <chr>, distance <dbl>, ## # hour <dbl>, minute <dbl>, time_hour <dttm> ``` The `~!any(is.na(.x))` above uses the tidyverse shorthand for defining an **anonymous function**, and is equivalent to ``` none_na = function(x) !any(is.na(x)) ``` --- class: middle .hand[# Repeating an operation with `across`] --- ## `across()` If you need to repeat the same operation on multiple columns, you can avoid many keystrokes by using `across()`. Its syntax is a bit idiosyncratic, but in general it looks something like ```r verb(across(.cols = <tidy selection>, .fns = list(<fun1>, <fun2>, ...))) ``` where `verb` is `summarize()`, `mutate()`, or one of the few other verbs that take unquoted columns to transform. 
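For example, here's a quick sketch on the `flights` data loaded earlier (the grouping by `origin` and the choice of the delay columns are just for illustration). A named list of functions yields one output column per variable-function pair, e.g. `dep_delay_mean`:

```r
flights %>%
  group_by(origin) %>%
  summarize(across(ends_with("delay"),
                   list(mean = ~ mean(.x, na.rm = TRUE),
                        sd   = ~ sd(.x, na.rm = TRUE))))
```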
--- ## NYC AirBNB data ``` r nycbnb ``` ``` ## # A tibble: 12,773 × 16 ## id name host_id price neighborhood accommodates room_type ## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <chr> ## 1 2595 Skylit… 2845 150 Midtown 1 Entire h… ## 2 3831 Whole … 4869 75 Bedford-Stu… 3 Entire h… ## 3 5121 BlissA… 7356 60 Bedford-Stu… 2 Private … ## 4 5136 Spacio… 7378 275 Sunset Park 4 Entire h… ## # ℹ 12,769 more rows ## # ℹ 9 more variables: borough <chr>, minimum_nights <dbl>, ## # maximum_nights <dbl>, availability_90 <dbl>, beds <dbl>, … ``` --- ## Nights .panelset[ .panel[.panel-name[Code] ``` r availability = nycbnb %>% group_by(neighborhood) %>% * summarize(across(c(maximum_nights, minimum_nights, availability_90), median)) ggplot(availability, aes(x = neighborhood, ymin = minimum_nights, y = availability_90, ymax = maximum_nights)) + geom_pointrange() + coord_flip() + theme_minimal() + labs(x = NULL, y = 'Availability (minimum, 90 day, maximum)') ``` ] .panel[.panel-name[Plot] <img src="l05b-advanced-dplyr_files/figure-html/unnamed-chunk-22-1.png" width="75%" style="display: block; margin: auto;" /> ] ] --- ## Transform by variable type This can be very powerful: ``` r median_mad = nycbnb %>% group_by(neighborhood) %>% * summarise(across(where(is.numeric), * list(median = median, mad = mad))) glimpse(median_mad) ``` ``` ## Rows: 41 ## Columns: 23 ## $ neighborhood <chr> "Astoria", "Bedford-Stuyve… ## $ id_median <dbl> 26455688, 29970716, 337258… ## $ id_mad <dbl> 22196285, 22187834, 188431… ## $ host_id_median <dbl> 48372942, 39365483, 502829… ## $ host_id_mad <dbl> 66969215, 56397777, 705862… ## $ price_median <dbl> 75.0, 100.0, 75.0, 98.0, 2… ## $ price_mad <dbl> 35.5824, 66.7170, 44.4780,… ## $ accommodates_median <dbl> 2.0, 2.0, 2.0, 3.0, 2.0, 3… ## $ accommodates_mad <dbl> 1.4826, 1.4826, 1.4826, 1.… ## $ minimum_nights_median <dbl> 14.0, 5.0, 5.0, 2.0, 5.0, … ## $ minimum_nights_mad <dbl> 19.2738, 5.9304, 5.9304, 1… ## $ maximum_nights_median <dbl> 1125, 365, 365, 365, 365, … ## $ maximum_nights_mad <dbl> 0.0000, 526.3230, 526.3230… ## $ availability_90_median <dbl> 50.0, 38.0, 47.0, 70.0, 25… ## $ availability_90_mad <dbl> 41.5128, 40.0302, 40.0302,… ## $ beds_median <dbl> NA, NA, NA, NA, NA, 1, NA,… ## $ beds_mad <dbl> NA, NA, NA, NA, NA, 1.4826… ## $ review_scores_rating_median <dbl> 4.830, 4.800, 4.800, 4.810… ## $ review_scores_rating_mad <dbl> 0.252042, 0.266868, 0.2965… ## $ number_of_reviews_median <dbl> 17.0, 20.0, 13.0, 17.0, 13… ## $ number_of_reviews_mad <dbl> 22.2390, 26.6868, 17.7912,… ## $ number_of_reviews_ltm_median <dbl> 2.0, 2.0, 2.0, 4.0, 3.0, 3… ## $ number_of_reviews_ltm_mad <dbl> 2.9652, 2.9652, 2.9652, 5.… ```
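---

## Controlling `across()` output

A few refinements to the last example, sketched below: drop the `id` and `host_id` columns from the selection with `!`, since they aren't meaningful to summarize; use anonymous functions to pass `na.rm = TRUE`, so missing `beds` values don't force those summaries to `NA`; and use the `.names` glue specification to control how the output columns are named (the `"{.fn}_of_{.col}"` pattern is just one possible choice).

```r
nycbnb %>%
  group_by(neighborhood) %>%
  summarise(across(where(is.numeric) & !c(id, host_id),
                   list(median = ~ median(.x, na.rm = TRUE),
                        mad    = ~ mad(.x, na.rm = TRUE)),
                   # default naming is "{.col}_{.fn}"; this puts the function first
                   .names = "{.fn}_of_{.col}"))
```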