class: ur-title, center, middle, title-slide .title[ # BST430 Lecture 5b ] .subtitle[ ## dplyr ii ] .author[ ### Tanzy Love, based on the course by Andrew McDavid ] .institute[ ### U of Rochester ] .date[ ### 2021-09-12 (updated: 2024-09-18 by TL) ] --- class: middle .hand[# A lil' bit more dplyr] --- ## Agenda * grouping by multiple variables * empty groups and summarizing * ungrouping * grouped mutates and filters * special selection operators * repeating operations on multiple columns with `across` <!-- * row names and row numbers--> Here's the [R code in this lecture](l05a/l05b-advanced-dplyr.R) --- class: code70 ## grouping As we saw in Lecture 3, by adding `group_by()` to a pipeline we stratify by a categorical variable when we `summarize()`: .pull-left[ **Grand margin** ``` r library(nycflights13) ontime = flights %>% mutate(ontime = arr_delay <= 0) ontime %>% summarize(ontime_pct = mean(ontime, na.rm = TRUE)*100) ``` ``` ## # A tibble: 1 × 1 ## ontime_pct ## <dbl> ## 1 59.4 ``` ] .pull-right[ **Stratified by carrier** ``` r ontime %>% group_by(carrier) %>% * summarize(ontime_pct = mean(ontime, na.rm = TRUE)*100) ``` ``` ## # A tibble: 16 × 2 ## carrier ontime_pct ## <chr> <dbl> ## 1 9E 61.6 ## 2 AA 66.5 ## 3 AS 73.3 ## 4 B6 56.3 ## # ℹ 12 more rows ``` ] --- ## `group_by()` multiple variables It's easy to group by multiple variables: ```r flights %>% group_by(<var_1>, <var_2>, ..., <var_n>) %>% summarize() ``` --- ## `group_by()` caveat Be aware that this only returns the combinations of `var_1`, `var_2`, ... that exist in the data. This can lead to erroneous calculations if you are joining tables downstream. To get the full Cartesian product: 1. Convert `var_1`, `var_2`, ... to factors with `factor()`.<sup>1</sup> 2. Set `group_by(..., .drop = FALSE)`. .footnote[[1] We'll discuss the pain and pleasure of factors in greater detail next week.] <!-- [2] If you are a masochist. --> --- ## Nonsense from dropping .panelset[ .panel[.panel-name[Code] ``` r departures = flights %>% group_by(carrier, hour) %>% summarize(n_departures = n()) ggplot(departures, aes(x = hour, y = n_departures)) + geom_line() + * facet_wrap(~carrier, scales = 'free_y') ``` ] .panel[.panel-name[Plot] <img src="l05b-advanced-dplyr_files/figure-html/unnamed-chunk-5-1.png" width="75%" style="display: block; margin: auto;" /> ] ] --- ## Keep (wacky factors) .panelset[ .panel[.panel-name[Code] ``` r departures2 = flights %>% * mutate(carrier = factor(carrier), hour = factor(hour)) %>% * group_by(carrier, hour, .drop = FALSE) %>% summarize(n_departures = n()) plt = ggplot(departures2, aes(x = hour, y = n_departures, group = carrier)) + geom_line() + facet_wrap(~carrier, scales = 'free_y') plt ``` ] .panel[.panel-name[Plot] <img src="l05b-advanced-dplyr_files/figure-html/unnamed-chunk-6-1.png" width="75%" style="display: block; margin: auto;" /> ] ] --- ## Keep (preserve original) .panelset[ .panel[.panel-name[Code] ``` r departures3 = flights %>% mutate(carrier = factor(carrier), hourf = factor(hour)) %>% group_by(carrier, hourf, .drop = FALSE) %>% summarize(n_departures = n()) %>% * mutate(hour = as.numeric(as.character(hourf))) ``` ``` ## `summarise()` has grouped output by 'carrier'. You can override ## using the `.groups` argument. 
``` ``` r *(plt %+% departures3) + aes(x = hour) ``` ] .panel[.panel-name[Plot] <img src="l05b-advanced-dplyr_files/figure-html/unnamed-chunk-7-1.png" width="75%" style="display: block; margin: auto;" /> ] ] --- ## grouping is contagious Note the message: > `summarise()` has grouped output by 'carrier'. You can override using the `.groups` argument. This is telling us that, by default, the output will still be grouped (here by `carrier`) if we use it in another pipeline. To save screen real estate, I'm going to turn off this message going forward: ``` r options(dplyr.summarise.inform = FALSE) ``` In general you should set `.groups` explicitly. --- ## `ungroup()` to remove grouping variables. .pull-left[ Default ``` r ontime_drop = ontime %>% group_by(carrier, dest) %>% summarize(ontime_pct = mean(ontime, na.rm = TRUE)*100) ontime_drop %>% summarize(n_dest = n_distinct(dest)) ``` ``` ## # A tibble: 16 × 2 ## carrier n_dest ## <chr> <int> ## 1 9E 49 ## 2 AA 19 ## 3 AS 1 ## 4 B6 42 ## # ℹ 12 more rows ``` ] .pull-right[ Ungrouped ``` r *ontime_drop %>% ungroup() %>% summarize(n_dest = n_distinct(dest)) ``` ``` ## # A tibble: 1 × 1 ## n_dest ## <int> ## 1 105 ``` ] --- ## `group_by()` + `filter()` We can combine `group_by()` with `filter()`, `mutate()`, or `arrange()` to evaluate these verbs within each group. ``` r ontime_1500 = ontime %>% group_by(dest) %>% filter(n() > 1500) %>% mutate(n_flights_dest = n(), .after = 1) ontime_1500 %>% arrange(n_flights_dest) ``` ``` ## # A tibble: 314,599 × 21 ## # Groups: dest [53] ## year n_flights_dest month day dep_time sched_dep_time ## <int> <int> <int> <int> <int> <int> ## 1 2013 1525 1 1 1304 1227 ## 2 2013 1525 1 1 1732 1630 ## 3 2013 1525 1 2 834 823 ## 4 2013 1525 1 2 1228 1229 ## # ℹ 314,595 more rows ## # ℹ 15 more variables: dep_delay <dbl>, arr_time <int>, ## # sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, … ``` --- ## Carrier performance by destination ``` r ontime_drop = ontime_1500 %>% group_by(dest, carrier) %>% summarize(ontime_pct = mean(ontime, na.rm = TRUE)*100) ggplot(ontime_drop, aes(y = ontime_pct, x = dest)) + geom_point() + coord_flip() + theme_minimal(base_size = 8) ``` <img src="l05b-advanced-dplyr_files/figure-html/unnamed-chunk-12-1.png" width="60%" style="display: block; margin: auto;" /> .question[What's with the carriers with 0% or 100% ontime percentages for a destination?] --- ## The "law" of small numbers ``` r ontime_drop = ontime_1500 %>% group_by(dest, carrier) %>% summarize(ontime_pct = mean(ontime, na.rm = TRUE)*100, n_departed = sum(!is.na(ontime))) ggplot(ontime_drop, aes(y = ontime_pct, x = dest, color = n_departed<10)) + geom_point() + coord_flip() + theme_minimal(base_size = 8) ``` <img src="l05b-advanced-dplyr_files/figure-html/unnamed-chunk-13-1.png" width="60%" style="display: block; margin: auto;" /> .question[What are some statistical fixes to this issue?] --- ## Without small numbers ``` r ontime_drop_10 = filter(ontime_drop, n_departed >= 10) ggplot(ontime_drop_10, aes(y = ontime_pct, x = dest)) + geom_boxplot() + coord_flip() + theme_minimal(base_size = 8) ``` <img src="l05b-advanced-dplyr_files/figure-html/unnamed-chunk-14-1.png" width="60%" style="display: block; margin: auto;" /> Better...but it's hard to see the rank order of routes. 
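---

## Aside: one possible fix

One answer to the earlier question about statistical fixes (a sketch, not the only option): rather than dropping small routes, shrink each destination-carrier estimate toward the overall on-time rate by mixing in pseudo-counts, so routes with only a handful of departures are pulled the most. The prior weight of 20 pseudo-departures below is an arbitrary choice for illustration, not a tuned value.

```r
overall_rate = mean(ontime$ontime, na.rm = TRUE) * 100  # grand margin, about 59.4

ontime_smoothed = ontime_drop %>%
  ungroup() %>%
  # weighted average of each route's observed rate and the overall rate,
  # where the overall rate counts as 20 extra "departures"
  mutate(ontime_shrunk = (n_departed * ontime_pct + 20 * overall_rate) /
           (n_departed + 20))
```

Routes with hundreds of departures barely move, while a route with two departures gets pulled most of the way to the grand mean.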
--- ## A sensible plot order ``` r ontime_drop_10 = ontime_drop_10 %>% ungroup() %>% * mutate(dest = forcats::fct_reorder(dest, ontime_pct)) ggplot(ontime_drop_10, aes(x = dest, y = ontime_pct)) + geom_boxplot() + coord_flip() + theme_minimal(base_size = 8) ``` <img src="l05b-advanced-dplyr_files/figure-html/unnamed-chunk-15-1.png" width="60%" style="display: block; margin: auto;" /> We'll discuss the `fct_reorder` business in greater detail next week. --- ## `select()` <!-- 201 --> `select()`, `pivot_wider()`, `pivot_longer()`, and friends use [tidy-select](https://dplyr.tidyverse.org/reference/dplyr_tidy_select.html), which is a mini-language for selecting columns from a data frame. There are several helper functions you can use within `select()`: - `starts_with("abc")`: matches names that begin with "abc", - `ends_with("xyz")`: matches names that end with "xyz", - `contains("ijk")`: matches names that contain "ijk", - `num_range("x", 1:3)`: matches `x1`, `x2` and `x3`, - `last_col(offset = 0)`: last column, minus offset, - `where(<function>)`: columns where `<function>` returns `TRUE`, - `:` range, `c` concatenate, `!` negate, `&` intersect - ...and a few others. --- ## select delay columns ``` r flights %>% select(ends_with('delay')) ``` ``` ## # A tibble: 336,776 × 2 ## dep_delay arr_delay ## <dbl> <dbl> ## 1 2 11 ## 2 4 20 ## 3 2 33 ## 4 -1 -18 ## # ℹ 336,772 more rows ``` --- ## everything except columns containing "time" ``` r flights %>% select(!contains('time')) ``` ``` ## # A tibble: 336,776 × 13 ## year month day dep_delay arr_delay carrier flight tailnum ## <int> <int> <int> <dbl> <dbl> <chr> <int> <chr> ## 1 2013 1 1 2 11 UA 1545 N14228 ## 2 2013 1 1 4 20 UA 1714 N24211 ## 3 2013 1 1 2 33 AA 1141 N619AA ## 4 2013 1 1 -1 -18 B6 725 N804JB ## # ℹ 336,772 more rows ## # ℹ 5 more variables: origin <chr>, dest <chr>, distance <dbl>, ## # hour <dbl>, minute <dbl> ``` --- ## character columns ``` r flights %>% select(where(is.character)) ``` ``` ## # A tibble: 336,776 × 4 ## carrier tailnum origin dest ## <chr> <chr> <chr> <chr> ## 1 UA N14228 EWR IAH ## 2 UA N24211 LGA IAH ## 3 AA N619AA JFK MIA ## 4 B6 N804JB JFK BQN ## # ℹ 336,772 more rows ``` --- ## Columns with no missing values ``` r flights %>% select(where(~!any(is.na(.x)))) ``` ``` ## # A tibble: 336,776 × 13 ## year month day sched_dep_time sched_arr_time carrier flight ## <int> <int> <int> <int> <int> <chr> <int> ## 1 2013 1 1 515 819 UA 1545 ## 2 2013 1 1 529 830 UA 1714 ## 3 2013 1 1 540 850 AA 1141 ## 4 2013 1 1 545 1022 B6 725 ## # ℹ 336,772 more rows ## # ℹ 6 more variables: origin <chr>, dest <chr>, distance <dbl>, ## # hour <dbl>, minute <dbl>, time_hour <dttm> ``` The `~!any(is.na(.x))` above uses the tidyverse shorthand for defining an **anonymous function**, and is equivalent to ``` none_na = function(x) !any(is.na(x)) ``` --- class: middle .hand[# Repeating an operation with `across`] --- ## `across()` If you need to repeat the same operation on multiple columns, you can avoid many keystrokes by using `across()`. Its syntax is a bit idiosyncratic, but in general it looks something like ```r verb(across(.cols = <tidy selection>, .fns = list(<fun1>, <fun2>, ...))) ``` where `verb` is `summarize()`, `mutate()`, or one of the few other verbs that take unquoted columns to transform. 
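For example, here's a quick sketch on the `flights` data loaded earlier (the grouping by `origin` and the choice of the delay columns are just for illustration). A named list of functions yields one output column per variable-function pair, e.g. `dep_delay_mean`:

```r
flights %>%
  group_by(origin) %>%
  summarize(across(ends_with("delay"),
                   list(mean = ~ mean(.x, na.rm = TRUE),
                        sd   = ~ sd(.x, na.rm = TRUE))))
```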
--- ## NYC AirBNB data ``` r nycbnb ``` ``` ## # A tibble: 12,773 × 16 ## id name host_id price neighborhood accommodates room_type ## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <chr> ## 1 2595 Skylit… 2845 150 Midtown 1 Entire h… ## 2 3831 Whole … 4869 75 Bedford-Stu… 3 Entire h… ## 3 5121 BlissA… 7356 60 Bedford-Stu… 2 Private … ## 4 5136 Spacio… 7378 275 Sunset Park 4 Entire h… ## # ℹ 12,769 more rows ## # ℹ 9 more variables: borough <chr>, minimum_nights <dbl>, ## # maximum_nights <dbl>, availability_90 <dbl>, beds <dbl>, … ``` --- ## Nights .panelset[ .panel[.panel-name[Code] ``` r availability = nycbnb %>% group_by(neighborhood) %>% * summarize(across(c(maximum_nights, minimum_nights, availability_90), median)) ggplot(availability, aes(x = neighborhood, ymin = minimum_nights, y = availability_90, ymax = maximum_nights)) + geom_pointrange() + coord_flip() + theme_minimal() + labs(x = NULL, y = 'Availability (minimum, 90 day, maximum)') ``` ] .panel[.panel-name[Plot] <img src="l05b-advanced-dplyr_files/figure-html/unnamed-chunk-22-1.png" width="75%" style="display: block; margin: auto;" /> ] ] --- ## Transform by variable type This can be very powerful: ``` r median_mad = nycbnb %>% group_by(neighborhood) %>% * summarise(across(where(is.numeric), * list(median = median, mad = mad))) glimpse(median_mad) ``` ``` ## Rows: 41 ## Columns: 23 ## $ neighborhood <chr> "Astoria", "Bedford-Stuyve… ## $ id_median <dbl> 26455688, 29970716, 337258… ## $ id_mad <dbl> 22196285, 22187834, 188431… ## $ host_id_median <dbl> 48372942, 39365483, 502829… ## $ host_id_mad <dbl> 66969215, 56397777, 705862… ## $ price_median <dbl> 75.0, 100.0, 75.0, 98.0, 2… ## $ price_mad <dbl> 35.5824, 66.7170, 44.4780,… ## $ accommodates_median <dbl> 2.0, 2.0, 2.0, 3.0, 2.0, 3… ## $ accommodates_mad <dbl> 1.4826, 1.4826, 1.4826, 1.… ## $ minimum_nights_median <dbl> 14.0, 5.0, 5.0, 2.0, 5.0, … ## $ minimum_nights_mad <dbl> 19.2738, 5.9304, 5.9304, 1… ## $ maximum_nights_median <dbl> 1125, 365, 365, 365, 365, … ## $ maximum_nights_mad <dbl> 0.0000, 526.3230, 526.3230… ## $ availability_90_median <dbl> 50.0, 38.0, 47.0, 70.0, 25… ## $ availability_90_mad <dbl> 41.5128, 40.0302, 40.0302,… ## $ beds_median <dbl> NA, NA, NA, NA, NA, 1, NA,… ## $ beds_mad <dbl> NA, NA, NA, NA, NA, 1.4826… ## $ review_scores_rating_median <dbl> 4.830, 4.800, 4.800, 4.810… ## $ review_scores_rating_mad <dbl> 0.252042, 0.266868, 0.2965… ## $ number_of_reviews_median <dbl> 17.0, 20.0, 13.0, 17.0, 13… ## $ number_of_reviews_mad <dbl> 22.2390, 26.6868, 17.7912,… ## $ number_of_reviews_ltm_median <dbl> 2.0, 2.0, 2.0, 4.0, 3.0, 3… ## $ number_of_reviews_ltm_mad <dbl> 2.9652, 2.9652, 2.9652, 5.… ```
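---

## Controlling `across()` output

A few refinements to the last example, sketched below: drop the `id` and `host_id` columns from the selection with `!`, since they aren't meaningful to summarize; use anonymous functions to pass `na.rm = TRUE`, so missing `beds` values don't force those summaries to `NA`; and use the `.names` glue specification to control how the output columns are named (the `"{.fn}_of_{.col}"` pattern is just one possible choice).

```r
nycbnb %>%
  group_by(neighborhood) %>%
  summarise(across(where(is.numeric) & !c(id, host_id),
                   list(median = ~ median(.x, na.rm = TRUE),
                        mad    = ~ mad(.x, na.rm = TRUE)),
                   # default naming is "{.col}_{.fn}"; this puts the function first
                   .names = "{.fn}_of_{.col}"))
```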