BST430 Lecture 5a

.title[
# BST430 Lecture 5a
]
.subtitle[
## Reshaping and combining data frames
]
.author[
### Tanzy Love, based on the course by Andrew McDavid
]
.institute[
### U of Rochester
]
.date[
### 2021-09-06 (updated: 2024-09-16 by TL)
]

---

## Agenda

1.  Reshaping a single data frame
2.  Joining multiple data frames

---
class: middle

# .hand[We...]

---
class: code50
## Data: Sales

Here's the [R code in this lecture](l05a/l05a-reshape-merge.R)

Here's the [customers](l05a/data/sales/customers.csv),
 [prices](l05a/data/sales/prices.csv),
 [biden](l05a/data/trump/biden.csv),
 [professions](l05a/data/scientists/professions.csv),
 [dates](l05a/data/scientists/dates.csv),
 [works](l05a/data/scientists/works.csv),
 [enrollment](l05a/data/students/enrolment.csv),
 [survey](l05a/data/students/survey.csv)

*You have to have a good filepath to the dataset*

```
## # A tibble: 2 × 4
##   cust_id item_1 item_2       item_3
##     <dbl> <chr>  <chr>        <chr> 
## 1       1 bread  milk         banana
## 2       2 milk   toilet paper <NA>
```
]]

--
.pull-right[
### .pink[We want...]

```
## # A tibble: 6 × 3
##   cust_id item_no item        
##     <dbl> <chr>   <chr>       
## 1       1 item_1  bread       
## 2       1 item_2  milk        
## 3       1 item_3  banana      
## 4       2 item_1  milk        
## 5       2 item_2  toilet paper
## 6       2 item_3  <NA>
```
]

---

## A grammar of data tidying

.pull-left[
<img src="l05a/img/tidyr-part-of-tidyverse.png" width="60%" style="display: block; margin: auto;" />
]
.pull-right[
The goal of tidyr is to help you tidy your data via

- **pivoting** between wide and long data
- splitting and combining character columns
- **nesting** and unnesting columns
- clarifying how `NA`s should be treated
]

---

# Pivoting data

---

## Not this...

---

## but this!

.center[
<img src="l05a/img/tidyr-longer-wider.gif" width="45%" style="background-color: #FDF6E3" style="display: block; margin: auto;" />
]

---

## Wider vs. longer

```
## # A tibble: 2 × 4
##   cust_id item_1 item_2       item_3
##     <dbl> <chr>  <chr>        <chr> 
## 1       1 bread  milk         banana
## 2       2 milk   toilet paper <NA>
```
]

--
.pull-right[
### .pink[longer]
more rows

---

## `pivot_longer()`

``` r
pivot_longer(
* data,
  cols, 
  names_to = "name", 
  values_to = "value"
  )
```
]

---

## `pivot_longer()`

``` r
pivot_longer(
  data, 
* cols,
  names_to = "name", 
  values_to = "value"
  )
```
]

---

## `pivot_longer()`

.pull-left[
- `data` (as usual)
- `cols`: columns to pivot into longer format 
- `names_to`: name of the column where column names of pivoted variables go (character string)
]
.pull-right[

``` r
pivot_longer(
  data, 
  cols, 
* names_to = "name",
  values_to = "value"
  )
```
]

---

## `pivot_longer()`

.pull-left[
- `data` (as usual)
- `cols`: columns to pivot into longer format 
- `names_to`: name of the column where column names of pivoted variables go (character string)
- `values_to`: name of the column where data in pivoted variables go (character string)
]
.pull-right[

``` r
pivot_longer(
  data, 
  cols, 
  names_to = "name", 
* values_to = "value"
  )
```
]

---

## Customers `$\rightarrow$` purchases

``` r
purchases = customers %>%
* pivot_longer(
*   cols = item_1:item_3,  # variables item_1 to item_3
*   names_to = "item_no",  # column names -> new column called item_no
*   values_to = "item"     # values in columns -> new column called item
*   )

purchases
```

---
class: code50

## Why pivot?

Most likely, because the next step of your analysis needs it

``` r
prices
```

```
## # A tibble: 5 × 2
##   item         price
##   <chr>        <dbl>
## 1 avocado       0.5 
## 2 banana        0.15
## 3 bread         1   
## 4 milk          0.8 
## 5 toilet paper  3
```
]
.pull-right[

``` r
purchases %>%
* left_join(prices)
```

```
## # A tibble: 6 × 4
##   cust_id item_no item         price
##     <dbl> <chr>   <chr>        <dbl>
## 1       1 item_1  bread         1   
## 2       1 item_2  milk          0.8 
## 3       1 item_3  banana        0.15
## 4       2 item_1  milk          0.8 
## 5       2 item_2  toilet paper  3   
## 6       2 item_3  <NA>         NA
```
]

---

## Purchases `$\rightarrow$` customers

.pull-left-narrow[
- `data` (as usual)
- `names_from`:  column in the long data pointing to the column names for the wide format
- `values_from`: column in the long data containing the values in the new columns for the wide format
]
.pull-right-wide[

``` r
purchases %>%
* pivot_wider(
*   names_from = item_no,
*   values_from = item
* )
```

```
## # A tibble: 2 × 4
##   cust_id item_1 item_2       item_3
##     <dbl> <chr>  <chr>        <chr> 
## 1       1 bread  milk         banana
## 2       2 milk   toilet paper <NA>
```
]

---

# Case study: Approval rating of Joseph Biden

---

.footnote[
Source: [FiveThirtyEight](https://projects.fivethirtyeight.com/biden-approval-rating/adults/)
]

---

## Data

``` r
biden
```

```
## # A tibble: 835 × 4
##   subgroup date       approval disapproval
##   <chr>    <date>        <dbl>       <dbl>
## 1 Adults   2021-01-22     49.5        30.3
## 2 Adults   2021-01-21     53.6        32.1
## 3 Adults   2021-01-21     46.3        27.6
## 4 Adults   2021-01-23     50.5        31.3
## # ℹ 831 more rows
```

``` r
summary(biden)
```

```
##    subgroup              date               approval    
##  Length:835         Min.   :2021-01-21   Min.   :42.36  
##  Class :character   1st Qu.:2021-03-18   1st Qu.:50.74  
##  Mode  :character   Median :2021-05-10   Median :52.50  
##                     Mean   :2021-05-13   Mean   :52.36  
##                     3rd Qu.:2021-07-10   3rd Qu.:53.74  
##                     Max.   :2021-09-05   Max.   :63.34  
##   disapproval   
##  Min.   :27.55  
##  1st Qu.:38.28  
##  Median :41.27  
##  Mean   :40.94  
##  3rd Qu.:43.27  
##  Max.   :52.83
```

---

## Goal

.pull-left-wide[
<img src="l05a-reshape-merge_files/figure-html/unnamed-chunk-20-1.png" width="100%" style="display: block; margin: auto;" />
]
--
.pull-right-narrow[
**Aesthetic mappings:**  
✅  x = `date`  
❌      y = `rating_value`  
❌      color = `rating_type`

**Facet:**  
✅  `subgroup` (Adults and Voters)
]

---

## Pivot

``` r
biden_longer = biden %>%
  pivot_longer(
    cols = c(approval, disapproval),
    names_to = "rating_type",
    values_to = "rating_value"
  )

biden_longer
```

```
## # A tibble: 1,670 × 4
##   subgroup date       rating_type rating_value
##   <chr>    <date>     <chr>              <dbl>
## 1 Adults   2021-01-22 approval            49.5
## 2 Adults   2021-01-22 disapproval         30.3
## 3 Adults   2021-01-21 approval            53.6
## 4 Adults   2021-01-21 disapproval         32.1
## # ℹ 1,666 more rows
```

---

## Plot

``` r
ggplot(biden_longer, 
       aes(x = date, y = rating_value, color = rating_type, group = rating_type)) +
  geom_line() +
  facet_wrap(~ subgroup)
```

---

``` r
ggplot(biden_longer, 
       aes(x = date, y = rating_value, 
           color = rating_type, group = rating_type)) +
  geom_line() +
  facet_wrap(~ subgroup) +
* scale_color_manual(values = c("darkgreen", "orange")) +
* labs(
*   x = "Date", y = "Rating",
*   color = NULL,
*   title = "How (un)popular is Joseph Biden?",
*   subtitle = "Estimates based on polls of all adults and polls of likely/registered voters",
*   caption = "Source: FiveThirtyEight modeling estimates"
* )
```
]

.panel[.panel-name[Plot]
<img src="l05a-reshape-merge_files/figure-html/unnamed-chunk-23-1.png" width="75%" style="display: block; margin: auto;" />
]

]

---

``` r
ggplot(biden_longer, 
       aes(x = date, y = rating_value, 
           color = rating_type, group = rating_type)) +
  geom_line() +
  facet_wrap(~ subgroup) +
  scale_color_manual(values = c("darkgreen", "orange")) + 
  labs( 
    x = "Date", y = "Rating", 
    color = NULL, 
    title = "How (un)popular is Joseph Biden?", 
    subtitle = "Estimates based on polls of all adults and polls of likely/registered voters", 
    caption = "Source: FiveThirtyEight modeling estimates" 
  ) + 
* theme_minimal() +
* theme(legend.position = "bottom")
```
]

.panel[.panel-name[Plot]
<img src="l05a-reshape-merge_files/figure-html/unnamed-chunk-24-1.png" width="75%" style="display: block; margin: auto;" />
]

]

---
class: font120, code50
## Let's try

1. Suppose you have data (available with `library(tidyr)`) on TB incidence as follows:

``` r
table2
```

```
## # A tibble: 12 × 4
##   country      year type          count
##   <chr>       <dbl> <chr>         <dbl>
## 1 Afghanistan  1999 cases           745
## 2 Afghanistan  1999 population 19987071
## 3 Afghanistan  2000 cases          2666
## 4 Afghanistan  2000 population 20595360
## # ℹ 8 more rows
```
Reshape this so that for each year and country, cases and population appear in separate columns.
2. Consider this table containing cases by year:

``` r
table4a
```

```
## # A tibble: 3 × 3
##   country     `1999` `2000`
##   <chr>        <dbl>  <dbl>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
```
Reshape this so that you can plot the year vs the case count with ggplot.

---

## Other tidyr functionality

* `nest()` and `unnest()` let you embed or expand list columns in a data frame.  .alert[This can be extremely powerful.] But it best illustrated with an example...in a couple weeks.
*  `fill()` imputes missing values by carrying observations down or up in a data frame
*  `complete()` makes implicit missing combinations of variables explicit.  Sometimes useful when building contingency tables.
*  `separate()` and `separate_rows()` can fix some issues with multiple values packed into the same cell of a table.

---

## `fill()`

``` r
sales
```

```
## # A tibble: 16 × 3
##    quarter  year sales
##    <chr>   <dbl> <dbl>
##  1 Q1       2000 66013
##  2 Q2         NA 69182
##  3 Q3         NA 53175
##  4 Q4         NA 21001
##  5 Q1       2001 46036
##  6 Q2         NA 58842
##  7 Q3         NA 44568
##  8 Q4         NA 50197
##  9 Q1       2002 39113
## 10 Q2         NA 41668
## 11 Q3         NA 30144
## 12 Q4         NA 52897
## # ℹ 4 more rows
```
]
.pull-right[

``` r
sales %>% fill(year)
```

```
## # A tibble: 16 × 3
##    quarter  year sales
##    <chr>   <dbl> <dbl>
##  1 Q1       2000 66013
##  2 Q2       2000 69182
##  3 Q3       2000 53175
##  4 Q4       2000 21001
##  5 Q1       2001 46036
##  6 Q2       2001 58842
##  7 Q3       2001 44568
##  8 Q4       2001 50197
##  9 Q1       2002 39113
## 10 Q2       2002 41668
## 11 Q3       2002 30144
## 12 Q4       2002 52897
## # ℹ 4 more rows
```
]

---

## `complete()`

``` r
df
```

```
## # A tibble: 3 × 3
##   group item_id value
##   <dbl>   <dbl> <int>
## 1     1       1     1
## 2     2       2     2
## 3     1       2     3
```
]
.pull-right[

``` r
df %>% complete(group,item_id)
```

```
## # A tibble: 4 × 3
##   group item_id value
##   <dbl>   <dbl> <int>
## 1     1       1     1
## 2     1       2     3
## 3     2       1    NA
## 4     2       2     2
```
]

---

## `separate_rows()`

``` r
df
```

```
## # A tibble: 3 × 3
##       x y     z    
##   <int> <chr> <chr>
## 1     1 a     1    
## 2     2 d,e,f 2,3,4
## 3     3 g,h   5,6
```
]
.pull-right[

``` r
separate_rows(df, y, z, convert = TRUE)
```

```
## # A tibble: 6 × 3
##       x y         z
##   <int> <chr> <int>
## 1     1 a         1
## 2     2 d         2
## 3     2 e         3
## 4     2 f         4
## 5     3 g         5
## 6     3 h         6
```
]

---
class: code70

## `separate()`

``` r
df
```

```
## # A tibble: 3 × 3
##       x y     z    
##   <int> <chr> <chr>
## 1     1 a     1    
## 2     2 d,e,f 2,3,4
## 3     3 g,h   5,6
```
]

``` r
separate(df, y, c('y1', 'y2', 'y3'))
```

```
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2 rows [1,
## 3].
```

```
## # A tibble: 3 × 5
##       x y1    y2    y3    z    
##   <int> <chr> <chr> <chr> <chr>
## 1     1 a     <NA>  <NA>  1    
## 2     2 d     e     f     2,3,4
## 3     3 g     h     <NA>  5,6
```
]

---

# Merging data frames

---

# .hand[We...]

---

## Data: Women in science

Information on 10 women in science who changed the world

|name               |
|:------------------|
|Ada Lovelace       |
|Marie Curie        |
|Janaki Ammal       |
|Chien-Shiung Wu    |
|Katherine Johnson  |
|Rosalind Franklin  |
|Vera Rubin         |
|Gladys West        |
|Flossie Wong-Staal |
|Jennifer Doudna    |
]

.footnote[
Source: [Discover Magazine](https://www.discovermagazine.com/the-sciences/meet-10-women-in-science-who-changed-the-world)
]

---

## Inputs

``` r
professions
```

```
## # A tibble: 10 × 2
##    name               profession                        
##    <chr>              <chr>                             
##  1 Ada Lovelace       Mathematician                     
##  2 Marie Curie        Physicist and Chemist             
##  3 Janaki Ammal       Botanist                          
##  4 Chien-Shiung Wu    Physicist                         
##  5 Katherine Johnson  Mathematician                     
##  6 Rosalind Franklin  Chemist                           
##  7 Vera Rubin         Astronomer                        
##  8 Gladys West        Mathematician                     
##  9 Flossie Wong-Staal Virologist and Molecular Biologist
## 10 Jennifer Doudna    Biochemist
```
]

``` r
dates
```

```
## # A tibble: 8 × 3
##   name               birth_year death_year
##   <chr>                   <dbl>      <dbl>
## 1 Janaki Ammal             1897       1984
## 2 Chien-Shiung Wu          1912       1997
## 3 Katherine Johnson        1918       2020
## 4 Rosalind Franklin        1920       1958
## 5 Vera Rubin               1928       2016
## 6 Gladys West              1930         NA
## 7 Flossie Wong-Staal       1947         NA
## 8 Jennifer Doudna          1964         NA
```
]

``` r
works
```

```
## # A tibble: 9 × 2
##   name               known_for                                   
##   <chr>              <chr>                                       
## 1 Ada Lovelace       first computer algorithm                    
## 2 Marie Curie        theory of radioactivity,  discovery of elem…
## 3 Janaki Ammal       hybrid species, biodiversity protection     
## 4 Chien-Shiung Wu    confim and refine theory of radioactive bet…
## 5 Katherine Johnson  calculations of orbital mechanics critical …
## 6 Vera Rubin         existence of dark matter                    
## 7 Gladys West        mathematical modeling of the shape of the E…
## 8 Flossie Wong-Staal first scientist to clone HIV and create a m…
## 9 Jennifer Doudna    one of the primary developers of CRISPR, a …
```
]

]

---

## Desired output

```
## # A tibble: 10 × 5
##    name               profession  birth_year death_year known_for
##    <chr>              <chr>            <dbl>      <dbl> <chr>    
##  1 Ada Lovelace       Mathematic…         NA         NA first co…
##  2 Marie Curie        Physicist …         NA         NA theory o…
##  3 Janaki Ammal       Botanist          1897       1984 hybrid s…
##  4 Chien-Shiung Wu    Physicist         1912       1997 confim a…
##  5 Katherine Johnson  Mathematic…       1918       2020 calculat…
##  6 Rosalind Franklin  Chemist           1920       1958 <NA>     
##  7 Vera Rubin         Astronomer        1928       2016 existenc…
##  8 Gladys West        Mathematic…       1930         NA mathemat…
##  9 Flossie Wong-Staal Virologist…       1947         NA first sc…
## 10 Jennifer Doudna    Biochemist        1964         NA one of t…
```

---

## Inputs, reminder

``` r
names(professions)
```

```
## [1] "name"       "profession"
```

``` r
names(dates)
```

```
## [1] "name"       "birth_year" "death_year"
```

``` r
names(works)
```

```
## [1] "name"      "known_for"
```
]
.pull-right[

``` r
nrow(professions)
```

```
## [1] 10
```

``` r
nrow(dates)
```

```
## [1] 8
```

``` r
nrow(works)
```

```
## [1] 9
```
]

---

# Joining data frames

---

## Joining data frames

``` r
something_join(x, y)
```

- `left_join()`: all rows from x
- `right_join()`: all rows from y
- `full_join()`: all rows from both x and y
- `semi_join()`: all rows from x where there are matching values in y, keeping just columns from x
- `inner_join()`: all rows from x where there are matching values in y, return 
all combination of multiple matches in the case of multiple matches
- `anti_join()`: return all rows from x where there are not matching values in y, never duplicate rows of x
- ...
 
---

## Setup

For the next few slides...

``` r
x
```

```
## # A tibble: 3 × 2
##      id value_x
##   <dbl> <chr>  
## 1     1 x1     
## 2     2 x2     
## 3     3 x3
```
]
.pull-right[

``` r
y
```

```
## # A tibble: 3 × 2
##      id value_y
##   <dbl> <chr>  
## 1     1 y1     
## 2     2 y2     
## 3     4 y4
```
]

---

## `left_join()`

.pull-left[
<img src="l05a/img/left-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" />
]
.pull-right[

``` r
left_join(x, y)
```

```
## # A tibble: 3 × 3
##      id value_x value_y
##   <dbl> <chr>   <chr>  
## 1     1 x1      y1     
## 2     2 x2      y2     
## 3     3 x3      <NA>
```
]

---

## `left_join()`

``` r
professions %>%
* left_join(dates)
```

```
## # A tibble: 10 × 4
##    name               profession            birth_year death_year
##    <chr>              <chr>                      <dbl>      <dbl>
##  1 Ada Lovelace       Mathematician                 NA         NA
##  2 Marie Curie        Physicist and Chemist         NA         NA
##  3 Janaki Ammal       Botanist                    1897       1984
##  4 Chien-Shiung Wu    Physicist                   1912       1997
##  5 Katherine Johnson  Mathematician               1918       2020
##  6 Rosalind Franklin  Chemist                     1920       1958
##  7 Vera Rubin         Astronomer                  1928       2016
##  8 Gladys West        Mathematician               1930         NA
##  9 Flossie Wong-Staal Virologist and Molec…       1947         NA
## 10 Jennifer Doudna    Biochemist                  1964         NA
```

---

## `right_join()`

.pull-left[
<img src="l05a/img/right-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" />
]
.pull-right[

``` r
right_join(x, y)
```

```
## # A tibble: 3 × 3
##      id value_x value_y
##   <dbl> <chr>   <chr>  
## 1     1 x1      y1     
## 2     2 x2      y2     
## 3     4 <NA>    y4
```
]

---

## `right_join()`

``` r
professions %>%
* right_join(dates)
```

```
## # A tibble: 8 × 4
##   name               profession             birth_year death_year
##   <chr>              <chr>                       <dbl>      <dbl>
## 1 Janaki Ammal       Botanist                     1897       1984
## 2 Chien-Shiung Wu    Physicist                    1912       1997
## 3 Katherine Johnson  Mathematician                1918       2020
## 4 Rosalind Franklin  Chemist                      1920       1958
## 5 Vera Rubin         Astronomer                   1928       2016
## 6 Gladys West        Mathematician                1930         NA
## 7 Flossie Wong-Staal Virologist and Molecu…       1947         NA
## 8 Jennifer Doudna    Biochemist                   1964         NA
```

---

## `full_join()`

.pull-left[
<img src="l05a/img/full-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" />
]
.pull-right[

``` r
full_join(x, y)
```

```
## # A tibble: 4 × 3
##      id value_x value_y
##   <dbl> <chr>   <chr>  
## 1     1 x1      y1     
## 2     2 x2      y2     
## 3     3 x3      <NA>   
## 4     4 <NA>    y4
```
]

---

## `full_join()`

``` r
dates %>%
* full_join(works)
```

```
## # A tibble: 10 × 4
##    name               birth_year death_year known_for            
##    <chr>                   <dbl>      <dbl> <chr>                
##  1 Janaki Ammal             1897       1984 hybrid species, biod…
##  2 Chien-Shiung Wu          1912       1997 confim and refine th…
##  3 Katherine Johnson        1918       2020 calculations of orbi…
##  4 Rosalind Franklin        1920       1958 <NA>                 
##  5 Vera Rubin               1928       2016 existence of dark ma…
##  6 Gladys West              1930         NA mathematical modelin…
##  7 Flossie Wong-Staal       1947         NA first scientist to c…
##  8 Jennifer Doudna          1964         NA one of the primary d…
##  9 Ada Lovelace               NA         NA first computer algor…
## 10 Marie Curie                NA         NA theory of radioactiv…
```

---

## `inner_join()`

.pull-left[
<img src="l05a/img/inner-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" />
]
.pull-right[

``` r
inner_join(x, y)
```

```
## # A tibble: 2 × 3
##      id value_x value_y
##   <dbl> <chr>   <chr>  
## 1     1 x1      y1     
## 2     2 x2      y2
```
]

---

## `inner_join()`

``` r
dates %>%
* inner_join(works)
```

```
## # A tibble: 7 × 4
##   name               birth_year death_year known_for             
##   <chr>                   <dbl>      <dbl> <chr>                 
## 1 Janaki Ammal             1897       1984 hybrid species, biodi…
## 2 Chien-Shiung Wu          1912       1997 confim and refine the…
## 3 Katherine Johnson        1918       2020 calculations of orbit…
## 4 Vera Rubin               1928       2016 existence of dark mat…
## 5 Gladys West              1930         NA mathematical modeling…
## 6 Flossie Wong-Staal       1947         NA first scientist to cl…
## 7 Jennifer Doudna          1964         NA one of the primary de…
```

---

## `semi_join()`

.pull-left[
<img src="l05a/img/semi-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" />
]
.pull-right[

``` r
semi_join(x, y)
```

```
## # A tibble: 2 × 2
##      id value_x
##   <dbl> <chr>  
## 1     1 x1     
## 2     2 x2
```
]

---

## `semi_join()`

``` r
dates %>%
* semi_join(works)
```

```
## # A tibble: 7 × 3
##   name               birth_year death_year
##   <chr>                   <dbl>      <dbl>
## 1 Janaki Ammal             1897       1984
## 2 Chien-Shiung Wu          1912       1997
## 3 Katherine Johnson        1918       2020
## 4 Vera Rubin               1928       2016
## 5 Gladys West              1930         NA
## 6 Flossie Wong-Staal       1947         NA
## 7 Jennifer Doudna          1964         NA
```

---

## `anti_join()`

.pull-left[
<img src="l05a/img/anti-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" />
]
.pull-right[

``` r
anti_join(x, y)
```

```
## # A tibble: 1 × 2
##      id value_x
##   <dbl> <chr>  
## 1     3 x3
```
]

---

## `anti_join()`

``` r
dates %>%
* anti_join(works)
```

```
## # A tibble: 1 × 3
##   name              birth_year death_year
##   <chr>                  <dbl>      <dbl>
## 1 Rosalind Franklin       1920       1958
```

---

## Putting it altogether

``` r
professions %>%
  left_join(dates) %>%
  left_join(works)
```

---

# Case study: Student records

---

## Student records

- Have:
  - Enrollment: university records
  - Survey: Student-provided. Missing students who never filled it out. Includes students who dropped the class
- Want: Survey info for all enrolled in class

``` r
enrollment
```

```
## # A tibble: 3 × 2
##      id name           
##   <dbl> <chr>          
## 1     1 Dave Friday    
## 2     2 Hermine        
## 3     3 Sura Selvarajah
```
]
.pull-right[

``` r
survey
```

```
## # A tibble: 4 × 3
##      id name    username            
##   <dbl> <chr>   <chr>               
## 1     2 Hermine bakealongwithhermine
## 2     3 Sura    surasbakes          
## 3     4 Peter   peter_bakes         
## 4     5 Mark    thebakingbuddha
```
]

---

## Student records

``` r
enrollment %>% 
* left_join(survey, by = "id")
```

```
## # A tibble: 3 × 4
##      id name.x          name.y  username            
##   <dbl> <chr>           <chr>   <chr>               
## 1     1 Dave Friday     <NA>    <NA>                
## 2     2 Hermine         Hermine bakealongwithhermine
## 3     3 Sura Selvarajah Sura    surasbakes
```
]

``` r
enrollment %>% 
* anti_join(survey, by = "id")
```

```
## # A tibble: 1 × 2
##      id name       
##   <dbl> <chr>      
## 1     1 Dave Friday
```
]

``` r
survey %>% 
* anti_join(enrollment, by = "id")
```

```
## # A tibble: 2 × 3
##      id name  username       
##   <dbl> <chr> <chr>          
## 1     4 Peter peter_bakes    
## 2     5 Mark  thebakingbuddha
```
]

]

---
class: code50

## Specify key(s) with `by`

``` r
left_join(enrollment, survey)
```

```
## Joining with `by = join_by(id, name)`
```

```
## # A tibble: 3 × 3
##      id name            username            
##   <dbl> <chr>           <chr>               
## 1     1 Dave Friday     <NA>                
## 2     2 Hermine         bakealongwithhermine
## 3     3 Sura Selvarajah <NA>
```
.question[What happened?]
]

``` r
left_join(enrollment, survey, by = 'id')
```

---

## Specify the joining key(s) with `by`

Can also provide the mapping between keys with 
```r
something_join(x, y, c(xkey1 = 'ykey1', xkey2 = 'ykey2'))
```

``` r
survey_rn = survey %>% rename(survey_id = id, first_name = name)
survey_rn
```

```
## # A tibble: 4 × 3
##   survey_id first_name username            
##       <dbl> <chr>      <chr>               
## 1         2 Hermine    bakealongwithhermine
## 2         3 Sura       surasbakes          
## 3         4 Peter      peter_bakes         
## 4         5 Mark       thebakingbuddha
```

---
class: code50

## Specifying joining keys with `by`

``` r
left_join(enrollment, survey_rn, by = c(id = "survey_id"))
```

```
## # A tibble: 3 × 4
##      id name            first_name username            
##   <dbl> <chr>           <chr>      <chr>               
## 1     1 Dave Friday     <NA>       <NA>                
## 2     2 Hermine         Hermine    bakealongwithhermine
## 3     3 Sura Selvarajah Sura       surasbakes
```
]

``` r
right_join(survey_rn, enrollment, by = c(survey_id = "id"))
```

```
## # A tibble: 3 × 4
##   survey_id first_name username             name           
##       <dbl> <chr>      <chr>                <chr>          
## 1         2 Hermine    bakealongwithhermine Hermine        
## 2         3 Sura       surasbakes           Sura Selvarajah
## 3         1 <NA>       <NA>                 Dave Friday
```
]

# Case study: Grocery sales

---
class: code70
## Grocery sales

- Have:
  - Purchases: One row per customer per item, listing purchases they made
  - Prices: One row per item in the store, listing their prices
- Want: Total revenue

``` r
purchases
```

```
## # A tibble: 5 × 2
##   customer_id item        
##         <dbl> <chr>       
## 1           1 bread       
## 2           1 milk        
## 3           1 banana      
## 4           2 milk        
## 5           2 toilet paper
```
]
.pull-right[

``` r
prices
```

```
## # A tibble: 5 × 2
##   item         price
##   <chr>        <dbl>
## 1 avocado       0.5 
## 2 banana        0.15
## 3 bread         1   
## 4 milk          0.8 
## 5 toilet paper  3
```
]

---
class: code70

## Grocery sales

``` r
purchases %>% 
* left_join(prices)
```

```
## # A tibble: 5 × 3
##   customer_id item         price
##         <dbl> <chr>        <dbl>
## 1           1 bread         1   
## 2           1 milk          0.8 
## 3           1 banana        0.15
## 4           2 milk          0.8 
## 5           2 toilet paper  3
```
]
.pull-right[

``` r
purchases %>% 
  left_join(prices) %>%
* summarise(total_revenue = sum(price))
```

```
## # A tibble: 1 × 1
##   total_revenue
##           <dbl>
## 1          5.75
```
]
]

``` r
purchases %>% 
  left_join(prices)
```

``` r
purchases %>% 
  left_join(prices) %>%
* group_by(customer_id) %>%
  summarise(total_revenue = sum(price))
```

```
## # A tibble: 2 × 2
##   customer_id total_revenue
##         <dbl>         <dbl>
## 1           1          1.95
## 2           2          3.8
```
]

]

---

## `nycflights13` relational data

``` r
airlines
```

```
## # A tibble: 16 × 2
##   carrier name                    
##   <chr>   <chr>                   
## 1 9E      Endeavor Air Inc.       
## 2 AA      American Airlines Inc.  
## 3 AS      Alaska Airlines Inc.    
## 4 B6      JetBlue Airways         
## 5 DL      Delta Air Lines Inc.    
## 6 EV      ExpressJet Airlines Inc.
## # ℹ 10 more rows
```
]
.panel[.panel-name[airports]
Information about each airport, identified by `faa` code:

``` r
airports
```

```
## # A tibble: 1,458 × 8
##   faa   name                    lat   lon   alt    tz dst   tzone
##   <chr> <chr>                 <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 04G   Lansdowne Airport      41.1 -80.6  1044    -5 A     Amer…
## 2 06A   Moton Field Municipa…  32.5 -85.7   264    -6 A     Amer…
## 3 06C   Schaumburg Regional    42.0 -88.1   801    -6 A     Amer…
## 4 06N   Randall Airport        41.4 -74.4   523    -5 A     Amer…
## 5 09J   Jekyll Island Airport  31.1 -81.4    11    -5 A     Amer…
## 6 0A9   Elizabethton Municip…  36.4 -82.2  1593    -5 A     Amer…
## # ℹ 1,452 more rows
```
]
.panel[.panel-name[planes]
information about each plane, identified by its `tailnum`:

``` r
planes
```

```
## # A tibble: 3,322 × 9
##   tailnum  year type       manufacturer model engines seats speed
##   <chr>   <int> <chr>      <chr>        <chr>   <int> <int> <int>
## 1 N10156   2004 Fixed win… EMBRAER      EMB-…       2    55    NA
## 2 N102UW   1998 Fixed win… AIRBUS INDU… A320…       2   182    NA
## 3 N103US   1999 Fixed win… AIRBUS INDU… A320…       2   182    NA
## 4 N104UW   1999 Fixed win… AIRBUS INDU… A320…       2   182    NA
## 5 N10575   2002 Fixed win… EMBRAER      EMB-…       2    55    NA
## 6 N105UW   1999 Fixed win… AIRBUS INDU… A320…       2   182    NA
## # ℹ 3,316 more rows
## # ℹ 1 more variable: engine <chr>
```
]
.panel[.panel-name[weather]
Weather at each NYC airport for each hour:

``` r
weather
```

```
## # A tibble: 26,115 × 15
##   origin  year month   day  hour  temp  dewp humid wind_dir
##   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>
## 1 EWR     2013     1     1     1  39.0  26.1  59.4      270
## 2 EWR     2013     1     1     2  39.0  27.0  61.6      250
## 3 EWR     2013     1     1     3  39.0  28.0  64.4      240
## 4 EWR     2013     1     1     4  39.9  28.0  62.2      250
## 5 EWR     2013     1     1     5  39.0  28.0  64.4      260
## 6 EWR     2013     1     1     6  37.9  28.0  67.2      240
## # ℹ 26,109 more rows
## # ℹ 6 more variables: wind_speed <dbl>, wind_gust <dbl>,
## #   precip <dbl>, pressure <dbl>, visib <dbl>, …
```
]
]
---

## A relational diagram

* `flights` connects to `planes` via a single variable, `tailnum`. 
* `flights` connects to `airports` in two ways: via the `origin` and
  `dest` variables.
* `flights` connects to `weather` via `origin` (the location), and
  `year`, `month`, `day` and `hour` (the time).
]
.pull-right[
![Relational diagram for nycflights13 table](l05a/img/relational-nycflights.png)
]
---

## Let's try

1.  Imagine you wanted to draw (approximately) the route each plane flies from
    its origin to its destination. What variables would you need? What tables
    would you need to combine?

---
## Full solution

[code](l05a/flights_join.R)

---
## Keys

The variables that connect pairs of tables are called __keys__. A key is a variable (or set of variables) that uniquely identifies an observation.

For example, each plane is uniquely identified by its `tailnum`, but identify an observation in `weather` you need five variables: `year`, `month`, `day`, `hour`, and `origin`.

There are two types of keys:

* A __primary key__ uniquely identifies an observation in its own table.
  For example, `planes$tailnum` is a primary key because it uniquely identifies
  each plane in the `planes` table.

* A __foreign key__ uniquely identifies an observation in another table.
  For example, the `flights$tailnum` is a foreign key because it appears in the 
  `flights` table where it matches each flight to a unique plane.

---
class: code70
## Keys

The uniqueness of the keys is an invariant.  Let's test it:

``` r
planes %>% 
  count(tailnum) %>% 
  filter(n > 1)
```

```
## # A tibble: 0 × 2
## # ℹ 2 variables: tailnum <chr>, n <int>
```

``` r
weather %>% 
  count(year, month, day, hour, origin) %>% 
  filter(n > 1)
```

```
## # A tibble: 3 × 6
##    year month   day  hour origin     n
##   <int> <int> <int> <int> <chr>  <int>
## 1  2013    11     3     1 EWR        2
## 2  2013    11     3     1 JFK        2
## 3  2013    11     3     1 LGA        2
```
.question[What happened at 1AM on November 3rd?]
---
class: code70

## Where's the primary key in `flights`?

``` r
flights %>% 
  count(year, month, day, flight) %>% 
  filter(n > 1)
```

```
## # A tibble: 29,768 × 5
##    year month   day flight     n
##   <int> <int> <int>  <int> <int>
## 1  2013     1     1      1     2
## 2  2013     1     1      3     2
## 3  2013     1     1      4     2
## 4  2013     1     1     11     3
## 5  2013     1     1     15     2
## 6  2013     1     1     21     2
## # ℹ 29,762 more rows
```
Nope.
]
.pull-right[
Is he hiding in the plane tailnum?

``` r
flights %>% 
  count(year, month, day, tailnum) %>% 
  filter(n > 1)
```

```
## # A tibble: 64,928 × 5
##    year month   day tailnum     n
##   <int> <int> <int> <chr>   <int>
## 1  2013     1     1 N0EGMQ      2
## 2  2013     1     1 N11189      2
## 3  2013     1     1 N11536      2
## 4  2013     1     1 N11544      3
## 5  2013     1     1 N11551      2
## 6  2013     1     1 N12540      2
## # ℹ 64,922 more rows
```
Not here either.
]

---

## There he is!

``` r
flights %>% 
  count(year, month, day, flight, origin, carrier) %>% 
  filter(n > 1)
```

```
## # A tibble: 0 × 7
## # ℹ 7 variables: year <int>, month <int>, day <int>,
## #   flight <int>, origin <chr>, carrier <chr>, n <int>
```

It always pays to ask (but verify!)

---

## Surrogate keys

If a table lacks a primary key, it's often useful to add one with `mutate()` and `row_number()`. That makes it easier to match observations if you've done some filtering and want to check back in with the original data. This is called a __surrogate key__.

---
# Acknowledgments

Adapted from Data science in a box: 
[1](https://rstudio-education.github.io/datascience-box/course-materials/slides/u2-d09-tidying/u2-d09-tidying.html#1)
[2](https://rstudio-education.github.io/datascience-box/course-materials/slides/u2-d08-multi-df/u2-d08-multi-df.html#1)