Load packages and data

library(tidyverse)
library(vroom)
library(janitor)
library(readxl)
library(ggrepel)
crashes = vroom(
  gsub(pattern = "[ \n]", "", "https://urmc-bst.github.io/bst430-fall2021-site
          /hw_lab_instruction/hw02-accidents/data/ny_collisions_2018_2019.csv.gz"))

Exercises

Exercise 1

crashes = crashes %>% janitor::clean_names() %>% filter(crash_descriptor %in% c("Fatal Accident"))
glimpse(crashes)
## Rows: 1,763
## Columns: 12
## $ year                        <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, ~
## $ crash_descriptor            <chr> "Fatal Accident", "Fatal Accident", "Fatal~
## $ time                        <time> 11:25:00, 06:55:00, 17:06:00, 14:36:00, 1~
## $ date                        <chr> "12/31/2019", "12/31/2019", "12/31/2019", ~
## $ day_of_week                 <chr> "Tuesday", "Tuesday", "Tuesday", "Tuesday"~
## $ police_report               <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y~
## $ collision_type_descriptor   <chr> "LEFT TURN (3)", "OTHER", "OTHER", "RIGHT ~
## $ county_name                 <chr> "NASSAU", "NASSAU", "NEW YORK", "CHAUTAUQU~
## $ weather_conditions          <chr> "Cloudy", "Cloudy", "Clear", "Snow", "Clou~
## $ pedestrian_bicyclist_action <chr> "Not Applicable", "Crossing, No Signal or ~
## $ event_descriptor            <chr> "Other Motor Vehicle, Collision With", "Pe~
## $ number_of_vehicles_involved <dbl> 2, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~

Exercise 2

crashes = crashes %>% 
  mutate(event_descriptor = str_to_lower(event_descriptor)) %>% 
  mutate(is_pedbike = if_else(str_detect(event_descriptor, 'ped|bicy'), 
                              'Pedestrian/Bicycle', 'Other'))

We have 1763 fatal crashes distributed as follows:

crashes %>% count(is_pedbike)
## # A tibble: 2 x 2
##   is_pedbike             n
##   <chr>              <int>
## 1 Other               1156
## 2 Pedestrian/Bicycle   607

Exercise 3

crash_count = crashes %>% 
  mutate(county_name = str_to_title(county_name)) %>% 
  mutate(across(c(county_name, is_pedbike), as.factor)) %>% 
  count(county_name, is_pedbike, .drop = FALSE) %>%
  filter( county_name != 'Unknown') %>% 
  replace_na(list(n = 0))

Coercing to a factor and setting .drop = FALSE, then replacing nas is necessary for counties with zero deaths (generally pedestrian) to be included as zeros–it probably doesn’t affect this plot.

crash_count = crash_count %>% 
   mutate(order_county = fct_reorder(county_name, n),
  county_lump = fct_lump_n(order_county, 20, w = n, other_level = "Rest of NY"))

plt = ggplot(crash_count, aes(y =county_lump, fill = is_pedbike, x = n)) + geom_col() + theme_minimal() +
  labs(x = 'Count', y = 'County', title = 'Fatal Crashes NY, 2018-2019',
       fill = 'Type')
plt

Here I sorted by the total number of fatal crashes – other options are possible but would need a justification. As expected, population plays a large effect, though actually Suffolk and Nassau (Long Island) are worse than Manhattan or the Bronx. Rochester and Buffalo are also in the top 10.

crash_wide = crash_count %>% pivot_wider(names_from = is_pedbike, values_from = n, values_fill = 0)

crash_scatter = ggplot(crash_wide, aes(x = Other, y =`Pedestrian/Bicycle`)) +
  geom_point() + 
  geom_text_repel(aes(label = county_name), max.overlaps = 20)
crash_scatter
## Warning: ggrepel: 47 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Here’s a bonus scatter plot (not part of assignment).

Exercise 4

census = read_csv('data/cc-est2019-agesex-36.csv') %>% 
  janitor::clean_names() %>% 
  filter(year == 10) %>% 
  mutate(popestimate, county_name = str_remove(ctyname, ' County')) %>% 
  select(county_name, popestimate)
## Rows: 744 Columns: 96
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (4): SUMLEV, COUNTY, STNAME, CTYNAME
## dbl (92): STATE, YEAR, POPESTIMATE, POPEST_MALE, POPEST_FEM, UNDER5_TOT, UND...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(census)
## Rows: 62
## Columns: 2
## $ county_name <chr> "Albany", "Allegany", "Bronx", "Broome", "Cattaraugus", "C~
## $ popestimate <dbl> 307717, 46639, 1440625, 193100, 77176, 77457, 128372, 8473~

Data should go into a data directory of some sort. (Will mark down a 1/2 point if it’s just dumped into the root of the project)

Exercise 5

crash_count = left_join(crash_count, census, by = 'county_name') %>% 
  mutate(n_per_100k = n/popestimate*100000)

top_20_popcrash = crash_count %>% 
  group_by(county_name) %>% 
  summarize(total_per_100k = sum(n_per_100k)) %>% #total ped/bik + other per county
  ungroup() %>%  #below we pick out 20 largest (minimum of the negative!!)
  mutate(is_top20 = ifelse(min_rank(-total_per_100k)<=20, county_name, "Rest of NY"))
#there is no "max_rank" function :(

Defining a “rest of NY” variable is a bit tricky here, because we can’t just “sum” the normalized fatalities for the rest of NY, we need to average them, because we are expressing the per-capita risk. To do this, we need to define what groups will be averaged together. First we’ll make an indicator for the top 20 counties – this indicator maps the top 20 county names back to themselves, and the remaining counties to “Rest of NY”. For example:

dim(top_20_popcrash) #one value per county
## [1] 62  3
top_20_popcrash %>% slice_head(n = 7)
## # A tibble: 7 x 3
##   county_name total_per_100k is_top20   
##   <chr>                <dbl> <chr>      
## 1 Albany               10.1  Rest of NY 
## 2 Allegany             12.9  Rest of NY 
## 3 Bronx                 4.30 Rest of NY 
## 4 Broome               10.4  Rest of NY 
## 5 Cattaraugus          22.0  Cattaraugus
## 6 Cayuga               18.1  Cayuga     
## 7 Chautauqua           16.4  Rest of NY
crash_100k = crash_count %>% 
  left_join(top_20_popcrash %>% select(-total_per_100k), by = 'county_name') %>% 
  group_by(is_top20, is_pedbike) %>% 
  summarize(n_per_100k = mean(n_per_100k))
## `summarise()` has grouped output by 'is_top20'. You can override using the
## `.groups` argument.
#here the mean for the top20 counties is just the mean over 1, but for the rest of NY it's the mean over many counties

Now we join onto the top20 table, and group by top20 indicator and is_pedbike, and calculate the average number of crashes. So the top 20 counties are averaged amongst themselves, while other counties are averaged together. You could argue that this should be a weighted average, weighting by county population. As it stands this is more of a geographic average risk rather than a per-capita average risk.

(plt %+% crash_100k) + 
  aes(y = fct_reorder(is_top20, n_per_100k), x = n_per_100k) +
  labs(y = 'County', x = "Fatalities per 100k population")

Discussion of plot

Clearly population did play a large effect – now Hamilton county (in the Adirondacks) is #1, along with several other rural counties. These counties often do not have very many pedestrian fatalities compared to the number of other fatalities.

crash_wide_100k = crash_count %>% pivot_wider(id_cols = county_name, names_from = is_pedbike, values_from = n_per_100k, values_fill = 0)

crash_scatter %+% crash_wide_100k

Bonus scatter plot (not assigned)

Exercise 6

vmt_metro = readxl::read_excel('data/THT_Data_508.xlsx', sheet = 3, na = c('[no data]')) %>% 
  janitor::clean_names()
glimpse(vmt_metro)
## Rows: 488
## Columns: 5
## $ urbanized_area                              <chr> "Aberdeen-Bel Air South-Be~
## $ transit_trips_per_capita_raw_value          <dbl> 1.525607, 4.790755, 12.030~
## $ transit_trips_per_capita_score              <dbl> 5.992854, 30.734933, 63.42~
## $ vehicle_miles_traveled_per_capita_raw_value <dbl> 18.99530, 22.57412, 25.296~
## $ vehicle_miles_traveled_per_capita_score     <dbl> 63.930110, 45.026700, 32.9~

Setting na appropriately here means that the numeric columns will be loaded correctly – otherwise we would need to explicitly convert them into numeric variables.

Exercise 7

metro_areas = tribble( ~urbanized_area, ~county_name,
                    "New York-Newark, NY-NJ-CT", c("Queens", "Kings", "New York", "Bronx", "Richmond"),
                     "Rochester, NY", c("Monroe"),
                     "Buffalo, NY", c("Erie"),
                    "Albany-Schenectady, NY", 'Albany',
                    "Binghamton, NY-PA", "Broome",
                    "Elmira, NY", "Chemung",
                    "Glens Falls, NY", "Warren",
                    "Ithaca, NY", "Tompkins",
                    "Kingston, NY", "Ulster",
                    "Poughkeepsie-Newburgh, NY-NJ", "Dutchess",
                    "Saratoga Springs, NY", "Saratoga",
                    "Syracuse, NY", 'Onondaga',
                    "Utica, NY", "Oneida") %>%
                     unnest(cols = c(county_name))

Here I define the cross walk with tibble (actually at first using a list column and unnesting to save some typing). For the students, it is likely more efficient to take the table expressed in the assignment and paste it into an excel sheet, then write it as a .csv file.

metro_areas = read.csv("data/crosswalk.csv")

There may be other hacky answers here – if it involves torturing the vmt_metro table with magic numbers, etc, we will mark off (magic numbers are poor coding style).

vmt_metro_sub = vmt_metro %>% 
  right_join(metro_areas) %>% 
  select(vmt = vehicle_miles_traveled_per_capita_raw_value, county_name) %>% 
  mutate(vmt = as.numeric(vmt))
## Joining, by = "urbanized_area"
glimpse(vmt_metro_sub)
## Rows: 17
## Columns: 2
## $ vmt         <dbl> 26.59938, 30.45939, 22.13561, 22.39429, 31.37578, 18.50067~
## $ county_name <chr> "Albany", "Broome", "Erie", "Chemung", "Warren", "Tompkins~

Exercise 8

adjusted = crash_count %>% 
  left_join(vmt_metro_sub) %>%  
  mutate(n_per_100k_vmt = n_per_100k/vmt)
## Joining, by = "county_name"
(plt %+% filter(adjusted, !is.na(vmt))) + 
  aes(y = fct_reorder(county_name,n_per_100k_vmt), x = n_per_100k_vmt) + 
  labs(x = "Fatalities per 100k miles driven", y = 'County', 
       main = glue::glue('Fatal crashes in {nn} NY Counties, 2018-2019', nn = nrow(adjusted)))

Discussion of plot

This plot isn’t directly comparable to the previous because we only have 17 counties and some of the rural counties might have placed quite highly here, as well as the Long Island counties – it would be an interesting exercise to try to infer the vmt driven for the remaining counties, perhaps based on a regression model.

adjusted_wide = adjusted %>% 
  select(-n_per_100k, -n) %>% 
  pivot_wider(names_from = is_pedbike, values_from = n_per_100k_vmt)
crash_scatter %+% adjusted_wide
## Warning: Removed 45 rows containing missing values (geom_point).
## Warning: Removed 45 rows containing missing values (geom_text_repel).

Here is another bonus scatter plot (not assigned).

Exercise 9

Not important that you match my answer, but that your answer makes sense and is explained.

The fatalities per 100k VMT is most informative for motorists about their marginal risk of driving. For pedestrians/cyclists, we really would like to know how much exposure (miles or hours) pedestrians/cyclists logged in a county in order to adjust for the amount of time they spend on the roads. This would likely make NYC environs appears less hazardous. I imagine that the amount of driving and amount of pedestrian time probably has a non-linear and complicated interaction.

Rubric