library(tidyverse)
library(lubridate)
library(vroom)
crashes = vroom(
gsub(pattern = "[ \n]", "", "https://urmc-bst.github.io/bst430-fall2021-site
/hw_lab_instruction/hw02-accidents/data/ny_collisions_2018_2019.csv.gz"))
The crashes data set has 881617
temp = crashes %>% group_by(`Crash Descriptor`)
temp %>% summarise(n())
## # A tibble: 4 x 2
## `Crash Descriptor` `n()`
## <chr> <int>
## 1 Fatal Accident 1763
## 2 Injury Accident 77760
## 3 Property Damage & Injury Accident 165297
## 4 Property Damage Accident 636797
crashes = crashes %>%
mutate(severity = fct_recode(`Crash Descriptor`,
Fatal = "Fatal Accident",
"Injury & Property" = "Property Damage & Injury Accident",
"Injury" = "Injury Accident",
"Property" = "Property Damage Accident"))
temp = crashes %>% group_by(severity)
temp %>% summarise(n())
## # A tibble: 4 x 2
## severity `n()`
## <fct> <int>
## 1 Fatal 1763
## 2 Injury 77760
## 3 Injury & Property 165297
## 4 Property 636797
crashes = crashes %>%
mutate(dt = mdy(Date))
Now dt is a date variable.
crashes = crashes %>%
mutate(Time = hms(Time), decimal_hours = hour(Time) + minute(Time)/60)
The new decimal_hours will be nice for plots.
crashes = crashes %>%
mutate(is_weekend = ifelse(`Day of Week` %in% c('Saturday', 'Sunday'), 'weekend', 'weekday'))
crashes %>%
ggplot(aes(x = decimal_hours, fill = severity)) +
geom_density(alpha = 0.5) +
facet_wrap(~is_weekend, ncol = 1) +
theme_minimal() +
scale_fill_viridis_d() + #use the viridis color palette for "severity"
labs(
title = "Number of accidents throughout the day",
subtitle = "By day of week and severity",
x = "Time of day",
y = "Density",
fill = "Severity"
)
During the week, property (and other non-fatal) accidents are more likely at morning and evening rush hours. On the weekends, property accidents are generally during the daylight hours when more people are awake and driving. However, fatal crashes are relatively more likely in the late night hours of the day, even more so on the weekends.
# looks the same but isn't the same
# by_date = crashes %>%
# group_by(dt, severity, .drop = FALSE) %>%
# summarize(n()) %>%
# ungroup() # we don't want to keep the dt&severity groups for the next step
#
# #SO DO IT THIS WAY
by_date = crashes %>% count(dt, severity, .drop = FALSE)
date_severity = by_date %>% group_by(dt) %>% mutate(Total = sum(n)) %>% # add different severities to get Total
pivot_wider(c(dt, Total), names_from = severity, values_from = n) %>%
mutate(m = month(dt), season = case_when(
m >= 11 | m <= 4 ~ 'cold',
TRUE ~ 'warm'))
ggplot(date_severity, aes(x = Total, y = Fatal, color = season)) + geom_jitter(height = .2) +
labs(y = 'N. Fatal', x = 'N. Total',
title = "Daily Fatal Accidents vs Daily Total Accidents",
subtitle = "New York, 2018-2019") +
theme_minimal() + scale_color_manual(labels = c("Nov-Apr", "May-Oct"), values = c('purple', 'orange'))
This plot shows differences between warm and cold-weather driving.
# I actually did this inline, but I show it here so that you can see the code.
date_severity %>% filter(Total>2500) %>% select(dt)
## # A tibble: 1 x 1
## # Groups: dt [1]
## dt
## <date>
## 1 2018-11-15
The date with the most accidents was 2018-11-15.
(an explanation authored by TA: Christian Hammond)
This date coincides with a large early season snowstorm that hit New York, this would surely have made driving conditions more difficult and could account for the large number of accidents that occurred.
The visualization suggests that in the cold months of Nov-Apr there were a higher number of total accidents, but a lower number of fatalities than in the warmer months of May-Oct.
It is reasonable to conjecture that the higher number of total accidents in the cold months is due to harsh weather impacting driving conditions. An explanation for the higher number of fatalities in the warmer months is less apparent. One potential explanation is that more people are out drinking during the warmer months leading to an increase in drunk driving.
source: https://www.weather.gov/bgm/pastWinterNovember162018
Anything with the required complexity is good here.
(an example authored by TA: Christian Hammond)
The plot below shows the percentage of total accidents that occurred during each day of the week from 2018-2019 in Hamilton county and in New York county. The distribution of accidents over the days of the week differ between these two counties, this discrepancy might have to due with the different forms of transportation available in each county.
Friday had the largest difference in percentage of accidents between the two counties, with Hamilton county having a larger number of total accidents. New York County is the most population dense and Hamilton is the least, this difference in population density could be associated with a difference in availability of public transportation. A population dense county like New York county likely has more public transport options than Hamilton County.
It is possible that on Friday nights in New York County people tend to take public transportation when they go out for the night, while in Hamilton county they must drive. This could explain the discrepancy in percentage of accidents on Friday.
source: http://www.usa.com/rank/new-york-state--population-density--county-rank.htm
bycounty_data = crashes %>%
filter(`County Name` == "NEW YORK" |
`County Name` == "HAMILTON") %>%
select(`Day of Week`, `County Name`)
bycounty_plt = ggplot(
bycounty_data,
aes(
x = `Day of Week`,
y = ..prop..,
group = `County Name`,
fill = `County Name`
)
) +
geom_bar(position = "dodge") +
labs(
x = "Day of Week",
y = "Percentage",
title = "Accidents per Day of the Week in Hamilton County and New York County",
subtitle = "New York, 2018-2019"
) +
scale_fill_manual(
labels = c("Hamilton", "New York"),
values = c("HAMILTON" = "#e34a33", "NEW YORK" = "#fdbb84")
) +
theme_minimal()
bycounty_plt