Photo by ProfDEH on Wikimedia Commons
In this assignment we examine traffic accidents in New York State.
This is the first assignment in which your coding style will be graded and you will receive feedback on it.
Accept the assignment from GitHub Classroom, then go to the course GitHub organization and locate your homework repo, which should be named `hw03-pedestrian-YOUR_GITHUB_USERNAME`. Grab the URL of the repo, and clone it in RStudio. First, open the R Markdown document `hw03.Rmd` and Knit it. Make sure it compiles without errors. The output will be in the markdown `.md` file with the same name.
Before we introduce the data, let’s warm up with some simple exercises.
We use the `tidyverse`, `vroom`, `readxl` and `janitor` packages. They should all be installed by now, but install them if they are missing. You can load them by running the following:
library(tidyverse)
library(vroom)
library(readxl)
library(janitor)
We can load the data with the following:
crashes = vroom("https://urmc-bst.github.io/bst430-fall2021-site/hw_lab_instruction/hw02-accidents/data/ny_collisions_2018_2019.csv.gz")
You can find out more about the dataset in the NY open data portal: https://data.ny.gov/Transportation/Motor-Vehicle-Crashes-Case-Information-Three-Year-/e8ky-4vqe . There’s a detailed data dictionary here.
Convert the names in `crashes` to snake_case using `janitor::clean_names()`. Filter the crashes to only include fatal accidents. You should have 1763 observations.
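As a rough illustration of the clean-then-filter pattern, here is a minimal sketch on a toy table. The column `Crash Descriptor` and the value `"Fatal Accident"` are assumptions made for this example; check the data dictionary for the actual column and coding in `crashes`.

```r
library(dplyr)
library(tibble)
library(janitor)

# Toy stand-in for the crash data; column names and values are assumptions.
toy_crashes <- tibble(
  `Crash Descriptor` = c("Fatal Accident", "Injury Accident", "Fatal Accident"),
  `County Name` = c("MONROE", "ERIE", "KINGS")
)

toy_fatal <- toy_crashes %>%
  clean_names() %>%                               # "Crash Descriptor" becomes crash_descriptor
  filter(crash_descriptor == "Fatal Accident")    # keep only the fatal rows

names(toy_fatal)
nrow(toy_fatal)
```

After filtering the real data, comparing `nrow()` against the expected 1763 observations is a quick sanity check.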
Consider the `event_descriptor` column. First, convert it to lower case. Then, using `str_detect`, define the set of events that involve collisions with bicyclists or pedestrians. Mutate `crashes` to add a new variable called `is_pedbike` that identifies these.
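The lower-casing and pattern-matching steps might look roughly like the sketch below. The event descriptions are invented for illustration, and the pattern `"pedestrian|bicyclist"` is an assumption; inspect the distinct values in the real column before settling on a pattern.

```r
library(dplyr)
library(tibble)
library(stringr)

# Toy event descriptions; the real column may contain other wordings.
toy_events <- tibble(
  event_descriptor = c(
    "Pedestrian, Collision With",
    "Other Motor Vehicle, Collision With",
    "Bicyclist, Collision With",
    "Tree, Collision With Fixed Object"
  )
)

toy_events <- toy_events %>%
  mutate(
    event_descriptor = str_to_lower(event_descriptor),
    # "|" in a regular expression means "either pattern matches"
    is_pedbike = str_detect(event_descriptor, "pedestrian|bicyclist")
  )

toy_events
```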
i) Convert values stored in `county_name` to Title Case (the purpose of this will become clear subsequently, I swear!)
ii) Count the number of fatal crashes per county, per `is_pedbike`
iii) Consider the top 20 counties (`county_name`) with the most fatal crashes in the data set. Make a barchart showing each county and the number of a) bicycle and pedestrian events and b) other events, filling the bars appropriately to show these two categories. You will be graded on having an appropriate sort order for the county, and appropriate axis labels.
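One possible shape for the count, top-n, and plot pipeline is sketched below on a six-row toy table. The name `n_crashes`, the axis labels, and the choice to sort counties by total crashes are all assumptions for the example, not the required answer.

```r
library(dplyr)
library(tibble)
library(stringr)
library(forcats)
library(ggplot2)

# Toy fatal-crash rows; values are invented.
toy_fatal <- tibble(
  county_name = c("MONROE", "MONROE", "ERIE", "KINGS", "KINGS", "KINGS"),
  is_pedbike  = c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)
)

# Title-case the county, then count rows per county and category.
county_counts <- toy_fatal %>%
  mutate(county_name = str_to_title(county_name)) %>%
  count(county_name, is_pedbike, name = "n_crashes")

# Total crashes per county, keeping the 20 largest (ties may add rows).
top_counties <- county_counts %>%
  group_by(county_name) %>%
  summarize(total = sum(n_crashes)) %>%
  slice_max(total, n = 20)

# Order the county factor by total crashes so the bars are sorted.
plot_data <- county_counts %>%
  semi_join(top_counties, by = "county_name") %>%
  mutate(county_name = fct_reorder(county_name, n_crashes, .fun = sum))

p <- ggplot(plot_data, aes(x = n_crashes, y = county_name, fill = is_pedbike)) +
  geom_col() +
  labs(x = "Fatal crashes", y = "County", fill = "Pedestrian or bicycle")
```

Printing `p` draws the stacked bars; `fct_reorder()` with `.fun = sum` orders counties by their combined count rather than by either category alone.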
Download the county population data for New York from the last census before 2018. Put the file into a sensible place in your RStudio project. Load it using `read_csv`, clean up the column names using `janitor::clean_names`, filter it down to relevant rows, and select relevant columns from it.
Hint: you will want to either remove the “County” part of the `ctyname` in the census data, using functions found in `stringr`, or mutate a new column in your `crashes` counts table that appends (glues) “County” onto the `county_name` variable.
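Both options in the hint can be sketched with `stringr` on toy tables. The population figures below are made-up round numbers, not census values, and the object names are hypothetical.

```r
library(dplyr)
library(tibble)
library(stringr)

# Toy census table; populations are invented placeholders.
toy_census <- tibble(
  ctyname = c("Monroe County", "Erie County", "New York County"),
  population = c(700000, 900000, 1600000)
)

# Option 1: strip the trailing " County" so the key matches county_name.
toy_census <- toy_census %>%
  mutate(county_name = str_remove(ctyname, " County$"))

# Option 2: glue " County" onto county_name in the counts table instead.
toy_counts <- tibble(county_name = c("Monroe", "Erie")) %>%
  mutate(ctyname = str_c(county_name, " County"))
```

Anchoring the pattern with `$` removes only a trailing " County", which matters for names where the word could appear elsewhere.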
Join the population data to your table of crashes from Ex 3, and repeat your plot from Ex 3, now normalizing the number of events per county by the population. (Fatalities per 100,000 population gives nice units here.) Your top 20 counties ought to be different here. Discuss what you find.
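The join and normalization can be sketched as follows, again on invented numbers. The names `n_crashes`, `population`, and `per_100k` are assumptions carried over for illustration.

```r
library(dplyr)
library(tibble)

# Toy crash counts and populations; all values are invented.
toy_counts <- tibble(
  county_name = c("Monroe", "Monroe", "Erie"),
  is_pedbike  = c(TRUE, FALSE, FALSE),
  n_crashes   = c(5, 20, 30)
)
toy_pop <- tibble(
  county_name = c("Monroe", "Erie"),
  population  = c(500000, 1000000)
)

# Attach the population to every count row, then scale to a rate.
toy_rates <- toy_counts %>%
  left_join(toy_pop, by = "county_name") %>%
  mutate(per_100k = n_crashes / population * 1e5)

toy_rates
```

A `left_join()` keeps every row of the counts table, so a county missing from the population table shows up with `NA` rather than silently disappearing, which makes key mismatches easy to spot.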
Download the vehicle miles traveled (VMT) per capita data available from the US Department of Transportation. You can read more about it here. Put the file into a sensible place in your RStudio project and load the `Urbanized Area` sheet into R using `readxl`. Clean up the column names.
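Reading one named sheet and cleaning its column names might look like the sketch below. Because the DOT workbook is not available here, the example uses the `datasets.xlsx` file that ships with `readxl` and its `mtcars` sheet as a stand-in; for the assignment you would pass your own file path and `sheet = "Urbanized Area"`.

```r
library(readxl)
library(janitor)

# Example workbook bundled with readxl, used only as a stand-in.
path <- readxl_example("datasets.xlsx")

# List the sheet names to confirm the exact spelling before reading.
excel_sheets(path)

# Read a single sheet by name, then standardize the column names.
toy_sheet <- clean_names(read_excel(path, sheet = "mtcars"))

names(toy_sheet)
```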
Hint: your life will be made easier if you construct a “crosswalk” mapping the identifiers between the VMT dataset `urbanized_area` and `county_name` from the crashes data, either as a .csv file that you read in with `read_csv` or using the `tibble` or `tribble` function directly in your markdown. Then join the files using the crosswalk. Here’s an example of the first seven rows of such a crosswalk:
`vehicle_miles_traveled_per_capita_raw_value` using `filter` and `str_detect`. Using the list below, identify the counties corresponding to these `urbanized_areas`, and join this to the table. (This will not be a one-to-one join.) Then join the fixed up VMT table to the fatal crash counts.

Metro area | County |
---|---|
New York-Newark, NY-NJ-CT | Queens |
New York-Newark, NY-NJ-CT | Kings |
New York-Newark, NY-NJ-CT | New York |
New York-Newark, NY-NJ-CT | Bronx |
New York-Newark, NY-NJ-CT | Richmond |
Rochester, NY | Monroe |
Buffalo, NY | Erie |
Albany-Schenectady, NY | Albany |
Binghamton, NY-PA | Broome |
Elmira, NY | Chemung |
Glens Falls, NY | Warren |
Ithaca, NY | Tompkins |
Kingston, NY | Ulster |
Poughkeepsie-Newburgh, NY-NJ | Dutchess |
Saratoga Springs, NY | Saratoga |
Syracuse, NY | Onondaga |
Utica, NY | Oneida |
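A crosswalk like the table above can be written inline with `tribble` and then joined. The sketch below uses three of the listed rows; the VMT values and the column name `vmt_per_capita` are invented placeholders.

```r
library(dplyr)
library(tibble)

# Three rows copied from the metro area / county list above.
crosswalk <- tribble(
  ~urbanized_area,              ~county_name,
  "New York-Newark, NY-NJ-CT",  "Queens",
  "New York-Newark, NY-NJ-CT",  "Kings",
  "Rochester, NY",              "Monroe"
)

# Toy VMT table; the numbers are made up for illustration.
toy_vmt <- tibble(
  urbanized_area = c("New York-Newark, NY-NJ-CT", "Rochester, NY"),
  vmt_per_capita = c(10, 20)
)

# One urbanized area maps to several counties, so its VMT value
# is repeated on each matching county row.
vmt_by_county <- crosswalk %>%
  left_join(toy_vmt, by = "urbanized_area")

vmt_by_county
```

The result has one row per county, keyed by `county_name`, which is the shape needed to join onto the fatal crash counts.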
Derive the fatalities per 100,000 vehicle miles traveled, per county, using the crash count tables from Ex 3 and 5. Repeat your plot from Ex 5 (though now you will only have 17 counties). Discuss your findings.
Which estimate, if any, would be most informative about the hazard rate of being a pedestrian/cyclist in NY state? What other factors would be helpful in refining your estimate of the hazard?
🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards. Then review the md document and the lintr report on GitHub to make sure you’re happy with the final state of your work.