This is the first assignment where you will be graded and receive feedback on your coding style.

Getting started

Accept the assignment from GitHub Classroom, then go to the course GitHub organization and locate your homework repo, which should be named hw03-pedestrian-YOUR_GITHUB_USERNAME. Grab the URL of the repo, and clone it in RStudio. First, open the R Markdown document hw03.Rmd and Knit it. Make sure it compiles without errors. The output will be in the Markdown .md file with the same name.

Warm up

Before we introduce the data, let’s warm up with some simple exercises.

Update the YAML, changing the author name to your name, and knit the document.
Commit your changes with a meaningful commit message.
Push your changes to GitHub.
Go to your repo on GitHub and confirm that your changes are visible in your Rmd and md files. If anything is missing, commit and push again.

Packages

We use the tidyverse, vroom, readxl and janitor packages. They should all be installed by now, but install them if they are missing. You can load them by running the following:

library(tidyverse)
library(vroom)
library(readxl)
library(janitor)

Data

We can load the data with the following:

crashes <- vroom("https://urmc-bst.github.io/bst430-fall2021-site/hw_lab_instruction/hw02-accidents/data/ny_collisions_2018_2019.csv.gz")

You can find out more about the dataset in the NY open data portal: https://data.ny.gov/Transportation/Motor-Vehicle-Crashes-Case-Information-Three-Year-/e8ky-4vqe . There’s a detailed data dictionary here.

Exercises

Convert the names in crashes to snake_case using janitor::clean_names(). Filter the crashes to only include fatal accidents. You should have 1763 observations.
Consider the event_descriptor column. First, convert it to lower case. Then, using str_detect, define the set of events that involve collisions with bicyclists or pedestrians. Mutate crashes to add a new variable called is_pedbike that identifies these.
i) Convert the values stored in county_name to Title Case (the purpose of this will become clear subsequently, I swear!). ii) Count the number of fatal crashes per county, per is_pedbike. iii) Consider the top 20 counties (county_name) with the most fatal crashes in the data set. Make a bar chart showing each county and the number of a) bicycle and pedestrian events and b) other events, filling the bars appropriately to show these two categories. You will be graded on having an appropriate sort order for the counties and appropriate axis labels.
Download the county population data for New York from the last census before 2018. Put the file into a sensible place in your RStudio project. Load it using read_csv, clean up the column names using janitor::clean_names, filter it down to the relevant rows, and select the relevant columns from it.
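As a rough illustration of the lower-casing, str_detect, and counting steps described in the exercises above, here is a minimal sketch on made-up data. The rows, the descriptor strings, and the regular expression are assumptions for illustration only; check the actual values of event_descriptor in crashes before settling on a pattern.

```r
library(dplyr)
library(stringr)
library(tibble)

# Toy data: these rows and descriptor strings are invented for illustration
# and are not taken from the real crashes data.
toy_crashes <- tribble(
  ~county_name, ~event_descriptor,
  "MONROE",     "Pedestrian, Collision With",
  "MONROE",     "Other Motor Vehicle, Collision With",
  "ERIE",       "Bicyclist, Collision With",
  "ERIE",       "Tree, Collision With Fixed Object",
  "ERIE",       "Pedestrian, Collision With"
)

toy_counts <- toy_crashes %>%
  mutate(
    # lower-case first so the pattern does not depend on capitalization
    event_descriptor = str_to_lower(event_descriptor),
    # TRUE when the descriptor mentions a pedestrian or a bicyclist
    # (assumed pattern; adapt it to the real descriptor values)
    is_pedbike = str_detect(event_descriptor, "pedestrian|bicyclist"),
    county_name = str_to_title(county_name)
  ) %>%
  # one row per county and is_pedbike combination
  count(county_name, is_pedbike, name = "n_crashes")

toy_counts
```

The resulting table has one row per county and is_pedbike value, which is a convenient shape for a filled bar chart.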

Hint: you will want to either remove the “County” part of the ctyname in the census data, using functions found in stringr, or mutate a new column in your crashes counts table that appends (glues) “County” onto the county_name variable.
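The two options in the hint might look something like the sketch below. The census rows and population figures are invented, and toy_census and toy_counts are hypothetical names; only ctyname and county_name come from the assignment text.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical census rows; the population figures are invented.
toy_census <- tribble(
  ~ctyname,        ~population,
  "Monroe County", 700000,
  "Erie County",   900000
)

# Option 1: strip the trailing " County" so the key matches county_name
census_short <- toy_census %>%
  mutate(county_name = str_remove(ctyname, " County$"))

# Option 2: append " County" on the crash-count side instead
toy_counts <- tibble(
  county_name = c("Monroe", "Erie"),
  n_crashes = c(3, 5)
)
counts_long <- toy_counts %>%
  mutate(ctyname = str_c(county_name, " County"))
```

Either way, the goal is a key column that is spelled identically in both tables before joining.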

Join the population data to your table of crashes from Ex 3, and repeat your plot from Ex 3, now normalizing the number of events per county by the population. (Fatalities per 100,000 population gives nice units here.) Your top 20 counties ought to be different here. Discuss what you find.
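The normalization itself is just count / population * 100,000. A small sketch with invented numbers (toy_counts and toy_pop are hypothetical tables, not the real data):

```r
library(dplyr)
library(tibble)

# Invented counts and populations, chosen so the arithmetic is easy to check.
toy_counts <- tribble(
  ~county_name, ~n_crashes,
  "Monroe",     14,
  "Erie",       27
)
toy_pop <- tribble(
  ~county_name, ~population,
  "Monroe",     700000,
  "Erie",       900000
)

toy_rates <- toy_counts %>%
  left_join(toy_pop, by = "county_name") %>%
  # fatal crashes per 100,000 residents
  mutate(per_100k = n_crashes / population * 1e5) %>%
  arrange(desc(per_100k))

toy_rates
```

With these made-up figures, 14 / 700,000 * 100,000 = 2 and 27 / 900,000 * 100,000 = 3, so the second county ranks first after normalizing.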
Download the vehicle miles traveled (VMT) per capita data available from the US Department of Transportation. You can read more about it here. Put the file into a sensible place in your RStudio project and load the Urbanized Area sheet into R using readxl. Clean up the column names.

Hint: your life will be made easier if you construct a “crosswalk” mapping the identifiers between the VMT dataset urbanized_area and county_name from the crashes data, either as a .csv file that you read in with read_csv or using the tibble or tribble function directly in your markdown. Then join the files using the crosswalk. Here’s an example of the first seven rows of such a crosswalk:

Find the 13 rows in the VMT data that are primarily in New York (i.e. ignoring Danbury and Bridgeport, CT-NY, which honestly do not include much, if any, of New York) and contain a non-missing value for vehicle_miles_traveled_per_capita_raw_value using filter and str_detect. Using the list below, identify the counties corresponding to these urbanized_area values, and join this to the table. (This will not be a one-to-one join.) Then join the fixed-up VMT table to the fatal crash counts.

Metro area	County
New York-Newark, NY-NJ-CT	Queens
New York-Newark, NY-NJ-CT	Kings
New York-Newark, NY-NJ-CT	New York
New York-Newark, NY-NJ-CT	Bronx
New York-Newark, NY-NJ-CT	Richmond
Rochester, NY	Monroe
Buffalo, NY	Erie
Albany-Schenectady, NY	Albany
Binghamton, NY-PA	Broome
Elmira, NY	Chemung
Glens Falls, NY	Warren
Ithaca, NY	Tompkins
Kingston, NY	Ulster
Poughkeepsie-Newburgh, NY-NJ	Dutchess
Saratoga Springs, NY	Saratoga
Syracuse, NY	Onondaga
Utica, NY	Oneida
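One way to encode a few rows of the list above as a crosswalk and use it in a join is sketched below. The VMT values are invented, toy_vmt is a hypothetical stand-in for the cleaned VMT table, and the urbanized_area strings are assumed to match the spelling in the list; the actual spreadsheet may spell them differently, so check before joining.

```r
library(dplyr)
library(tibble)

# First three rows of the metro area / county list, typed in with tribble.
crosswalk <- tribble(
  ~urbanized_area,             ~county_name,
  "New York-Newark, NY-NJ-CT", "Queens",
  "New York-Newark, NY-NJ-CT", "Kings",
  "Rochester, NY",             "Monroe"
)

# Hypothetical VMT table; the numeric values are invented.
toy_vmt <- tribble(
  ~urbanized_area,             ~vehicle_miles_traveled_per_capita_raw_value,
  "New York-Newark, NY-NJ-CT", 10,
  "Rochester, NY",             20
)

# One urbanized area can map to several counties, so the join
# repeats the VMT value once per matching county.
vmt_by_county <- toy_vmt %>%
  left_join(crosswalk, by = "urbanized_area")

vmt_by_county
```

Because the join is one-to-many, the result has one row per county rather than one row per urbanized area, and it can then be joined to the crash counts by county_name.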

Derive the fatalities per 100,000 vehicle miles traveled, per county, using the crash count table from Ex 3 and 5. Repeat your plot from Ex 5 (though now you will only have 17 counties). Discuss your findings.
Which estimate, if any, would be most informative about the hazard rate of being a pedestrian/cyclist in NY state? What other factors would be helpful in refining your estimate of the hazard?

🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards, and review the md document and the lintr report on GitHub to make sure you’re happy with the final state of your work.

Rubric

37 points total
27 points for correct and fully described answers @ 3 points per exercise
5 points for an adequately described commit history
5 points for coding style