Photo by Clark Van Der Beken on Unsplash
In this assignment we’ll look at traffic accidents in New York State. The data cover all recorded accidents in NY in 2018 and 2019. Some of the variables were modified for the purposes of this assignment.
Go to the course GitHub organization and locate your homework repo, which should be named hw-2-YOUR_GITHUB_USERNAME.
Grab the URL of the repo, and clone it in RStudio.
First, open the R Markdown document hw02.Rmd and Knit it.
Make sure it compiles without errors.
The output will be in the Markdown (.md) file with the same name.
Before we introduce the data, let’s warm up with some simple exercises.
We’ll use the tidyverse package for much of the data wrangling and visualization, and vroom to load the .csv. vroom is purely a convenience here: it reads a compressed .csv file directly, so there is no need to decompress the file before reading it. We’ll also need the lubridate package to wrangle our dates.
These packages are already installed for you.
You can load them by running the following in your Console:
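A minimal version of those calls, assuming the three packages named above are the only ones needed:

```r
# Load the packages used in this assignment
library(tidyverse)  # data wrangling and visualization
library(vroom)      # reads (compressed) .csv files
library(lubridate)  # date and time helpers
```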
We can load the data with the following:
crashes = vroom("https://urmc-bst.github.io/bst430-fall2024-site/hw_lab_instruction/hw02-accidents/data/ny_collisions_2018_2019.csv.gz")
You can find out more about the dataset in the NY open data portal: https://data.ny.gov/Transportation/Motor-Vehicle-Crashes-Case-Information-Three-Year-/e8ky-4vqe . There’s a detailed data dictionary here.
🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.
Make a simple table counting occurrences of the Crash Descriptor. Use this table and the existing levels in Crash Descriptor to add another variable called severity. Make this variable a factor with shorter, yet descriptive, level names. Set the factor levels so that they are ordered by severity. In your answer, don’t forget to label your R chunk(s) as well (where it says label-me-1). Your label should be short and informative, shouldn’t include spaces, and shouldn’t repeat a previous label.
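As a rough sketch of the general pattern (not a solution for the real data), the snippet below counts a descriptor column and recodes it into an ordered-by-severity factor. The toy tibble and its descriptor strings are assumptions made for illustration; check them against the table you produce from crashes.

```r
library(dplyr)

# Toy data; the descriptor strings are assumed, not taken from the real file
toy <- tibble(
  `Crash Descriptor` = c(
    "Property Damage Accident", "Injury Accident",
    "Fatal Accident", "Property Damage Accident"
  )
)

# Simple table of occurrences
toy |> count(`Crash Descriptor`)

# Recode to short labels, listing levels from least to most severe
toy <- toy |>
  mutate(
    severity = factor(
      `Crash Descriptor`,
      levels = c("Property Damage Accident", "Injury Accident", "Fatal Accident"),
      labels = c("Property", "Injury", "Fatal")
    )
  )

levels(toy$severity)
```

Any value that is not listed in levels becomes NA, so every descriptor that appears in the real data needs a matching entry.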
Add a column dt to crashes that converts the Date column to an appropriate date class using lubridate.
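A minimal sketch of this kind of conversion on toy data, assuming the dates are stored as month/day/year text. That format is an assumption; inspect the Date column first and pick the matching lubridate parser (mdy, dmy, ymd).

```r
library(dplyr)
library(lubridate)

# Toy data; the month/day/year text format is an assumption
toy <- tibble(Date = c("01/15/2018", "12/31/2019"))

toy <- toy |>
  mutate(dt = mdy(Date))  # parse month-day-year strings into Date objects

class(toy$dt)
```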
Add a new column decimal_hours that converts Time into fractional hours since midnight, also using lubridate.
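One way to sketch the arithmetic, assuming the times are stored as "HH:MM" text (an assumption; if the column was read in as a time class already, the parsing step may differ): parse each value into a lubridate period, then add the hour component to the minute component divided by 60.

```r
library(dplyr)
library(lubridate)

# Toy data; the "HH:MM" text format is an assumption
toy <- tibble(Time = c("00:30", "13:45", "23:15"))

toy <- toy |>
  mutate(
    # hm() parses hours and minutes; 45 minutes is 45 / 60 = 0.75 hours
    decimal_hours = hour(hm(Time)) + minute(hm(Time)) / 60
  )

toy$decimal_hours
```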
Recreate the following plot, and describe it in the context of the data. Describe the patterns you see for Property accidents vs Fatal accidents on weekdays vs weekends.
🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.
Hint: use lubridate::month to extract the numeric index of the month.
On which date did the highest total number of accidents occur? Examine the data and columns provided, and see if you can determine a cause for the date with the highest total number of accidents. In general, what is a possible explanation for the pattern observed between warm-season (May-Oct) and cold-season (Nov-Apr) Total and Fatal accidents?
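The counting steps behind these questions can be sketched on toy data as follows. The tibble, the column names mon and season, and the "warm"/"cold" labels are all hypothetical; only the May-Oct and Nov-Apr split comes from the question itself.

```r
library(dplyr)
library(lubridate)

# Toy data standing in for an already-parsed date column
toy <- tibble(
  dt = ymd(c("2018-01-15", "2018-07-04", "2018-07-04", "2019-11-28"))
)

toy <- toy |>
  mutate(
    mon = month(dt),                                 # numeric month, 1 to 12
    season = if_else(mon %in% 5:10, "warm", "cold")  # May-Oct vs Nov-Apr
  )

# Dates ordered by number of accidents, busiest first
by_date <- toy |> count(dt, sort = TRUE)
by_date

# Number of accidents in each season
toy |> count(season)
```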
Create another data visualization based on these data and interpret it. You can choose any variables and any type of visualization you like, but it must have at least three variables, e.g. a scatterplot of x vs. y isn’t enough, but if points are colored or faceted by z, that’s fine. In your answer, don’t forget to label your R chunk as well.
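To illustrate what counts as three variables, here is a small sketch with made-up numbers: x and y positions plus a third variable mapped to color. The data frame, its values, and its column names are invented for illustration and say nothing about the real data.

```r
library(ggplot2)

# Made-up toy data, for illustration only
toy <- data.frame(
  decimal_hours = c(1.5, 8.0, 17.25, 2.0, 9.5, 18.0),
  n = c(3, 12, 20, 1, 2, 4),
  severity = rep(c("Property", "Fatal"), each = 3)
)

# x and y alone would be two variables; color adds the third
p <- ggplot(toy, aes(x = decimal_hours, y = n, color = severity)) +
  geom_point() +
  labs(x = "Hour of day", y = "Number of accidents", color = "Severity")

p
```

Swapping the color mapping for facet_wrap(~ severity) would satisfy the same requirement with faceting instead.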
🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards. Then review the md document and the lintr report on GitHub to make sure you’re happy with the final state of your work.