Lab 06

Load packages and data

library(reshape2)
library(tidyverse)
library(lubridate)
cases = readr::read_csv(gsub("[ \n]", "",
  "https://urmc-bst.github.io/
   bst430-fall2021-site/hw_lab_instruction/lab06-covid-times/data/us-states.csv"))
## Rows: 32318 Columns: 9
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (2): geoid, state
## dbl  (6): cases, cases_avg, cases_avg_per_100k, deaths, deaths_avg, deaths_a...
## date (1): date
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Ex1

sub_counties = filter(cases, state %in% c("Alabama", "Arizona", 'California', 'Florida', 'New York', 'Texas'), date >= as.Date('2020-03-01'))
sub_county_long = pivot_longer(sub_counties, c(cases, deaths), names_to = 'stat')
plt = ggplot(sub_county_long, aes(x = date, y = value, color = stat)) + 
  geom_line(alpha = .5, position = 'stack') + facet_wrap(~state+stat, scales = 'free_y') + theme_minimal() +
  scale_color_manual(values = c('purple', 'brown')) + theme(axis.text.x = element_text(angle = 45))
plt

This is the code behind the original figure, so it reproduces that figure exactly. Your figure does not need to match it exactly.

Ex2

sub_counties_long = filter(sub_county_long, date >= as.Date('2021-08-01'))
plt = ggplot(sub_counties_long, aes(x = date, y = value, color = stat)) + 
  geom_line(alpha = .5, position = 'stack') + facet_wrap(~state+stat, scales = 'free_y') + theme_minimal() +
  scale_color_manual(values = c('purple', 'brown')) + theme(axis.text.x = element_text(angle = 45))
plt

In these plots, we can see a weekly pattern: each Monday there is a big spike after low or zero values over the weekend.
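One way to check the weekly pattern numerically is to average the counts by day of the week. The sketch below uses a hypothetical toy series (the values are invented for illustration, not taken from the lab data) with zeros on weekends and a catch-up spike on Mondays.

```r
# Hypothetical toy series: four weeks of daily counts, starting on a Monday,
# with a Monday spike and zeros on Saturday and Sunday (invented values).
dates = seq(as.Date("2021-08-02"), by = "day", length.out = 28)
value = rep(c(300, 100, 100, 100, 100, 0, 0), times = 4)

# Day of week as a number (1 = Monday, ..., 7 = Sunday), independent of locale
dow = as.integer(format(dates, "%u"))

# Mean count for each day of the week; Monday stands out, the weekend is zero
mean_by_dow = tapply(value, dow, mean)
mean_by_dow
```

On the real data, the same idea would apply to the `value` column of the filtered long data for one state and one `stat` at a time.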

Ex3

ggplot(sub_counties_long %>% filter(stat=="cases"), aes(x = date, y = value, color = state)) + 
  geom_line(alpha = .5, position = 'stack') + 
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45)) +
  labs(y="cases")

ggplot(sub_counties_long %>% filter(stat=="deaths"), aes(x = date, y = value, color = state)) + 
  geom_line(alpha = .5, position = 'stack') + 
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45)) +
  labs(y="deaths")

By showing all the states on top of each other, we can see clearly that the weekly pattern is the same for all six states. We can also see that the width of one week stays consistent over these two and a half months.

Ex4

sub_average_long = pivot_longer(sub_counties, c(cases_avg_per_100k, deaths_avg_per_100k), names_to = 'stat')
plt = ggplot(sub_average_long, aes(x = date, y = value, color = stat)) + 
  geom_line(alpha = .5, position = 'stack') + facet_wrap(~state+stat, scales = 'free_y') + theme_minimal() +
  scale_color_manual(values = c('purple', 'brown')) + theme(axis.text.x = element_text(angle = 45))
plt

This plot shows much less of a weekly pattern, presumably because the averaged columns smooth over the days of the week.
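To illustrate why an averaged series loses the weekly pattern, here is a minimal sketch that applies a trailing 7-day mean to a hypothetical toy series. It assumes the `*_avg` columns are rolling averages over a window of about a week; the lab data does not state the window, so treat the 7 as an assumption.

```r
# Hypothetical toy series with a strong weekly cycle (invented values)
value = rep(c(300, 100, 100, 100, 100, 0, 0), times = 4)

# Trailing 7-day mean. stats::filter is spelled out because dplyr masks filter().
# The first six entries are NA because a full window is not yet available.
avg7 = stats::filter(value, rep(1 / 7, 7), sides = 1)

range(value)                # raw series swings between 0 and 300
range(avg7, na.rm = TRUE)   # averaged series is essentially flat
```

Because every 7-day window contains exactly one of each weekday, the weekday effect cancels out of the average.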

Ex5

We need the wide-format data, so it is easiest to use the version from before we made it longer. If we did not have that version, we could always convert back to wide format with pivot_wider.

  • Upon what day did the greatest total number of cases per capita occur (summing across the 6 states)?

Group the data from before it was made longer by date, use summarize to sum cases_avg_per_100k over the six states, then find the row (date) with the maximum total.

  • Upon what day was the inter-state variance the greatest? (again, across the 6 states).

Group the data from before it was made longer by date, use summarize to calculate the variance of cases_avg_per_100k over the six states, then find the date with the maximum variance.

  • Upon what days did California have more cases per capita than Texas, but less than Florida?

Use mutate on the data from before it was made longer to create a logical variable that is true when FL > CA > TX and false otherwise. Alternatively, just filter to keep only the dates with CA > TX and CA < FL.
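The three answers above can be sketched with dplyr and tidyr. This is only one possible approach, shown on a hypothetical toy tibble (three states, three dates, invented values) rather than on the lab data. The names `toy`, `by_date`, and `ca_between` are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per date and state (invented values)
toy = tibble(
  date = rep(as.Date("2021-01-01") + 0:2, each = 3),
  state = rep(c("California", "Florida", "Texas"), times = 3),
  cases_avg_per_100k = c(10, 90, 5, 60, 50, 70, 20, 22, 18)
)

# Total and variance across states, one row per date
by_date = toy %>%
  group_by(date) %>%
  summarize(total = sum(cases_avg_per_100k),
            variance = var(cases_avg_per_100k))

# Date with the greatest total, and date with the greatest variance
max_total_date = by_date %>% slice_max(total, n = 1) %>% pull(date)
max_var_date = by_date %>% slice_max(variance, n = 1) %>% pull(date)

# Dates where California is above Texas but below Florida:
# put one column per state, then filter on the comparison
ca_between = toy %>%
  pivot_wider(names_from = state, values_from = cases_avg_per_100k) %>%
  filter(California > Texas, California < Florida) %>%
  pull(date)
```

On the lab data, `toy` would be replaced by the filtered data from before it was made longer, keeping only the date, state, and cases_avg_per_100k columns before the pivot_wider step.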

Wide, matrix format

Ex6

#the quick, but difficult to think about way
#acast means that the result will be an array/matrix
caseper100_mat = reshape2::acast(sub_counties, date ~ state, value.var = "cases_avg_per_100k")
dim(caseper100_mat)
## [1] 589   6
is.numeric(caseper100_mat)
## [1] TRUE

We see that the data is a matrix with one row per day (589 rows) and one column for each of the six states. The matrix is numeric.
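If the acast formula is hard to think about, the same reshaping (dates in rows, states in columns) can be sketched in base R with tapply. The example below uses a hypothetical toy data frame with invented values. It is an alternative for illustration, not the method used above.

```r
# Hypothetical toy data: one row per date and state (invented values)
toy = data.frame(
  date = rep(as.Date("2021-01-01") + 0:2, each = 2),
  state = rep(c("Florida", "Texas"), times = 3),
  cases_avg_per_100k = c(1, 2, 3, 4, 5, 6)
)

# Rows are dates, columns are states. Each cell holds a single value,
# so sum() simply returns that value.
toy_mat = tapply(toy$cases_avg_per_100k,
                 list(as.character(toy$date), toy$state),
                 sum)

dim(toy_mat)         # number of dates by number of states
is.numeric(toy_mat)  # a numeric matrix, as with acast
```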

Ex7

# variance of each row; the argument is named mat to avoid shadowing base::matrix
row_var = function(mat, na_rm = FALSE) {
  apply(mat, 1, var, na.rm = na_rm)
}

which.max(rowSums(caseper100_mat, na.rm = TRUE)) %>% names()
## [1] "2021-01-08"
which.max(row_var(caseper100_mat, na_rm = TRUE)) %>% names()
## [1] "2021-08-16"
which(caseper100_mat[, "California"] > caseper100_mat[, "Texas"] &
      caseper100_mat[, "California"] < caseper100_mat[, "Florida"]) %>% names()
##  [1] "2020-03-21" "2020-03-22" "2020-03-23" "2020-03-24" "2020-03-25"
##  [6] "2020-03-26" "2020-03-27" "2020-03-28" "2020-03-29" "2020-03-30"
## [11] "2020-03-31" "2020-04-01" "2020-04-02" "2020-04-03" "2020-04-04"
## [16] "2020-04-05" "2020-04-06" "2020-04-07" "2020-04-08" "2020-04-09"
## [21] "2020-04-16" "2020-04-17" "2020-04-18" "2020-04-19" "2020-04-20"
## [26] "2020-04-21" "2020-04-22" "2020-04-23" "2020-06-14" "2020-06-15"
## [31] "2020-11-28" "2020-12-01" "2021-02-18" "2021-02-19" "2021-02-20"
## [36] "2021-02-21" "2021-07-12"

I find the matrix easiest to work with. It is essentially the wide-format dataset in another form, whereas the long-format data is not set up to answer these questions directly.

Ex8

This could also be done with all the states, by making a similar matrix.

library(pheatmap)
cor_matrix = cor(caseper100_mat, use="pairwise")
pheatmap(cor_matrix, main = "Correlation of Cases per 100k between six states")

The cases per capita are highly correlated between California and Arizona. They are also strongly correlated among Florida, Alabama, and Texas. New York's case rates were the least correlated with those of the other states.
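As a small illustration of what cor does with use = "pairwise", the sketch below builds a hypothetical toy matrix with a few missing values (invented numbers, with columns A, B, and C standing in for states). Each pair of columns is correlated using only the rows where both are observed.

```r
# Hypothetical toy matrix with missing values (invented numbers)
m = cbind(A = c(1, 2, 3, 4, NA),
          B = c(2, 4, 6, 8, 10),
          C = c(5, 3, NA, 1, 0))

# "pairwise.complete.obs" is the full name of the option abbreviated above
cm = cor(m, use = "pairwise.complete.obs")
cm

# A and B are perfectly correlated on the four rows where both are present;
# B and C move in opposite directions, so their correlation is negative.
```

The result is a symmetric matrix with ones on the diagonal, which is the shape pheatmap expects for a correlation heatmap.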

Rubric

26 points