library(reshape2)
library(tidyverse)
library(lubridate)
cases = readr::read_csv(gsub("[ \n]", "",
"https://urmc-bst.github.io/
bst430-fall2021-site/hw_lab_instruction/lab06-covid-times/data/us-states.csv"))
## Rows: 32318 Columns: 9
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (2): geoid, state
## dbl (6): cases, cases_avg, cases_avg_per_100k, deaths, deaths_avg, deaths_a...
## date (1): date
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
sub_counties = filter(cases,
                      state %in% c("Alabama", "Arizona", "California", "Florida", "New York", "Texas"),
                      date >= as.Date("2020-03-01"))
sub_county_long = pivot_longer(sub_counties, c(cases, deaths), names_to = 'stat')
plt = ggplot(sub_county_long, aes(x = date, y = value, color = stat)) +
geom_line(alpha = .5, position = 'stack') + facet_wrap(~state+stat, scales = 'free_y') + theme_minimal() +
scale_color_manual(values = c('purple', 'brown')) + theme(axis.text.x = element_text(angle = 45))
plt
This was my original figure, so I reproduce it exactly. You don’t need to match this figure exactly.
sub_counties_long = filter(sub_county_long, date >= as.Date('2021-08-01'))
plt = ggplot(sub_counties_long, aes(x = date, y = value, color = stat)) +
geom_line(alpha = .5, position = 'stack') + facet_wrap(~state+stat, scales = 'free_y') + theme_minimal() +
scale_color_manual(values = c('purple', 'brown')) + theme(axis.text.x = element_text(angle = 45))
plt
In these plots we can see a weekly pattern: each Monday there is a big spike after low or zero values over the weekend.
ggplot(sub_counties_long %>% filter(stat=="cases"), aes(x = date, y = value, color = state)) +
geom_line(alpha = .5, position = 'stack') +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45)) +
labs(y="cases")
ggplot(sub_counties_long %>% filter(stat=="deaths"), aes(x = date, y = value, color = state)) +
geom_line(alpha = .5, position = 'stack') +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45)) +
labs(y="deaths")
By showing all the states on top of each other, we can see clearly that the weekly pattern is the same for all six states. We can also see that the width of one week is consistent across these two and a half months.
sub_average_long = pivot_longer(sub_counties, c(cases_avg_per_100k, deaths_avg_per_100k), names_to = 'stat')
plt = ggplot(sub_average_long, aes(x = date, y = value, color = stat)) +
geom_line(alpha = .5, position = 'stack') + facet_wrap(~state+stat, scales = 'free_y') + theme_minimal() +
scale_color_manual(values = c('purple', 'brown')) + theme(axis.text.x = element_text(angle = 45))
plt
This plot shows much less of a weekly pattern, which is what we would expect from averaged values that smooth over day-of-week reporting effects.
We need to use the wide format data, so it’s easiest to use the version before we made it longer, but if we didn’t have that we could always make it into wide format with pivot_wider.
Using summarize on the data before it was made longer, sum cases_avg_per_100k over the six states, then find the row (date) with the maximum over dates.
Using summarize on the data before it was made longer, calculate the variance of cases_avg_per_100k over the six states, then find the date with the maximum variance.
Mutate the data before it was made longer to create a logical variable that is TRUE when FL > CA > TX and FALSE otherwise, or just filter to keep only the dates with CA > TX and CA < FL.
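The three steps described above can be sketched with dplyr and tidyr. This is only an illustration on a hypothetical toy data frame (`toy`, with made-up values and three states), not the lab data; the names `by_date`, `toy_wide`, and `fl_ca_tx_dates` are introduced here for the example.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for sub_counties: hypothetical values, one row per date and state
toy = tibble(
  date = rep(as.Date(c("2021-01-01", "2021-01-02", "2021-01-03")), each = 3),
  state = rep(c("California", "Florida", "Texas"), times = 3),
  cases_avg_per_100k = c(10, 20, 5,
                         30, 40, 35,
                         8, 50, 2)
)

# Sum and variance of cases_avg_per_100k across states, one row per date
by_date = toy %>%
  group_by(date) %>%
  summarize(total = sum(cases_avg_per_100k, na.rm = TRUE),
            variance = var(cases_avg_per_100k, na.rm = TRUE))

# Dates at which the sum and the variance are largest
max_total_date = by_date$date[which.max(by_date$total)]
max_var_date = by_date$date[which.max(by_date$variance)]

# One column per state, then keep only the dates with FL > CA > TX
toy_wide = pivot_wider(toy, names_from = state, values_from = cases_avg_per_100k)
fl_ca_tx_dates = toy_wide %>%
  filter(Florida > California, California > Texas) %>%
  pull(date)
```

On the toy values, the sum is largest on 2021-01-02, the variance is largest on 2021-01-03, and FL > CA > TX holds on 2021-01-01 and 2021-01-03. Note that the state comparison needs one column per state, which is why `pivot_wider` is applied first.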
# The quick way, though harder to reason about
# acast means that the result will be an array/matrix
caseper100_mat = reshape2::acast(sub_counties, date ~ state, value.var = "cases_avg_per_100k")
dim(caseper100_mat)
## [1] 589 6
is.numeric(caseper100_mat)
## [1] TRUE
We see that the data is a matrix with one row per day (589 rows) and one column for each of the six states. The matrix is numeric.
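To make the reshaping step concrete, here is a minimal sketch of `acast` on a hypothetical two-date, two-state data frame (`toy`, with made-up values). It only illustrates the shape of the result: dates become row names and states become column names.

```r
library(reshape2)

# Hypothetical long-format data: one row per (date, state) pair
toy = data.frame(
  date = rep(c("2021-01-01", "2021-01-02"), each = 2),
  state = rep(c("Florida", "Texas"), times = 2),
  cases_avg_per_100k = c(1, 2, 3, 4)
)

# Rows are dates, columns are states, cells hold the chosen value column
toy_mat = acast(toy, date ~ state, value.var = "cases_avg_per_100k")

# Row-wise summaries keep the date as the name of each element
toy_row_sums = rowSums(toy_mat, na.rm = TRUE)
toy_max_date = names(which.max(toy_row_sums))
```

Here `toy_mat` is a 2 x 2 numeric matrix and `toy_max_date` is "2021-01-02", because the row sums are 3 and 7. Keeping the dates as row names is what lets `which.max(...) %>% names()` return a date directly.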
row_var = function(matrix, na_rm = FALSE) {
apply(matrix, 1, var, na.rm = na_rm)
}
which.max(rowSums(caseper100_mat, na.rm = TRUE)) %>% names()
## [1] "2021-01-08"
which.max(row_var(caseper100_mat, na_rm = TRUE)) %>% names()
## [1] "2021-08-16"
which(caseper100_mat[, "California"] > caseper100_mat[, "Texas"] &
      caseper100_mat[, "California"] < caseper100_mat[, "Florida"]) %>% names()
## [1] "2020-03-21" "2020-03-22" "2020-03-23" "2020-03-24" "2020-03-25"
## [6] "2020-03-26" "2020-03-27" "2020-03-28" "2020-03-29" "2020-03-30"
## [11] "2020-03-31" "2020-04-01" "2020-04-02" "2020-04-03" "2020-04-04"
## [16] "2020-04-05" "2020-04-06" "2020-04-07" "2020-04-08" "2020-04-09"
## [21] "2020-04-16" "2020-04-17" "2020-04-18" "2020-04-19" "2020-04-20"
## [26] "2020-04-21" "2020-04-22" "2020-04-23" "2020-06-14" "2020-06-15"
## [31] "2020-11-28" "2020-12-01" "2021-02-18" "2021-02-19" "2021-02-20"
## [36] "2021-02-21" "2021-07-12"
I believe it’s easiest to use the matrix, which is similar to the wide-format dataset; the long-format data is not set up to answer these questions.
This could also be done with all the states, by making a similar matrix.
library(pheatmap)
cor_matrix = cor(caseper100_mat, use="pairwise")
pheatmap(cor_matrix, main = "Correlation of Cases per 100k between six states")
The cases per capita are highly correlated between California and Arizona. They are also strongly correlated among Florida, Alabama, and Texas. New York’s case rates were the least correlated with those of the other states.
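As a side note on the `use = "pairwise"` argument in the `cor` call above: it is matched to `"pairwise.complete.obs"`, so each pair of columns is correlated using only the rows where both values are present. The sketch below shows this on a hypothetical small matrix `m` with made-up values and a few missing entries.

```r
# Hypothetical matrix with missing values in columns A and C
m = cbind(
  A = c(1, 2, 3, 4, NA),
  B = c(2, 4, 6, 8, 10),
  C = c(5, 3, NA, 1, 0)
)

# Each pairwise correlation uses only the rows complete for that pair
cor_pairwise = cor(m, use = "pairwise.complete.obs")

# With the default use = "everything", any NA in a pair gives NA
cor_default = cor(m)
```

In `cor_pairwise`, A and B are perfectly correlated on their four shared rows, while `cor_default` returns NA for every pair that involves a column with missing values.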
26 points