In this lab we revisit the Denny’s and La Quinta Inn and Suites data we visualized in the previous lab.
Go to the course website and accept the lab assignment, clone it in RStudio and open the R Markdown document. Knit the document to make sure it compiles without errors.
Before we introduce the data, let’s warm up with some simple exercises. Update the YAML of your R Markdown file with your information, knit, commit, and push your changes. Make sure to commit with a meaningful commit message. Then, go to your repo on GitHub and confirm that your changes are visible in your Rmd and md files. If anything is missing, commit and push again.
We’ll use the tidyverse package for much of the data wrangling and visualization and the data lives in the dsbox package, but I’ve commented that out because it didn’t work for all. So either use the library, or load the data from .csv files in your repo. You can load the libaries by running the following in your Console:
library(tidyverse)
library(dsbox)
Remember that the datasets we’ll use are called dennys
and laquinta
from the dsbox package. If
you want to load them in your console, run this:
<- read_csv("data/dennys.csv")
dennys <- read_csv("data/laquinta.csv") laquinta
You can find out more about the datasets by inspecting their
documentation, which you can access by running ?dennys
and
?laquinta
in the Console or using the Help menu in RStudio
to search for dennys
or laquinta
. You can also
find this information here
and here.
dn_ak
. How many Denny’s locations are there in Alaska?= dennys %>%
dn_ak filter(state == "AK")
nrow(dn_ak)
lq_ak
. How many La Quinta locations are there in
Alaska?= laquinta %>%
lq_ak filter(state == "AK")
nrow(lq_ak)
Next we’ll calculate the distance between all Denny’s and all La Quinta locations in Alaska. Let’s take this step by step:
Step 1: There are 3 Denny’s and 2 La Quinta locations in Alaska. (If you answered differently above, you might want to recheck your answers.)
Step 2: Let’s focus on the first Denny’s location. We’ll need to calculate two distances for it: (1) distance between Denny’s 1 and La Quinta 1 and (2) distance between Denny’s 1 and La Quinta (2).
Step 3: Now let’s consider all Denny’s locations.
In order to calculate these distances we need to first restructure
our data to pair the Denny’s and La Quinta locations. To do so, we will
join the two data frames. We have six join options in R. Each of these
join functions take at least three arguments: x
,
y
, and by
.
x
and y
are data frames to joinby
is the variable(s) to join byFour of these join functions combine variables from the two data frames:
These are called mutating joins.
inner_join()
: return all rows from x
where there are matching values in y
, and all columns from
x
and y
.
left_join()
: return all rows from x
,
and all columns from x
and y
. Rows in x with
no match in y will have NA values in the new columns.
right_join()
: return all rows from y
,
and all columns from x
and y
. Rows in y with
no match in x will have NA values in the new columns.
full_join()
: return all rows and all columns from
both x
and y
. Where there are not matching
values, returns NA for the one missing.
And the other two join functions only keep cases from the left-hand
data frame, and are called filtering joins. You can
find out more about the join functions in the help files for any one of
them, e.g. ?full_join
.
In practice we mostly use mutating joins. In this case we want to
keep all rows and columns from both dn_ak
and
lq_ak
data frames. So we will use a
full_join
.
Full join of Denny’s and La Quinta locations in AK
Let’s join the data on Denny’s and La Quinta locations in Alaska, and take a look at what it looks like:
= full_join(dn_ak, lq_ak, by = "state")
dn_lq_ak dn_lq_ak
dn_lq_ak
data
frame? What are the names of the variables in this data frame..x
in the variable names means the variable comes from
the x
data frame (the first argument in the
full_join
call, i.e. dn_ak
), and
.y
means the variable comes from the y
data
frame. These variables are renamed to include .x
and
.y
because the two data frames have the same variables and
it’s not possible to have two variables in a data frame with the exact
same name.
Now that we have the data in the format we wanted, all that is left is to calculate the distances between the pairs.
🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.
One way of calculating the distance between any two points on the earth is to use the Haversine distance formula. This formula takes into account the fact that the earth is not flat, but instead spherical.
This function is not available in R, but we have it saved in a file
called haversine.R
that we can load and then use. (We do
this for you by source
ing the file in the provided
code-chunk.)
= function(long1, lat1, long2, lat2, round = 3) {
haversine # convert to radians
= long1 * pi / 180
long1 = lat1 * pi / 180
lat1 = long2 * pi / 180
long2 = lat2 * pi / 180
lat2
= 6371 # Earth mean radius in km
earth_r
= sin((lat2 - lat1) / 2) ^ 2 +
a cos(lat1) * cos(lat2) * sin((long2 - long1) / 2) ^ 2
= earth_r * 2 * asin(sqrt(a))
d
return(round(d, round)) # distance in km
}
This function takes five arguments:
Still considering the state of Alaska, calculate the distances
between all pairs of Denny’s and La Quinta locations and save this
variable as distance
. Save this variable in
dn_lq_ak
data frame so that you can use it later.
Calculate the minimum distance between a Denny’s and La Quinta for each Denny’s location (recall there are three). To do so we group by Denny’s locations and calculate a new variable that stores the information for the minimum distance.
= dn_lq_ak %>%
dn_lq_ak_mindist group_by(address.x) %>%
summarise(closest = min(distance))
🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.
Repeat the same analysis for New York: (i) filter Denny’s and La Quinta Data Frames for NY, (ii) join these data frames to get a complete list of all possible pairings, (iii) calculate the distances between all possible pairings of Denny’s and La Quinta in NY, (iv) find the minimum distance between each Denny’s and La Quinta location, (v) visualize and describe the distribution of these shortest distances using appropriate summary statistics.
Repeat the same analysis for a state of your choosing, different than the ones we covered so far. Note that there’s a way you could execute steps 8 and 9 simultaneously and generate a faceted visualization. If you understand how to do this, feel free to do so.
Among the states you examined, where is Mitch Hedberg’s joke most likely to hold true? Explain your reasoning.
🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards and review the md document on GitHub to make sure you’re happy with the final state of your work.
25 points total. * 10 questions @ 2 points for correct and complete answers * 5 points github commit history