Once upon a time, people traveled all over the world, and some stayed in hotels and others chose to stay in other people’s houses that they booked through Airbnb. As in many cities, Airbnb had an impact on the housing market of New York. Using data provided by Airbnb, we can explore how Airbnb availability and prices vary by neighborhood.

Getting started

Accept the github classroom assignment
Go to the bst-urmc GitHub organization and locate your homework repo, which should be named hw-1-YOUR_GITHUB_USERNAME.
Grab the URL of the repo, and start a new project from Github in RStudio.
Open the R Markdown document hw-01.Rmd and Knit it. Make sure it compiles without errors. The output will be in the file markdown .md file with the same name. (You’ll turn in the .md file, and the plots generated along with it).
If you need your github PAT, it’s in your github account here: “https://github.com/settings/tokens”

Warm up

Before we introduce the data, let’s warm up with some simple exercises. ⊕Keep an eye out in the instructions for where you are instructed to: 🧶 knit ✅ commit ⬆️ push

Update the YAML, changing the author name to your name.
🧶knit the document.
✅ Commit your changes (with a meaningful commit message.)
⬆️ Push your changes to GitHub.
Go to your repo on GitHub and confirm that your changes are visible in your Rmd and md files. If anything is missing, commit and push again.

Packages

We’ll use the tidyverse package for much of the data wrangling and visualisation. The data lives on the course website, and is loaded below. These packages may be already installed, if you’ve been doing earlier coursework. You can load them, and the nycbnb data by running the following in your Console:

library(tidyverse)
nycbnb = read_csv("https://urmc-bst.github.io/bst430-fall2021-site/hw_lab_instruction/hw-01-airbnb/data/nylistings.csv")

Data

The data is loaded in the first code chunk in your template into an object called nycbnb.

You can view the dataset as a spreadsheet using the View() function. Note that you should not put this function in your R Markdown document, but instead type it directly in the Console, as it pops open a new window (and the concept of popping open a window in a static document doesn’t really make sense…). When you run this in the console, you’ll see the following data viewer window pop up.

View(nycbnb)

You can find out more about the dataset by inspecting its data dictionary, available here: https://docs.google.com/spreadsheets/d/1iWCNJcSutYqpULSQHlNyGInUvHg2BoUGoNRIGa6Szc4/edit#gid=982310896, and you can read more about the project that collected it here: http://insideairbnb.com/.

Exercises

⊕Hint: The Markdown Quick Reference sheet has an example of inline R code. You can access it from the Help menu in RStudio. You can also look at the markdown cheatsheet available on the course website.

How many observations (rows) does the dataset have? Instead of hard coding the number in your answer, use inline code.
Run View(nycbnb) in your Console to view the data in the data viewer. What does each row in the dataset represent?

🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.

Each column represents a variable. We can get a list of the variables in the data frame using the names() function.

names(nycbnb)

Create a faceted histogram where each facet represents a neighborhood and displays the distribution of Airbnb prices in that neighborhood. Think critically about whether it makes more sense to stack the facets on top of each other in a column, lay them out in a row, or wrap them around. Along with your visualization, include your reasoning for the layout you chose for your facets.

ggplot(data = ___, mapping = aes(x = ___)) +
  geom_histogram(binwidth = ___) +
  facet_wrap(~___) # or facet_grid...

Let’s de-construct this code:

ggplot() is the function we are using to build our plot, in layers.
In the first layer we always define the data frame as the first argument. Then, we define the mappings between the variables in the dataset and the aesthetics of the plot (e.g. x and y coordinates, colors, etc.).
In the next layer we represent the data with geometric shapes, in this case with a histogram. You should decide what makes a reasonable bin width for the histogram by trying out a few options.
In the final layer we facet the data by neighborhood.

🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files (including the hw-01_files folder) so that your Git pane is cleared up afterwards.

Use a single pipeline to identify the neighborhoods with the top five median listing prices. Then, in another pipeline filter the data for these five neighborhoods and make density plots (geom_density) of the distributions of listing prices in these five neighborhoods. In a third pipeline calculate the minimum, mean, median, standard deviation, IQR, maximum listing price, and the number of listings, in each of these neighborhoods. Use the visualization and the summary statistics to describe the distribution of listing prices in the neighborhoods. (Your answer will include three pipelines, one of which ends in a visualization, and a narrative.)
Create a visualization that will help you compare the distribution of review scores (review_scores_rating) across neighborhoods. You get to decide what type of visualization to create and there is more than one correct answer! In your answer, include a brief interpretation of how Airbnb guests rate properties in general and how the neighborhoods compare to each other in terms of their ratings.

🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.

Does there appear to be a relationship between the review scores (review_scores_rating) and price? Make a plot and explain what you think might be going on.
Can you think of any weaknesses with how price is being defined in this dataset that affect the ability to make conclusions about it and its relationship with location and rating?
[extracredit] Come up with a proposal using other variables present in the data set to ameliorate one or more of the weaknesses identified in 7, and implement it. Choose one of the previous questions and repeat your answer to it to demonstrate if your proposal worked as intended.

🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards and review the md document on GitHub to make sure you’re happy with the final state of your work.

HW 01 - Airbnb listings in New York City

Tanzy Love, based on the course by Andrew McDavid

2021-08-29 (updated: 2022-09-15)

Getting started

Warm up

Packages

Data

Exercises

Rubric: 26 points total