HW 01 - Airbnb listings in New York City

Tanzy Love, based on the course by Andrew McDavid

2021-08-29 (updated: 2022-09-15)

Photo by Madeleine Kohler on Unsplash

Once upon a time, people traveled all over the world, and some stayed in hotels and others chose to stay in other people’s houses that they booked through Airbnb. As in many cities, Airbnb had an impact on the housing market of New York. Using data provided by Airbnb, we can explore how Airbnb availability and prices vary by neighborhood.

Getting started

Warm up

Before we introduce the data, let’s warm up with some simple exercises. Throughout the instructions, watch for the prompts to 🧶 knit, ✅ commit, and ⬆️ push.

Packages

We’ll use the tidyverse packages for much of the data wrangling and visualisation. They may already be installed if you’ve been doing earlier coursework. The data lives on the course website. You can load the packages and the nycbnb data by running the following in your Console:

library(tidyverse)
nycbnb <- read_csv("https://urmc-bst.github.io/bst430-fall2021-site/hw_lab_instruction/hw-01-airbnb/data/nylistings.csv")

Data

The data is loaded in the first code chunk in your template into an object called nycbnb.

You can view the dataset as a spreadsheet using the View() function. Do not put this function in your R Markdown document; type it directly in the Console instead, since it pops open a new window (and the concept of popping open a window in a static document doesn’t really make sense…). Running the following in the Console opens the data viewer window.

View(nycbnb)

You can find out more about the dataset by inspecting its data dictionary, available here: https://docs.google.com/spreadsheets/d/1iWCNJcSutYqpULSQHlNyGInUvHg2BoUGoNRIGa6Szc4/edit#gid=982310896, and you can read more about the project that collected it here: http://insideairbnb.com/.

Exercises

Hint: The Markdown Quick Reference sheet has an example of inline R code. You can access it from the Help menu in RStudio. You can also look at the markdown cheatsheet available on the course website.

  1. How many observations (rows) does the dataset have? Instead of hard coding the number in your answer, use inline code.
  2. Run View(nycbnb) in your Console to view the data in the data viewer. What does each row in the dataset represent?
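As a minimal sketch of what inline code looks like, here is the idea on a small made-up data frame. The object `toy_listings` and its columns are hypothetical stand-ins, not part of the homework data.

```r
# Hypothetical toy data standing in for the real listings
toy_listings <- data.frame(
  id = 1:4,
  area = c("A", "A", "B", "C"),
  cost = c(80, 120, 95, 200)
)

# The expression you would place inside inline code
n_rows <- nrow(toy_listings)
n_rows

# In the R Markdown text (outside any code chunk) the sentence could read:
#   The toy dataset has `r nrow(toy_listings)` rows.
# When the document is knitted, the inline expression is replaced by its value.
```

The advantage over typing the number by hand is that the text stays correct if the data changes.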

🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.

Each column represents a variable. We can get a list of the variables in the data frame using the names() function.

names(nycbnb)

  3. Create a faceted histogram where each facet represents a neighborhood and displays the distribution of Airbnb prices in that neighborhood. Think critically about whether it makes more sense to stack the facets on top of each other in a column, lay them out in a row, or wrap them around. Along with your visualization, include your reasoning for the layout you chose for your facets.

ggplot(data = ___, mapping = aes(x = ___)) +
  geom_histogram(binwidth = ___) +
  facet_wrap(~___) # or facet_grid...

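Let’s de-construct this code: ggplot() starts the plot and takes the data frame as its first argument, followed by the mapping between variables in the data and aesthetics of the plot (here, the variable shown on the x-axis). geom_histogram() adds a layer that draws the data as a histogram; try a few binwidth values to find one that suits the scale of the prices. facet_wrap() (or facet_grid()) splits the plot into one panel per neighborhood.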
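To see the layers working together, here is a rough sketch on simulated data. The data frame `toy` and its columns `area` and `cost` are made up for illustration, and the bin width and the single-column layout are arbitrary choices rather than the recommended answer for the Airbnb data.

```r
library(ggplot2)

# Hypothetical simulated data; `area` and `cost` are made-up column names
set.seed(1)
toy <- data.frame(
  area = rep(c("North", "South", "East"), each = 50),
  cost = c(rnorm(50, 100, 20), rnorm(50, 150, 30), rnorm(50, 80, 10))
)

# Layer 1: data and aesthetic mapping
# Layer 2: histogram geometry with an explicit bin width
# Layer 3: one panel per area, stacked in a single column
p <- ggplot(data = toy, mapping = aes(x = cost)) +
  geom_histogram(binwidth = 10) +
  facet_wrap(~area, ncol = 1)

p
```

Changing `ncol = 1` to `nrow = 1`, or dropping the argument, gives the row and wrapped layouts, which is a quick way to compare the options before settling on one.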

🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files (including the hw-01_files folder) so that your Git pane is cleared up afterwards.

  4. Use a single pipeline to identify the neighborhoods with the top five median listing prices. Then, in another pipeline filter the data for these five neighborhoods and make density plots (geom_density) of the distributions of listing prices in these five neighborhoods. In a third pipeline calculate the minimum, mean, median, standard deviation, IQR, maximum listing price, and the number of listings, in each of these neighborhoods. Use the visualization and the summary statistics to describe the distribution of listing prices in the neighborhoods. (Your answer will include three pipelines, one of which ends in a visualization, and a narrative.)
  5. Create a visualization that will help you compare the distribution of review scores (review_scores_rating) across neighborhoods. You get to decide what type of visualization to create and there is more than one correct answer! In your answer, include a brief interpretation of how Airbnb guests rate properties in general and how the neighborhoods compare to each other in terms of their ratings.
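If the pipeline exercise feels abstract, the following sketch shows the general shape of a "rank groups, then summarise the top ones" workflow on a tiny made-up data frame. The names `toy`, `area`, and `cost` are hypothetical, and it keeps the top two groups rather than five; the real data may need extra care that this sketch ignores.

```r
library(dplyr)

# Hypothetical toy data; `area` and `cost` are made-up column names
toy <- data.frame(
  area = c("A", "A", "B", "B", "C", "C", "D", "D"),
  cost = c(100, 120, 60, 80, 200, 220, 150, 170)
)

# Pipeline 1: the two areas with the highest median cost
top_areas <- toy %>%
  group_by(area) %>%
  summarise(median_cost = median(cost)) %>%
  slice_max(median_cost, n = 2)

# Pipeline 2: summary statistics for only those areas
top_summary <- toy %>%
  filter(area %in% top_areas$area) %>%
  group_by(area) %>%
  summarise(
    min_cost = min(cost),
    mean_cost = mean(cost),
    sd_cost = sd(cost),
    iqr_cost = IQR(cost),
    max_cost = max(cost),
    n = n()
  )

top_summary
```

Storing the result of the first pipeline and reusing it in `filter()` avoids typing the group names by hand, so the second pipeline still works if the ranking changes.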

🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.

  6. Does there appear to be a relationship between the review scores (review_scores_rating) and price? Make a plot and explain what you think might be going on.
  7. Can you think of any weaknesses with how price is being defined in this dataset that affect the ability to make conclusions about it and its relationship with location and rating?
  8. [extra credit] Come up with a proposal using other variables present in the data set to ameliorate one or more of the weaknesses identified in exercise 7, and implement it. Then choose one of the previous questions and redo your answer to it to demonstrate whether your proposal worked as intended.

🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards. Then review the md document on GitHub to make sure you’re happy with the final state of your work.

Rubric: 26 points total