Load packages and data

library(tidyverse)
nycbnb = read_csv("https://urmc-bst.github.io/bst430-fall2021-site/hw_lab_instruction/hw-01-airbnb/data/nylistings.csv")

Exercises

Exercise 1

The NYC AirBNB data has 12773 rows.
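The row count can be read directly off the data frame with `nrow()`. A minimal sketch, using a made-up three-row stand-in (`toy`, hypothetical values) rather than the real `nycbnb` data:

```r
library(dplyr)

# Hypothetical stand-in for nycbnb; the values are invented for illustration
toy <- tibble(
  neighborhood = c("Chelsea", "SoHo", "Midtown"),
  price = c(200, 250, 180)
)

n_rows <- nrow(toy)  # number of rows (listings)
n_cols <- ncol(toy)  # number of columns (variables)
```

Calling `nrow(nycbnb)` after the `read_csv()` step above should give the count reported here.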

Exercise 2

Each row in the NYC AirBNB data represents one property available on AirBNB in NYC.

Exercise 3

I used wrapped facets for these histograms because there are many neighborhoods, and the panels would be too small to read in a single row.

ggplot(data=nycbnb, mapping=aes(x=price))+
  geom_histogram(binwidth=5)+
  facet_wrap(~neighborhood)

Exercise 4

top.5 = nycbnb %>% group_by(neighborhood) %>%
  summarize(median=median(price)) %>%
  arrange(median) %>% slice_tail(n=5)
top.5
## # A tibble: 5 x 2
##   neighborhood       median
##   <chr>               <dbl>
## 1 Midtown              186 
## 2 Financial District   193 
## 3 Chelsea              200 
## 4 SoHo                 200 
## 5 West Village         222.
nycbnb %>% 
  filter(neighborhood %in% top.5$neighborhood) %>%
  ggplot(mapping=aes(x=price)) + geom_density() +
  facet_wrap(~neighborhood)

nycbnb %>% 
  filter(neighborhood %in% top.5$neighborhood) %>%
  group_by(neighborhood) %>%
  summarize(minimum=min(price),mean=mean(price),
            standarddev=sd(price),IQR=IQR(price), 
            maximum=max(price), n=n())
## # A tibble: 5 x 7
##   neighborhood       minimum  mean standarddev   IQR maximum     n
##   <chr>                <dbl> <dbl>       <dbl> <dbl>   <dbl> <int>
## 1 Chelsea                 49  260.        183.  265.    1000   328
## 2 Financial District      53  212.        120.   50     1000   242
## 3 Midtown                 41  260.        197.  154     1000   589
## 4 SoHo                    73  305.        256.  205     1000   113
## 5 West Village            15  270.        203.  162.    1000   184

The five neighborhoods with the highest median nightly price are Midtown, Financial District, Chelsea, SoHo, and West Village. In all five, most prices are around $200 per night, with a few very high prices. The Financial District stands out for having less variation (its standard deviation and IQR are the smallest of the five) because it has almost no very low or very high prices.

Exercise 5

Ratings tend to land on whole numbers (which happens when all subratings are the same), and almost all ratings are 5s.
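One way to put a number on the whole-number pattern is to compute the share of ratings that equal their rounded value. This is a rough sketch on an invented `ratings` vector, not the real `review_scores_rating` column; it assumes missing ratings should simply be dropped.

```r
# Hypothetical ratings (invented values, including one missing rating)
ratings <- c(5, 5, 4.5, 5, 4, NA, 4.83, 5)

# TRUE where a rating is a whole number, NA where the rating is missing
is_whole <- ratings == round(ratings)

# Shares among the non-missing ratings
share_whole <- mean(is_whole, na.rm = TRUE)
share_five <- mean(ratings == 5, na.rm = TRUE)
```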

The neighborhoods have different numbers of rentals, but the pattern is generally consistent, except for the Theater District, which gets some lower ratings.

ggplot(data=nycbnb, mapping=aes(x=review_scores_rating))+
  geom_histogram(binwidth=.5)

ggplot(data=nycbnb, mapping=aes(x=review_scores_rating))+
  geom_histogram(binwidth=.5)+
  facet_wrap(~neighborhood, scales = 'free_y')

Exercise 6

ggplot(data=nycbnb, mapping=aes(x=review_scores_rating, y=price))+
  geom_point()

There is a slight relationship between price and ratings, but the ratings don’t have much variability. Still, the lowest ratings tend to be on the cheaper rentals.
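Because the ratings are bunched near the top of the scale, a rank-based correlation may be a reasonable numeric companion to the scatterplot. The sketch below uses invented `price` and `rating` vectors (not the real data) and assumes listings with a missing rating are dropped.

```r
# Hypothetical prices and ratings (invented values, one missing rating)
price <- c(80, 120, 150, 200, 300, 95)
rating <- c(4.2, 4.8, 5, 4.9, 5, NA)

# Spearman correlation uses ranks, so it is less sensitive to the
# heavy clustering of ratings at 5 than the Pearson correlation
rho <- cor(price, rating, method = "spearman", use = "complete.obs")
```

A positive `rho` would be consistent with cheaper rentals receiving the lowest ratings.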

Exercise 7

(authored by TA: Josh Marvald)

By using just the price variable we don’t take into account that some neighborhoods may have more listings with multiple beds while other neighborhoods may have more listings with just a single bed. This would make the comparisons between neighborhoods skewed because listings with more beds are usually more expensive. A more apt comparison would be to compare prices between neighborhoods while grouping by number of beds so that we are comparing similar properties in different locations.
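The comparison described above could be sketched by adding `beds` to the grouping, so that each neighborhood gets one summary per bed count. The `toy` tibble below is a made-up stand-in for `nycbnb` with invented neighborhoods and prices.

```r
library(dplyr)

# Hypothetical stand-in for nycbnb (invented values)
toy <- tibble(
  neighborhood = c("A", "A", "A", "B", "B", "B"),
  beds = c(1, 1, 2, 1, 2, 2),
  price = c(100, 120, 200, 90, 210, 250)
)

# Median price and listing count for each neighborhood / bed-count pair
by_beds <- toy %>%
  group_by(neighborhood, beds) %>%
  summarize(median_price = median(price), n = n(), .groups = "drop")
```

Reading `by_beds` one bed count at a time compares like with like: 1-bed listings in A against 1-bed listings in B, and so on.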

Exercise 8 (extra credit)

(authored by TA: Josh Marvald)

From the first density plot we can see that when we compared prices between the 5 most expensive neighborhoods, we were also comparing sets of listings with different bed count frequencies. For Chelsea, the Financial District, and Midtown, the listings with more beds have quite different price distributions. The distributions for SoHo and the West Village are more as I would’ve expected: unimodal for each bed count, with listings with more beds being generally more expensive.

The second plot shows linear regressions of review score on price, fitted separately for each number of beds within each of the top 5 neighborhoods. From it we can see that the relationship between price and review score depends on the number of beds, and that this dependence differs by neighborhood. For instance, in the Financial District there appears to be a negative association between price and review score for 2 bed listings, whereas the association is positive in all other neighborhoods. The same phenomenon exists in Midtown with 1 bed listings. Granted, when including error bounds for the regressions the association can change direction in some cases, but the point stands that the interaction between number of beds and price varies between the neighborhoods.

Moving forward I would probably try to standardize the counts for each number of beds across the neighborhoods in question to see if the relationship between neighborhood, price, and review scores changes.
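One possible reading of "standardize the counts" is direct standardization: compute the mean price per bed count within each neighborhood, then average those means using the same bed-count weights for every neighborhood. This is an assumption about what is intended, not the author's stated method. The `toy` data and the choice of pooled bed-count shares as weights are both hypothetical.

```r
library(dplyr)

# Hypothetical stand-in for nycbnb (invented values)
toy <- tibble(
  neighborhood = c("A", "A", "A", "B", "B", "B"),
  beds = c(1, 1, 2, 1, 2, 2),
  price = c(100, 120, 200, 90, 210, 250)
)

# Common weights: share of each bed count across all neighborhoods pooled
weights <- toy %>%
  count(beds) %>%
  mutate(w = n / sum(n)) %>%
  select(beds, w)

# Bed-standardized mean price per neighborhood
standardized <- toy %>%
  group_by(neighborhood, beds) %>%
  summarize(mean_price = mean(price), .groups = "drop") %>%
  inner_join(weights, by = "beds") %>%
  group_by(neighborhood) %>%
  # Dividing by sum(w) renormalizes if a neighborhood lacks some bed count
  summarize(std_price = sum(mean_price * w) / sum(w))
```

In this toy example the raw mean prices are 140 for A and about 183 for B, but the standardized means are 155 and 160, so most of the raw gap comes from B having more 2-bed listings.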

# limiting to top 5 neighborhoods and listings with fewer than 7 beds to better visualize relationships
top5 = top.5 #changing to Josh's data name

nycbnb %>%
  filter(beds < 7) %>%
  mutate(beds = factor(beds)) %>%
  filter(neighborhood %in% top5$neighborhood) %>%
  ggplot(., aes(x = price, color = beds)) +
  geom_density() +
  facet_wrap(vars(neighborhood))

nycbnb %>%
  filter(beds < 7) %>%
  mutate(beds = factor(beds)) %>%
  filter(neighborhood %in% top5$neighborhood) %>%
  ggplot(., aes(x = price, y = review_scores_rating, color = beds)) +  # beds is already a factor
  stat_smooth(method = "lm", se = FALSE) + 
  facet_wrap(vars(neighborhood))
## `geom_smooth()` using formula 'y ~ x'

Rubric: 26 points total