class: ur-title, center, middle, title-slide .title[ # BST430 Lecture 4 ] .subtitle[ ## Data types in R ] .author[ ### Tanzy Love, based on the course by Andrew McDavid ] .institute[ ### U of Rochester ] .date[ ### 2021-09-01 (updated: 2024-09-10 by TL) ] --- class: middle # Why should you care about data types? --- ## Example: Cat lovers A survey asked respondents their name and number of cats. The instructions said to enter the number of cats as a numerical value. ``` r cat_lovers = read_csv("l04/data/cat-lovers.csv") ``` ``` r cat_lovers ``` ``` ## # A tibble: 60 × 3 ## name number_of_cats handedness ## <chr> <chr> <chr> ## 1 Bernice Warren 0 left ## 2 Woodrow Stone 0 left ## 3 Willie Bass 1 left ## 4 Tyrone Estrada 3 left ## # ℹ 56 more rows ``` Here's the [R code in this lecture](l04/l04-data-types-i.R) Here's the [datafile](l04/data/cat-lovers.csv) *You have to have a good filepath to the dataset* --- ## Oh why won't you work?! ``` r cat_lovers %>% summarise(mean_cats = mean(number_of_cats)) ``` ``` ## Warning: There was 1 warning in `summarise()`. ## ℹ In argument: `mean_cats = mean(number_of_cats)`. ## Caused by warning in `mean.default()`: ## ! argument is not numeric or logical: returning NA ``` ``` ## # A tibble: 1 × 1 ## mean_cats ## <dbl> ## 1 NA ``` --- ``` r ?mean ``` <img src="l04/img/mean-help.png" width="75%" style="display: block; margin: auto;" /> --- ## Oh why won't you still work??!! ``` r cat_lovers %>% summarise(mean_cats = mean(number_of_cats, na.rm = TRUE)) ``` ``` ## Warning: There was 1 warning in `summarise()`. ## ℹ In argument: `mean_cats = mean(number_of_cats, na.rm = TRUE)`. ## Caused by warning in `mean.default()`: ## ! argument is not numeric or logical: returning NA ``` ``` ## # A tibble: 1 × 1 ## mean_cats ## <dbl> ## 1 NA ``` --- ## Take a breath and look at your data .question[ What is the type of the `number_of_cats` variable? 
] ``` r glimpse(cat_lovers) ``` ``` ## Rows: 60 ## Columns: 3 ## $ name <chr> "Bernice Warren", "Woodrow Stone", "Will… ## $ number_of_cats <chr> "0", "0", "1", "3", "3", "2", "1", "1", … ## $ handedness <chr> "left", "left", "left", "left", "left", … ``` -- .large[.center[💡!]] --- ## Let's take another look .small[
] --- ## Sometimes you might need to babysit your respondents .midi[ ``` r cat_lovers %>% mutate(number_of_cats = case_when( name == "Ginger Clark" ~ 2, name == "Doug Bass" ~ 3, TRUE ~ as.numeric(number_of_cats) )) %>% summarise(mean_cats = mean(number_of_cats)) ``` ``` ## Warning: There was 1 warning in `mutate()`. ## ℹ In argument: `number_of_cats = case_when(...)`. ## Caused by warning: ## ! NAs introduced by coercion ``` ``` ## # A tibble: 1 × 1 ## mean_cats ## <dbl> ## 1 0.833 ``` ] --- ## Always you need to respect data types ``` r cat_lovers %>% mutate( number_of_cats = case_when( name == "Ginger Clark" ~ "2", name == "Doug Bass" ~ "3", TRUE ~ number_of_cats ), number_of_cats = as.numeric(number_of_cats) ) %>% summarise(mean_cats = mean(number_of_cats)) ``` ``` ## # A tibble: 1 × 1 ## mean_cats ## <dbl> ## 1 0.833 ``` <!-- ??? --> <!-- This generates a warning for unknown (case_when specific) reasons --> --- ## Now that we know what we're doing... ``` r *cat_lovers = cat_lovers %>% mutate( number_of_cats = case_when( name == "Ginger Clark" ~ "2", name == "Doug Bass" ~ "3", TRUE ~ number_of_cats ), number_of_cats = as.numeric(number_of_cats) ) ``` --- ## Moral of the story - If your data does not behave how you expect it to, type coercion upon reading in the data might be the reason. - Go in and investigate your data, apply the fix, *save your data*, live happily ever after. --- class: middle .hand[.light-blue[now that we have a good motivation for]] .hand[.light-blue[learning about data types in R]] <br> .large[ .hand[.light-blue[let's learn about data types in R!]] ] --- class: middle # Data types --- ## Atomic data types in R These are the fundamental building blocks (**atoms**) of all R vectors (and all data in R is a vector!) - **logical** - **integer** numbers - **double** (real) numbers - **complex** numbers - **character** - and some more, but we won't be discussing these yet. 
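As a quick check of the list above, here is a minimal sketch that calls `typeof()` on one literal of each atomic type. The name `examples` is ours and is not part of the lecture code.

``` r
# One literal per atomic type, in the order listed above.
# `examples` is a hypothetical name used only for this illustration.
examples = list(TRUE, 7L, 7, 1+0i, "seven")

# typeof() reports the underlying storage type of each element
sapply(examples, typeof)
```

The result should be the five type names in the same order as the bullets: logical, integer, double, complex, character.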
--- ## Logical & character .pull-left[ **logical** - boolean values `TRUE` and `FALSE` ``` r typeof(TRUE) ``` ``` ## [1] "logical" ``` ] .pull-right[ **character** - character strings ``` r typeof("hello") ``` ``` ## [1] "character" ``` ] --- ## Double & integer .pull-left[ **double** - floating point numerical values (default numerical type) ``` r typeof(1.335) ``` ``` ## [1] "double" ``` ``` r typeof(7) ``` ``` ## [1] "double" ``` ] .pull-right[ **integer** - integer numerical values (indicated with an `L`) ``` r typeof(7L) ``` ``` ## [1] "integer" ``` ``` r typeof(1:3) ``` ``` ## [1] "integer" ``` ] --- ## Complex numbers R also natively supports complex numbers, which are their own type: .pull-left[ ``` r roots_of_unity = c(1+0i, -1+0i, 0+1i, 0-1i) typeof(roots_of_unity) ``` ``` ## [1] "complex" ``` ``` r roots_of_unity^2 ``` ``` ## [1] 1+0i 1+0i -1+0i -1+0i ``` ] .pull-right[ ``` r roots_of_unity^4 ``` ``` ## [1] 1+0i 1+0i 1+0i 1+0i ``` ``` r Re(roots_of_unity) ``` ``` ## [1] 1 -1 0 0 ``` ``` r Im(roots_of_unity) ``` ``` ## [1] 0 0 1 -1 ``` ] --- ## Lists **Lists** are 1d objects that can contain any combination of R objects .pull-left[ .small[ ``` r mylist = list("A", 1:4, c(TRUE, FALSE)) mylist ``` ``` ## [[1]] ## [1] "A" ## ## [[2]] ## [1] 1 2 3 4 ## ## [[3]] ## [1] TRUE FALSE ``` ]] .pull-right[ ``` r *str(mylist) ``` ``` ## List of 3 ## $ : chr "A" ## $ : int [1:4] 1 2 3 4 ## $ : logi [1:2] TRUE FALSE ``` ] --- # `str` is our friend .font130[ * It shows the *str*ucture of the data. * `str` is *nearly* a synonym for `glimpse`. * It shows detailed information on the composition of an object. * You should reach for it first when you are trying to understand the low-level properties of an R object. ] --- ## Concatenation Vectors can be constructed and **concatenated** using the `c()` function. 
.pull-left[.small[ ``` r digits = c(1, 2, 3) hello = c("Hello", "World!") greet = c(c("hi", "hello"), c("bye", "jello")) ``` ]] .pull-right[ ``` r str(digits) ``` ``` ## num [1:3] 1 2 3 ``` ``` r str(hello) ``` ``` ## chr [1:2] "Hello" "World!" ``` ``` r str(greet) ``` ``` ## chr [1:4] "hi" "hello" "bye" "jello" ``` ] --- ## Vector length Get the number of entries in a vector with `length(x)`. .pull-left[ ``` r x = c(1, 2, 3) y = character(2) empty_dbl = numeric(0) empty_chr = character(0) ``` ] .pull-right[ ``` r length(x) ``` ``` ## [1] 3 ``` ``` r length(y) ``` ``` ## [1] 2 ``` ``` r length(empty_dbl) ``` ``` ## [1] 0 ``` ``` r length(empty_chr) ``` ``` ## [1] 0 ``` ] --- ## Concatenation Lists can also be concatenated using the `c()` function. .pull-left[ ``` r list1 = list(1, 2, 3) list2 = list( c("Hi!", "I'm a vector", "nested inside", "a list")) cat12 = c(list1, list2) ``` ] .pull-right[ ``` r str(cat12) ``` ``` ## List of 4 ## $ : num 1 ## $ : num 2 ## $ : num 3 ## $ : chr [1:4] "Hi!" "I'm a vector" "nested inside" "a list" ``` ] Compare to `list(list1, list2)`, which would nest the two lists rather than join them. --- ## length(c(x, y)) = length(x) + length(y) ``` r length(list1) ``` ``` ## [1] 3 ``` ``` r length(list2) ``` ``` ## [1] 1 ``` ``` r length(c(list1, list2)) ``` ``` ## [1] 4 ``` .question[ What would `length(c(list1, list2))` be when `list1 = list(c(1, 2, 3))`? ] --- ## Named lists We often want to name the elements of a list (we can also do this with vectors). This can make reading and accessing the list more straightforward. .small[ ``` r myotherlist = list(A = "hello", B = 1:4, "knock knock" = "who's there?") str(myotherlist) ``` ``` ## List of 3 ## $ A : chr "hello" ## $ B : int [1:4] 1 2 3 4 ## $ knock knock: chr "who's there?" ``` ``` r names(myotherlist) ``` ``` ## [1] "A" "B" "knock knock" ``` ``` r myotherlist$B ``` ``` ## [1] 1 2 3 4 ``` ] --- ## unlisting Really, this should be called unnesting. 
Often, it is [a code smell](https://en.wikipedia.org/wiki/Code_smell) in R, and indicates an issue with how the code was designed ``` r str(myotherlist) ``` ``` ## List of 3 ## $ A : chr "hello" ## $ B : int [1:4] 1 2 3 4 ## $ knock knock: chr "who's there?" ``` ``` r unlist(myotherlist, recursive = TRUE) ``` ``` ## A B1 B2 B3 ## "hello" "1" "2" "3" ## B4 knock knock ## "4" "who's there?" ``` .question[What just happened to the integers in `myotherlist$B`?] --- ## Converting between types .hand[with intention...] .pull-left[ ``` r x = 1:3 x ``` ``` ## [1] 1 2 3 ``` ``` r typeof(x) ``` ``` ## [1] "integer" ``` ] -- .pull-right[ ``` r y = as.character(x) y ``` ``` ## [1] "1" "2" "3" ``` ``` r typeof(y) ``` ``` ## [1] "character" ``` ] --- ## Converting between types .hand[with intention...] .pull-left[ ``` r x = c(TRUE, FALSE) x ``` ``` ## [1] TRUE FALSE ``` ``` r typeof(x) ``` ``` ## [1] "logical" ``` ] -- .pull-right[ ``` r y = as.numeric(x) y ``` ``` ## [1] 1 0 ``` ``` r typeof(y) ``` ``` ## [1] "double" ``` ] --- ## Converting between types .hand[without intention...] R will happily convert between various types without complaint when different types of data are concatenated in a vector, and that's not always a great thing! .pull-left[ ``` r c(1, "Hello") ``` ``` ## [1] "1" "Hello" ``` ``` r c(FALSE, 3L) ``` ``` ## [1] 0 3 ``` ] -- .pull-right[ ``` r c(1.2, 3L) ``` ``` ## [1] 1.2 3.0 ``` ``` r c(2L, "two") ``` ``` ## [1] "2" "two" ``` ] --- ## Explicit vs. 
implicit coercion Let's give formal names to what we've seen so far: -- - **Explicit coercion** is when you call a function like `as.logical()`, `as.numeric()`, `as.integer()`, `as.double()`, or `as.character()` -- - **Implicit coercion** happens when you use a vector in a specific context that expects a certain type of vector <!-- --- --> --- class: middle # Special values --- ## Special values - `NA`: Not available - `NaN`: Not a number - `Inf`: Positive infinity - `-Inf`: Negative infinity -- .pull-left[ ``` r pi / 0 ``` ``` ## [1] Inf ``` ``` r 0 / 0 ``` ``` ## [1] NaN ``` ] .pull-right[ ``` r 1/0 - 1/0 ``` ``` ## [1] NaN ``` ``` r 1/0 + 1/0 ``` ``` ## [1] Inf ``` ] --- ## `NA`s are special ❄️s ``` r x = c(1, 2, 3, 4, NA) ``` ``` r mean(x) ``` ``` ## [1] NA ``` ``` r mean(x, na.rm = TRUE) ``` ``` ## [1] 2.5 ``` ``` r summary(x) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's ## 1.00 1.75 2.50 2.50 3.25 4.00 1 ``` --- ## `NA`s are, by default, logical R uses `NA` to represent missing values in its data structures. ``` r typeof(NA) ``` ``` ## [1] "logical" ``` .footnote[There are also ```NA_integer_```, ```NA_real_``` and ```NA_character_```, occasionally needed to avoid warnings or errors about unplanned coercions.] --- ## Mental model for `NA`s - Unlike `NaN`, `NA`s are genuinely unknown values - But that doesn't mean they can't function in a logical way - Let's think about why `NA`s are logical... -- .question[ Why do the following give different answers? ] .pull-left[ ``` r # TRUE or NA TRUE | NA ``` ``` ## [1] TRUE ``` ] .pull-right[ ``` r # FALSE or NA FALSE | NA ``` ``` ## [1] NA ``` ] `\(\rightarrow\)` See next slide for answers... 
--- - `NA` is unknown, so it could be `TRUE` or `FALSE` .pull-left[ .midi[ - `TRUE | NA` ``` r TRUE | TRUE # if NA was TRUE ``` ``` ## [1] TRUE ``` ``` r TRUE | FALSE # if NA was FALSE ``` ``` ## [1] TRUE ``` ] ] .pull-right[ .midi[ - `FALSE | NA` ``` r FALSE | TRUE # if NA was TRUE ``` ``` ## [1] TRUE ``` ``` r FALSE | FALSE # if NA was FALSE ``` ``` ## [1] FALSE ``` ] ] - Doesn't make sense for mathematical operations - Makes sense in the context of missing data --- ## Vectors vs. lists .pull-left[ .small[ ``` r x = c(8,4,7) ``` ] .small[ ``` r x[1] ``` ``` ## [1] 8 ``` ] .small[ ``` r x[[1]] ``` ``` ## [1] 8 ``` ] ] -- .pull-right[ .small[ ``` r y = list(8,4,7) ``` ] .small[ ``` r y[2] ``` ``` ## [[1]] ## [1] 4 ``` ] .small[ ``` r y[[2]] ``` ``` ## [1] 4 ``` ] ] -- <br> **Note:** When using tidyverse code you'll rarely need to refer to elements using square brackets, but it's good to be aware of this syntax, especially since you might encounter it when searching for help online. --- ## Vectors vs. lists: the punchline * Plain vectors must be flat and "atomic": composed of a single atomic type, one of `logical`, `integer`, `double`, `complex` or `character`. * Lists can be arbitrarily nested and contain any R object. * Both have length. * Both can be named. --- class: middle # R Classes and attributes --- ## types are elemental .pull-left[ **R elements** * ``` r typeof(1) ``` ``` ## [1] "double" ``` * ``` r typeof("A") ``` ``` ## [1] "character" ``` * ``` r typeof(list(1)) ``` ``` ## [1] "list" ``` ] .pull-right[ **Meatspace elements** * hydrogen <img src = "l04/img/320px-Hydrogen_discharge_tube.jpg" width = "48%"> * carbon <img src = "l04/img/Graphite-and-diamond-with-scale.jpg" width = "48%"> * uranium <img src = "l04/img/600px-HEUraniumC.jpg" width = "48%"> ] ??? These types can either be atomic (logical, integer, double, complex, character) or generic (lists). 
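A minimal sketch of the atomic versus generic distinction, using base R's `is.atomic()` and `is.list()`. The object names `atom` and `generic` are illustrative only.

``` r
# `atom` and `generic` are hypothetical names for this sketch
atom = c(8, 4, 7)              # atomic vector: every element is a double
generic = list(8, "4", TRUE)   # list: elements may each have a different type

is.atomic(atom)      # is it built from a single atomic type?
is.list(atom)
is.atomic(generic)
is.list(generic)
```

The atomic vector should answer `TRUE` then `FALSE`, and the list should answer `FALSE` then `TRUE`.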
--- class: code70 ## Attributes are add-on properties .pull-left[ **R attributes** * ``` r attributes(1) ``` ``` ## NULL ``` * .scroll-box-10[ ``` r attributes(starwars) ``` ``` ## $names ## [1] "name" "height" "mass" "hair_color" ## [5] "skin_color" "eye_color" "birth_year" "sex" ## [9] "gender" "homeworld" "species" "films" ## [13] "vehicles" "starships" ## ## $row.names ## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ## [21] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 ## [41] 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ## [61] 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 ## [81] 81 82 83 84 85 86 87 ## ## $class ## [1] "tbl_df" "tbl" "data.frame" ``` ]] .pull-right[ **Meatspace attributes** * temperature * pH * pressure ] --- ## classes are compounds `class` is an `attribute` that signifies a **compound type** (made up of multiple elements or compounds). `class` is a special attribute. It affects what flavor of a function is applied. We'll discuss this in greater detail in a few weeks. --- class: code70 ## Classes .pull-left[ **R compounds** * ``` r class(starwars) ``` ``` ## [1] "tbl_df" "tbl" "data.frame" ``` * ``` r class(1) ``` ``` ## [1] "numeric" ``` * ``` r class(ggplot(starwars, aes(x = mass)) + geom_histogram()) ``` ``` ## [1] "gg" "ggplot" ``` ] .pull-right[ **Meatspace compounds** * `\(H_2O\)` <img src = "l04/img/640px-Ice_Block.jpg" width = "50%"> * NaCl <img src = "l04/img/640px-Salt_Farmers.jpg" width = "50%"> * `\(C_8H_{10}N_4O_2\)` <img src = "l04/img/640px-A_small_cup_of_coffee.jpeg" width = "50%"> ] --- class: middle # R Data "sets" --- ## Rectangular Data "sets" in R - A rectangular (spreadsheet-like) data "set" can be one of the following classes: + `tibble` + `data.frame` - We'll often work with `tibble`s: + `readr` package (e.g. 
`read_csv` function) loads data as a `tibble` by default + `tibble`s are part of the tidyverse, so they work well with other packages we are using + they implement safer and more sensible defaults, so are less likely to cause hard-to-track bugs in your code --- ## Data frames - A data frame is the most commonly used data structure in R. It is just a `list` of equal-length vectors (usually atomic). Each vector is treated as a column and elements of the vectors as rows. - A tibble is a type of data frame that ... makes your life (i.e. data analysis) easier. - Most often a data frame will be constructed by reading in data from a file, but we can also create one from scratch. --- ## Data frames ``` r df = tibble(x = 1:3, y = c("a", "b", "c")) typeof(df) ``` ``` ## [1] "list" ``` ``` r class(df) ``` ``` ## [1] "tbl_df" "tbl" "data.frame" ``` ``` r str(df) ``` ``` ## tibble [3 × 2] (S3: tbl_df/tbl/data.frame) ## $ x: int [1:3] 1 2 3 ## $ y: chr [1:3] "a" "b" "c" ``` --- ## Data frames (cont.) ``` r attributes(df) ``` ``` ## $class ## [1] "tbl_df" "tbl" "data.frame" ## ## $row.names ## [1] 1 2 3 ## ## $names ## [1] "x" "y" ``` ``` r typeof(df$y) ``` ``` ## [1] "character" ``` --- ## Working with tibbles in pipelines .question[ How many respondents have a below-average number of cats? ] ``` r mean_cats = cat_lovers %>% summarise(mean_cats = mean(number_of_cats)) cat_lovers %>% filter(number_of_cats < mean_cats) %>% nrow() ``` ``` ## [1] 60 ``` .question[ Do you believe this number? Why or why not? ] ??? Why this works without an error or warning is an entirely different question, and relates to the bowels of the `data.frame` methods for the `groupGeneric`. I still can't figure out what method is being dispatched here and why it does what it does... 
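One plausible reading, offered as a sketch rather than a confirmed diagnosis: comparing a plain vector with a one-cell data frame goes through the data frame comparison method, which appears to return a 1 × 1 logical matrix instead of one value per element. `filter()` would then recycle that single `TRUE` across all 60 rows. The toy objects below are ours (a base `data.frame` stands in for the tibble), and the behaviour may differ across R and dplyr versions.

``` r
# Toy stand-ins (hypothetical values): a few cat counts and a one-cell
# data frame shaped like the summarise() result
number_of_cats = c(0, 0, 1, 3)
mean_cats = data.frame(mean_cats = 0.833)

# Compare a length-4 vector with a 1 x 1 data frame
cmp = number_of_cats < mean_cats

# Inspect the shape of the result: a matrix, not one value per cat count
str(cmp)
dim(cmp)
```

If this reading is right, the fix on the next slides works because `pull()` turns the one-cell tibble back into a plain number, so the comparison is done element by element.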
--- ## A result of a pipeline is always a tibble ``` r mean_cats ``` ``` ## # A tibble: 1 × 1 ## mean_cats ## <dbl> ## 1 0.833 ``` ``` r str(mean_cats) ``` ``` ## tibble [1 × 1] (S3: tbl_df/tbl/data.frame) ## $ mean_cats: num 0.833 ``` --- ## `pull()` can be your new best friend But use it sparingly! ``` r mean_cats = cat_lovers %>% summarise(mean_cats = mean(number_of_cats)) %>% pull() cat_lovers %>% filter(number_of_cats < mean_cats) %>% nrow() ``` ``` ## [1] 32 ``` -- ``` r mean_cats ``` ``` ## [1] 0.8333333 ``` ``` r class(mean_cats) ``` ``` ## [1] "numeric" ``` --- ## How does tidyverse *want* you to do that code? ``` r cat_lovers %>% filter(number_of_cats < mean(number_of_cats)) %>% nrow() ``` ``` ## [1] 32 ``` This will work if the number that you need to calculate comes from **the same** dataset as the one you are filtering Note: since 2022, someone has added a warning that displays when you run the code as it appears in the notes. I don't think it's useful for identifying the problem here, but at least it might make you look at the results and see if they are correct. > Warning: Using one column matrices in `filter()` was deprecated in dplyr 1.1.0. > ℹ Please use one dimensional logical vectors instead. <!-- > This warning is displayed once every 8 hours. --> <!-- > Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated. --> --- class: center, middle # Factors --- ## Factors Factor objects are how R stores data for categorical variables (fixed numbers of discrete values). ``` r x = factor(c("BS", "MS", "PhD", "MS")) ``` ``` r attributes(x) ``` ``` ## $levels ## [1] "BS" "MS" "PhD" ## ## $class ## [1] "factor" ``` ``` r typeof(x) ``` ``` ## [1] "integer" ``` --- ## Read data in as character strings ``` r str(cat_lovers) ``` ``` ## tibble [60 × 3] (S3: tbl_df/tbl/data.frame) ## $ name : chr [1:60] "Bernice Warren" "Woodrow Stone" "Willie Bass" "Tyrone Estrada" ... ## $ number_of_cats: num [1:60] 0 0 1 3 3 2 1 1 0 0 ... 
## $ handedness : chr [1:60] "left" "left" "left" "left" ... ``` --- ## But coerce when plotting ``` r ggplot(cat_lovers, mapping = aes(x = handedness)) + geom_bar() ``` <img src="l04-data-types_files/figure-html/unnamed-chunk-70-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Use forcats to manipulate factors ``` r cat_lovers = cat_lovers %>% mutate(handedness = fct_relevel(handedness, "right", "left", "ambidextrous")) ``` ``` r ggplot(cat_lovers, mapping = aes(x = handedness)) + geom_bar() ``` <img src="l04-data-types_files/figure-html/unnamed-chunk-72-1.png" width="60%" style="display: block; margin: auto;" /> --- ## .pull-left[ Come for the functionality ] .pull-right[ <img src = "l04/img/forcats.png" width = "30%"> <!--  --> ] ... stay for the logo - R uses factors to handle categorical variables, variables that have a fixed and known set of possible values. For historical reasons many base R functions automatically convert character vectors to factors, and have been heartily cursed by generations of R programmers for this default behavior. - Factors **are** useful when you have true categorical data, and when you want to override the ordering of character vectors to improve display. The forcats package provides a suite of useful tools that solve common problems with factors. Source: [forcats.tidyverse.org](http://forcats.tidyverse.org/) --- ## Recap - Best to think of data as part of a tibble + This plays nicely with the `tidyverse` as well + Rows are observations, columns are variables - Be careful about data types / classes + Sometimes `R` makes silly assumptions about your data class + `tibble`s have safer defaults, but won't fold laundry for you + Think about your data in context, e.g. 
0/1 variable is most likely a `factor` + If a plot/output is not behaving the way you expect, first investigate the data class with `str` + If you are absolutely sure how a factor should be set up, overwrite it so that you don't have to keep track of it + `mutate` the variable with the correct class --- # Acknowledgments This lecture contains materials adapted from [Mine Çetinkaya-Rundel and colleagues](https://www2.stat.duke.edu/courses/Spring18/Sta199/slides/lec-slides/05b-coding-style-data-types.html#1) and [data science in a box](https://rstudio-education.github.io/datascience-box/course-materials/slides/u2-d10-data-types/u2-d10-data-types.html#1)