class: ur-title, center, middle, title-slide

.title[
# BST430 Lecture 10
]
.subtitle[
## Text Processing
]
.author[
### Tanzy Love, based on the course by Andrew McDavid
]
.institute[
### U of Rochester
]
.date[
### 2021-09-26 (updated: 2022-10-13 by TL)
]

---

Here's the [R code in this lecture](l10/l10-text.R)

Here's the [cardiac-dx](l10/data/cardiac-dx.csv) and [sample_sheet2](l10/ae/data/sample_sheet2.csv) datasets.

*Make sure the file path to each dataset is correct on your machine.*

---

## Cardiac diagnoses


```r
diagnoses = read_csv('l10/data/cardiac-dx.csv')
diagnoses
```

```
## # A tibble: 100 x 3
## id gender diagnoses
## <dbl> <chr> <chr>
## 1 26108 F {a,d,d} | abdominal situs ambiguous (abdominal he~
## 2 10949 M {s,d,l} | aortic stenosis - valvar | atrial septa~
## 3 8090 F {s,l,l} | aortic valve position relative to the p~
## 4 19800 M {s,l,l} | crisscross atrioventricular valves | de~
## 5 2708 M aberrant left subclavian artery | hypoplastic mai~
## 6 14031 F absence of the suprarenal inferior vena cava with~
## 7 4185 M aortic arch hypoplasia | aortic atresia | coronar~
## 8 313 F aortic arch hypoplasia | aortic atresia | hypopla~
## # ... with 92 more rows
```

.question[How can we get all the diagnoses of hypoplastic left heart?]

---

## Attempt 1


```r
diagnoses %>%
  count(diagnoses) %>%
  arrange(desc(n))
```

```
## # A tibble: 67 x 2
## diagnoses n
## <chr> <int>
## 1 screener diagnosis: none 26
## 2 hypoplastic left heart syndrome | screener diagnosis: no~ 5
## 3 atrial septal defect, secundum | screener diagnosis: atr~ 3
## 4 aortic atresia | hypoplastic left heart syndrome | mitra~ 2
## 5 screener diagnosis: none | tetralogy of fallot 2
## 6 {a,d,d} | abdominal situs ambiguous (abdominal heterotax~ 1
## 7 {s,d,l} | aortic stenosis - valvar | atrial septal defec~ 1
## 8 {s,l,l} | aortic valve position relative to the pulmonar~ 1
## # ... with 59 more rows
```

--

Too many unique diagnoses to make headway with this.

---

## Using string matching


```r
filter(diagnoses, str_detect(diagnoses, 'hypoplastic'))
```

```
## # A tibble: 50 x 3
## id gender diagnoses
## <dbl> <chr> <chr>
## 1 26108 F {a,d,d} | abdominal situs ambiguous (abdominal he~
## 2 10949 M {s,d,l} | aortic stenosis - valvar | atrial septa~
## 3 19800 M {s,l,l} | crisscross atrioventricular valves | de~
## 4 2708 M aberrant left subclavian artery | hypoplastic mai~
## 5 14031 F absence of the suprarenal inferior vena cava with~
## 6 4185 M aortic arch hypoplasia | aortic atresia | coronar~
## 7 313 F aortic arch hypoplasia | aortic atresia | hypopla~
## 8 10792 M aortic arch hypoplasia | aortic atresia | hypopla~
## # ... with 42 more rows
```

.question[What different sorts of hypoplasia are there?]

---

## Split into pieces


```r
diagnoses_row = diagnoses %>%
  tidyr::separate_rows(diagnoses,
* sep = " \\| ") #WTH? (sep is a regex, so | must be escaped)
diagnoses_row
```

```
## # A tibble: 412 x 3
## id gender diagnoses
## <dbl> <chr> <chr>
## 1 26108 F {a,d,d}
## 2 26108 F abdominal situs ambiguous (abdominal heterotaxy)
## 3 26108 F aortic atresia
## 4 26108 F double outlet right ventricle
## 5 26108 F heterotaxy syndrome
## 6 26108 F inferior vena cava, left sided
## 7 26108 F screener diagnosis: other: cavc, aa, hypoplastic ~
...
```

---

## Split into pieces


```r
diagnoses_row %>%
  filter(str_detect(diagnoses, 'hypoplastic')) %>%
  count(diagnoses) %>%
  arrange(desc(n))
```

```
## # A tibble: 15 x 2
## diagnoses n
## <chr> <int>
## 1 hypoplastic left heart syndrome 21
## 2 hypoplastic right ventricle (subnormal cavity volume) 8
## 3 hypoplastic left ventricle (subnormal cavity volume) 7
## 4 hypoplastic mitral valve 7
## 5 hypoplastic left pulmonary artery 6
## 6 hypoplastic main pulmonary artery 6
## 7 hypoplastic right pulmonary artery 5
## 8 hypoplastic tricuspid valve 3
## # ... with 7 more rows
```

---

## Plot co-occurrence

.panelset[
.panel[.panel-name[Code]


```r
diagnoses_row %>%
  filter(str_detect(diagnoses, 'hypoplastic')) %>%
  ggplot(aes(y = diagnoses, x = as.factor(id))) +
  geom_tile() +
  scale_x_discrete(breaks = NULL) +
  labs(y = "Diagnosis", x = 'Patient',
       title = 'Co-occurrence of hypoplastic heart disorders')
```
]
.panel[.panel-name[Plot]
<img src="l10-text_files/figure-html/unnamed-chunk-8-1.png" width="75%" style="display: block; margin: auto;" />
]
]

---

### Sensible factor orders

.panelset[
.panel[.panel-name[Code]


```r
diagnoses_row %>%
  filter(str_detect(diagnoses, 'hypoplastic')) %>%
* ggplot(aes(y = fct_infreq(diagnoses), x = fct_infreq(as.factor(id)))) +
  geom_tile() +
  scale_x_discrete(breaks = NULL) +
  labs(y = "Diagnosis", x = 'Patient',
       title = 'Co-occurrence of hypoplastic heart disorders')
```
]
.panel[.panel-name[Plot]
<img src="l10-text_files/figure-html/unnamed-chunk-9-1.png" width="75%" style="display: block; margin: auto;" />
]
]

---

### Wrap text width

.panelset[
.panel[.panel-name[Code]


```r
diagnoses_row %>%
  filter(str_detect(diagnoses, 'hypoplastic')) %>%
* mutate(diagnoses = str_wrap(diagnoses, width = 40)) %>%
  ggplot(aes(y = fct_infreq(diagnoses), x = fct_infreq(as.factor(id)))) +
  geom_tile() +
  scale_x_discrete(breaks = NULL) +
  labs(y = "Diagnosis", x = 'Patient',
       title = 'Co-occurrence of hypoplastic heart disorders')
```
]
.panel[.panel-name[Plot]
<img src="l10-text_files/figure-html/unnamed-chunk-10-1.png" width="75%" style="display: block; margin: auto;" />
]
]

---

### Adjust text size and justification

.panelset[
.panel[.panel-name[Code]


```r
diagnoses_row %>%
  filter(str_detect(diagnoses, 'hypoplastic')) %>%
  mutate(diagnoses = str_wrap(diagnoses, width = 40)) %>%
  ggplot(aes(y = fct_infreq(diagnoses), x = fct_infreq(as.factor(id)))) +
  geom_tile() +
* theme(axis.text.y = element_text(hjust = 0, vjust = 0, size = 8)) +
  scale_x_discrete(breaks = NULL) +
  labs(y = "Diagnosis", x = 'Patient',
       title = 'Co-occurrence of hypoplastic heart disorders')
```
]
.panel[.panel-name[Plot]
<img src="l10-text_files/figure-html/unnamed-chunk-11-1.png" width="75%" style="display: block; margin: auto;" />
]
]

---

# Topics

* Why text processing
* Low level processing:
  * concatenate, count characters, substring, split strings.
* Regular expressions (aka regex):
  * detect, extract, replace
* Text mining
  * Tokenizing, filtering, analysis

---

## Low level text processing

* concatenate with `stringr::str_c()` and `glue::glue()`
* count characters with `base::nchar()`
* extract and replace substrings with `stringr::str_sub()`
* split with `str_split_fixed()` (generally) or `str_split()` (less often)

---

## Packages

.pull-left[
`stringr` and `glue` rationalize much of text processing, which is otherwise a bit of a thicket in R.
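For instance, the two packages handle missing values more predictably than base R's `paste0()`. A quick sketch (expected output shown as comments; this assumes `stringr` is loaded, as it is throughout these slides):


```r
x = c("cat", NA)
# base R quietly turns NA into the text "NA"
paste0(x, "!")
# [1] "cat!" "NA!"
# str_c() propagates the missing value instead
str_c(x, "!")
# [1] "cat!" NA
```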
]
.pull-right[


]

---

## Concatenate strings

.pull-left[

```r
names = c("Jeff B.", "Larry E.", "Warren B.")
favorite_food = c("caviar", "cake", "Pappy Van Winkle")
str_c(names, " likes ", #note additional spaces
      favorite_food, ".")
```

```
## [1] "Jeff B. likes caviar."
## [2] "Larry E. likes cake."
## [3] "Warren B. likes Pappy Van Winkle."
```
]
.pull-right[

```r
dinner = glue::glue("{names} likes {favorite_food}.")
dinner
```

```
## Jeff B. likes caviar.
## Larry E. likes cake.
## Warren B. likes Pappy Van Winkle.
```
]

---

## Some special characters

* \n newline
* \r carriage return
* \t tab
* \f form feed
* \Unnnnnnnn Unicode character with given code
* \\\ literal backslash (.alert[this one will prove to be especially annoying...])
* \" literal quote

Others are listed in `?'"'` (the help page on quotes).

---

## Glue with newlines and Unicode


```r
glue::glue("{names} \n {favorite_food} \U1F643.")
```

```
## Jeff B.
## caviar <U+0001F643>.
## Larry E.
## cake <U+0001F643>.
## Warren B.
## Pappy Van Winkle <U+0001F643>.
```

---

## Count characters


```r
names
```

```
## [1] "Jeff B." "Larry E." "Warren B."
```

```r
nchar(names)
```

```
## [1] 7 8 9
```

---

## Extract substrings

.pull-left[
Extract

```r
str_sub(dinner, 1, 11)
```

```
## [1] "Jeff B. lik" "Larry E. li" "Warren B. l"
```
]
.pull-right[
Replace

```r
str_sub(dinner,
        #space + l-
        nchar(names) + 2,
        #space + l-i-k-e
        nchar(names) + 6
        ) = "demands"
dinner
```

```
## [1] "Jeff B. demands caviar."
## [2] "Larry E. demands cake."
## [3] "Warren B. demands Pappy Van Winkle."
```
]

---

## Split strings

Get a character matrix, padding / collapsing excess fields.


```r
str_split_fixed(dinner, " ", 4)
```

```
## [,1] [,2] [,3] [,4]
## [1,] "Jeff" "B." "demands" "caviar."
## [2,] "Larry" "E." "demands" "cake."
## [3,] "Warren" "B." "demands" "Pappy Van Winkle."
```

```r
str_split_fixed(dinner, " ", 6)
```

```
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] "Jeff" "B." "demands" "caviar." "" ""
## [2,] "Larry" "E." "demands" "cake." "" ""
## [3,] "Warren" "B." "demands" "Pappy" "Van" "Winkle."
```

---

## Split strings

Get exactly what you ask for.


```r
str_split(dinner, " ")
```

```
## [[1]]
## [1] "Jeff" "B." "demands" "caviar."
## 
## [[2]]
## [1] "Larry" "E." "demands" "cake."
## 
## [[3]]
## [1] "Warren" "B." "demands" "Pappy" "Van" "Winkle."
```

Also recall `tidyr::separate` and `tidyr::separate_rows`.

---

## Other handy low-level string manipulations

* Change case with `str_to_lower()`, `str_to_upper()`, `str_to_title()`
* Remove leading/trailing whitespace with `str_trim()`, or collapse repeated internal whitespace with `str_squish()`
* Wrap long lines with `stringr::str_wrap()`
* Truncate long strings with `str_trunc()`, or abbreviate them with `base::abbreviate()`

---

## Application Exercise

.hand[Let's try.]

Read some data


```r
data = read_csv("l10/ae/data/sample_sheet2.csv")
data
```

```
## # A tibble: 24 x 6
## library_id folder_name treat~1 sampl~2 tissu~3 dataset
## <dbl> <chr> <dbl> <chr> <chr> <chr>
## 1 1 Sample_300_0150_PBL~ 300 300_01~ PBL 300_01~
## 2 2 Sample_300_0150_Syn~ 300 300_01~ Syn 300_01~
## 3 4 Sample_300_0171_PBL~ 300 300_01~ PBL 300_01~
## 4 5 Sample_300_0171_Syn~ 300 300_01~ Syn 300_01~
## 5 6 Sample_300_0173_PBL~ 300 300_01~ PBL 300_01~
## 6 7 Sample_300_0173_Syn~ 300 300_01~ Syn 300_01~
## 7 8 Sample_300_0174_PBL~ 300 300_01~ PBL 300_01~
## 8 9 Sample_300_0174_Syn~ 300 300_01~ Syn 300_01~
## # ... with 16 more rows, and abbreviated variable names
## # 1: treatment, 2: sample_ID, 3: tissue_source
```

---

Split the `folder_name` field by the '_' character.
This will give you a character `matrix`.

--


```r
char_matrix = str_split_fixed(data$folder_name, pattern = "_", n = 6)
char_matrix
```

```
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] "Sample" "300" "0150" "PBL" "BT" "5"
## [2,] "Sample" "300" "0150" "Syn" "BT" "5"
## [3,] "Sample" "300" "0171" "PBL" "BT" "5"
## [4,] "Sample" "300" "0171" "Syn" "BT" "5"
## [5,] "Sample" "300" "0173" "PBL" "BT" "5"
## [6,] "Sample" "300" "0173" "Syn" "BT" "5"
## [7,] "Sample" "300" "0174" "PBL" "BT" "5"
## [8,] "Sample" "300" "0174" "Syn" "BT" "5"
## [9,] "Sample" "300" "0392" "PBL" "BT" "5"
## [10,] "Sample" "300" "0392" "Syn" "BT" "5"
## [11,] "Sample" "300" "0410" "PBL" "BT" "5"
## [12,] "Sample" "300" "0410" "Syn" "BT" "5"
## [13,] "Sample" "300" "0414" "PBL" "BT" "5"
## [14,] "Sample" "300" "0414" "Syn" "BT" "5"
## [15,] "Sample" "300" "0415" "Syn" "BT" "5"
## [16,] "Sample" "300" "0416" "Syn" "BT" "5"
## [17,] "Sample" "300" "1883" "Syn" "BT" "5"
## [18,] "Sample" "300" "1930" "PBL" "BT" "5"
## [19,] "Sample" "300" "1930" "Syn" "BT" "5"
## [20,] "Sample" "301" "0174" "PBL" "BT" "5"
## [21,] "Sample" "301" "0174" "Syn" "BT" "5"
## [22,] "Sample" "301" "0174" "Syn" "PC" "5"
## [23,] "Sample" "301" "0270" "PBL" "BT" "5"
## [24,] "Sample" "301" "0270" "Syn" "BT" "5"
```

---

Using this and `str_c`, recreate the fields `treatment`, `sample_ID`, `tissue_source`. You can extract various columns using `char_matrix[,integer_of_the_column_you_want]`.

--


```r
data %>%
  select(folder_name) %>%
  mutate(treatment = char_matrix[,2],
         sample_ID = str_c(char_matrix[,2], "_", char_matrix[,3]),
         tissue_source = char_matrix[,4])
```

```
## # A tibble: 24 x 4
## folder_name treatment sample_ID tissue_source
## <chr> <chr> <chr> <chr>
## 1 Sample_300_0150_PBL_BT_5 300 300_0150 PBL
## 2 Sample_300_0150_Syn_BT_5 300 300_0150 Syn
## 3 Sample_300_0171_PBL_BT_5 300 300_0171 PBL
## 4 Sample_300_0171_Syn_BT_5 300 300_0171 Syn
## 5 Sample_300_0173_PBL_BT_5 300 300_0173 PBL
## 6 Sample_300_0173_Syn_BT_5 300 300_0173 Syn
## 7 Sample_300_0174_PBL_BT_5 300 300_0174 PBL
## 8 Sample_300_0174_Syn_BT_5 300 300_0174 Syn
## # ... with 16 more rows
```

---

Convert the `dataset` field to be entirely lowercase with `str_to_lower`.

--


```r
data %>%
  select(dataset) %>%
  mutate(new_data = str_to_lower(dataset))
```

```
## # A tibble: 24 x 2
## dataset new_data
## <chr> <chr>
## 1 300_0150_PBL_1 300_0150_pbl_1
## 2 300_0150_Syn_2 300_0150_syn_2
## 3 300_0171_PBL_4 300_0171_pbl_4
## 4 300_0171_Syn_5 300_0171_syn_5
## 5 300_0173_PBL_6 300_0173_pbl_6
## 6 300_0173_Syn_7 300_0173_syn_7
## 7 300_0174_PBL_8 300_0174_pbl_8
## 8 300_0174_Syn_9 300_0174_syn_9
## # ... with 16 more rows
```

---

Collapse the `sample_ID` field into a single character vector (length 1) separated by semicolons ";" using `str_c(..., collapse = ...)`.

--


```r
str_c(data$sample_ID, collapse=";")
```

```
## [1] "300_0150;300_0150;300_0171;300_0171;300_0173;300_0173;300_0174;300_0174;300_0392;300_0392;300_0410;300_0410;300_0414;300_0414;300_0415;300_0416;300_1883;300_1930;300_1930;301_0174;301_0174;301_0174;301_0270;301_0270"
```

---

class: middle

.hand[Regular Expressions]

---

## Regular expressions

> Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

-- Jamie Zawinski (who coined the name Mozilla)

* Are like *find-replace*, *wildcards* \*
* Are found in nearly every computer language
* Can be just the ticket to solving *some* problems

---

class: bg-green

## Syntax

Write what you want to match (if it's alphanumeric).

```r
lunch = c("one app", "two appetizers", "three apples")
str_view_all(lunch, 'apple')
```
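The interactive `str_view_all()` widget doesn't render in these notes, so here is a plain-console check of the same pattern (a sketch; expected output shown as comments):


```r
# which elements contain the literal string 'apple'?
str_detect(lunch, 'apple')
# [1] FALSE FALSE  TRUE    (only "three apples" matches)
```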
---

class: bg-green

## Match multiple things: wildcard

`.` is a generic wildcard: it matches any single character.


```r
str_view_all(lunch, 'app.')
```
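Again, the widget output isn't shown here; a `str_detect()` sketch of the same wildcard pattern:


```r
# 'app.' needs one more character after "app", so the bare
# "app" at the end of "one app" does not match
str_detect(lunch, 'app.')
# [1] FALSE  TRUE  TRUE
```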
---

class: bg-green

## Match multiple things: character class

`[<set>]` is a character class: it matches any single character in `<set>`. Specify a range of characters with `[<char>-<char>]`. Invert a class with `[^<set>]`.


```r
str_view_all(lunch, 'app[le]')
```
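A console sketch of the same class, using `str_extract()` to show exactly what matched:


```r
# 'app[le]' requires an 'l' or an 'e' immediately after "app"
str_extract(lunch, 'app[le]')
# [1] NA     "appe" "appl"
```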
---

class: bg-green

## Match multiple things: disjunction

`(<x>|<y>)` is a disjunction: it matches `<x>` or `<y>`.


```r
str_view_all(lunch, 'app(le|etizer)s')
```
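A sketch of what the disjunction captures:


```r
# the parentheses group the alternatives, so the trailing 's'
# applies to both "appetizers" and "apples"
str_extract(lunch, 'app(le|etizer)s')
# [1] NA           "appetizers" "apples"
```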
---

class: bg-green

## Quantifiers modify matches

1. `*` zero or more matches
2. `?` zero or one matches
3. `+` one or more matches
4. `{min,max}` matches between min and max times

Compare back to `"app."`, which didn't match the first string.


```r
str_view_all(lunch, 'app.*')
```
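A sketch showing how the quantifier rescues the first string:


```r
# '.*' means zero or more characters, so "one app" now matches too
str_extract(lunch, 'app.*')
# [1] "app"        "appetizers" "apples"
```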
---

## Match without consuming with zero-width identifiers

* `^` matches a zero-width "character" present at the start of all lines.
* `$` is the analogous character at the end.
* `\b` sits between "words".

For example, the string `red tired` can be thought of as

.darkgreen[^\b]red.darkgreen[\b] .darkgreen[\b]tired.darkgreen[\b$]

---

class: bg-green

## Require word boundary

We must double the `\` to keep R from interpreting it as an escape character.


```r
str_view_all("red tired", "\\bred\\b")
```
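A sketch of what actually matches with the boundaries in place:


```r
# only the standalone word matches; the "red" inside "tired" is skipped
str_extract_all("red tired", "\\bred\\b")
# [[1]]
# [1] "red"
```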
---

class: bg-green

## Match unconditionally


```r
str_view_all("red tired", "red")
```
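Compare the match counts with and without the boundaries (a sketch):


```r
# without \\b both occurrences match, including the one inside "tired"
str_count("red tired", "red")
# [1] 2
str_count("red tired", "\\bred\\b")
# [1] 1
```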
---

## Using regular expressions

* Test for a match with `str_detect()`.
* Return the first `str_extract()` or all `str_extract_all()` matching portions of a string.
* Return the first `str_match()` or all `str_match_all()` matching portions of a string **and capture groups**.
* Replace the first `str_replace()` or all `str_replace_all()` matching portions of a string and capture groups.

---

## `str_detect()`


```r
str_detect(string = c("A", "AA", "AB", "B"), pattern = "A")
```

```
## [1] TRUE TRUE TRUE FALSE
```

```r
str_detect(string = lunch, pattern = 'app.')
```

```
## [1] FALSE TRUE TRUE
```

---

## `str_extract()`


```r
feline = c("The fur of cats goes by many names.",
           "Infimum (the cat) is a cat with a most baleful meow.",
           "Dog.")
str_extract(string = feline, pattern = "cat")
```

```
## [1] "cat" "cat" NA
```

```r
str_extract_all(string = feline, pattern = "cat")
```

```
## [[1]]
## [1] "cat"
## 
## [[2]]
## [1] "cat" "cat"
## 
## [[3]]
## character(0)
```

---

## `str_match()`

For simple queries, behaves like `str_extract()`


```r
feline = c("The fur of cats goes by many names.",
           "Infimum (the cat) is a cat with a most baleful meow.",
           "Dog.")
str_match(feline, "cat")
```

```
## [,1]
## [1,] "cat"
## [2,] "cat"
## [3,] NA
```

---

## `str_match()`

But returns **capture groups** `(<expression>)` separately.


```r
feline = c("The fur of cats goes by many names.",
           "Infimum (the cat) is a cat with a most baleful meow.",
           "Dog.")
str_match(feline, "(\\w*) cat.? (\\w*)")
```

```
## [,1] [,2] [,3]
## [1,] "of cats goes" "of" "goes"
## [2,] "the cat) is" "the" "is"
## [3,] NA NA NA
```

`\w = [A-Za-z0-9_]`. We must double the `\` to keep R from interpreting it as an escape character.

---

## `str_match_all()`


```r
feline = c("The fur of cats goes by many names.",
           "Infimum (the cat) is a cat with a most baleful meow.",
           "Dog.")
str_match_all(feline, "(\\w*) cat.? (\\w*)")
```

```
## [[1]]
## [,1] [,2] [,3]
## [1,] "of cats goes" "of" "goes"
## 
## [[2]]
## [,1] [,2] [,3]
## [1,] "the cat) is" "the" "is"
## [2,] "a cat with" "a" "with"
## 
## [[3]]
## [,1] [,2] [,3]
```

---

## `str_replace()`


```r
feline = c("The fur of cats goes by many names.",
           "Infimum (the cat) is a cat with a most baleful meow.",
           "Dog.")
str_replace(feline, "cat", "murder machine")
```

```
## [1] "The fur of murder machines goes by many names."
## [2] "Infimum (the murder machine) is a cat with a most baleful meow."
## [3] "Dog."
```

---

## `str_replace_all()`


```r
feline = c("The fur of cats goes by many names.",
           "Infimum (the cat) is a cat with a most baleful meow.",
           "Dog.")
str_replace_all(feline, "cat", "murder machine")
```

```
## [1] "The fur of murder machines goes by many names."
## [2] "Infimum (the murder machine) is a murder machine with a most baleful meow."
## [3] "Dog."
```

---

## `str_replace()` can also use capture groups

Use `\1` to refer to the first capture group, `\2` for the second, etc. Note the `\\` because `\` must be escaped in R.


```r
str_replace_all(feline, "(\\w*)", "\\1\\1")
```

```
## [1] "TheThe furfur ofof catscats goesgoes byby manymany namesnames."
## [2] "InfimumInfimum (thethe catcat) isis aa catcat withwith aa mostmost balefulbaleful meowmeow."
## [3] "DogDog."
```

Actually, any regular expression can use a capture group, both for matching and replacing!

---

## Application exercise

.hand[Let's try].

If devtools works for you, you might also experiment with installing `devtools::install_github("gadenbuie/regexplain")`. See https://www.garrickadenbuie.com/project/regexplain/ for example usage.
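As a warm-up, here's the claim above that capture groups also work during matching: a backreference `\\1` inside the pattern requires the captured text to repeat (a sketch; the example vector is invented for illustration):


```r
# "(\\w\\w)\\1" captures two word characters, then demands the
# same two characters again via the backreference \\1
str_detect(c("haha", "abab", "abcd"), "(\\w\\w)\\1")
# [1]  TRUE  TRUE FALSE
```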
We'll use the dataset `words`, which is loaded automatically when you load the `stringr` package. It contains 980 common English words.


```r
word_df = tibble(word = words)
```

---

How many words contain an "x" anywhere in them? List them.

--


```r
filter(word_df, str_detect(word, "x"))
```

```
## # A tibble: 17 x 1
## word
## <chr>
## 1 box
## 2 exact
## 3 example
## 4 except
## 5 excuse
## 6 exercise
## 7 exist
## 8 expect
## # ... with 9 more rows
```

---

How many words end in "x"? List them. Use `$` to match the end of the string.

--


```r
filter(word_df, str_detect(word, "x$"))
```

```
## # A tibble: 4 x 1
## word
## <chr>
## 1 box
## 2 sex
## 3 six
## 4 tax
```

---

Do any words start with "x"? Use `^` to match the start of the string.

--


```r
filter(word_df, str_detect(word, "^x"))
```

```
## # A tibble: 0 x 1
## # ... with 1 variable: word <chr>
```

---

Using wildcards `.` and quantifiers `+` (rather than the results of the previous exercises), find all the words that contain "x" in the interior (but not at the start or end). Check that the numbers of results from the previous exercises are consistent with each other.

--


```r
filter(word_df, str_detect(word, '.+x.+'))
```

```
## # A tibble: 13 x 1
## word
## <chr>
## 1 exact
## 2 example
## 3 except
## 4 excuse
## 5 exercise
## 6 exist
## 7 expect
## 8 expense
## # ... with 5 more rows
```

---

On average, how many vowels are there per word? (Hint: use `str_count` and `[]` to define a character class.) What is the average number of vowels per letter (the vowel count normalized by word length)?

--


```r
word_df %>%
  mutate(vowels = str_count(word, "[aeiouAEIOU]"),
         vowels_per_letter = vowels/nchar(word)) %>%
  summarise(mean(vowels), mean(vowels_per_letter))
```

```
## # A tibble: 1 x 2
## `mean(vowels)` `mean(vowels_per_letter)`
## <dbl> <dbl>
## 1 1.99 0.380
```

---

List all the words with three or more vowels in a row. Use `{min_matches,max_matches}` as a quantifier.

--


```r
filter(word_df, str_detect(word, '[aeiou]{3,10}'))
```

```
## # A tibble: 6 x 1
## word
## <chr>
## 1 beauty
## 2 obvious
## 3 previous
## 4 quiet
## 5 serious
## 6 various
```

---

## Sentences

Now consider the `sentences` data set:


```r
sentence_df = tibble(sentence = sentences)
sentence_df
```

```
## # A tibble: 720 x 1
## sentence
## <chr>
## 1 The birch canoe slid on the smooth planks.
## 2 Glue the sheet to the dark blue background.
## 3 It's easy to tell the depth of a well.
## 4 These days a chicken leg is a rare dish.
## 5 Rice is often served in round bowls.
## 6 The juice of lemons makes fine punch.
## 7 The box was thrown beside the parked truck.
## 8 The hogs were fed chopped corn and garbage.
## # ... with 712 more rows
```

---

Extract the first word from each sentence. Hint: negate the space character class "[ ]" to match everything except a space.

--


```r
sentence_df %>%
  mutate(first_word = str_extract(sentence, '\\w*')) %>%
  head(3) # '\\w' doesn't include apostrophes
```

```
## # A tibble: 3 x 2
## sentence first_word
## <chr> <chr>
## 1 The birch canoe slid on the smooth planks. The
## 2 Glue the sheet to the dark blue background. Glue
## 3 It's easy to tell the depth of a well. It
```

```r
sentence_df %>%
  mutate(first_word = str_extract(sentence, '([^ ])*'))
```

```
## # A tibble: 720 x 2
## sentence first_word
## <chr> <chr>
## 1 The birch canoe slid on the smooth planks. The
## 2 Glue the sheet to the dark blue background. Glue
## 3 It's easy to tell the depth of a well. It's
## 4 These days a chicken leg is a rare dish. These
## 5 Rice is often served in round bowls. Rice
## 6 The juice of lemons makes fine punch. The
## 7 The box was thrown beside the parked truck. The
## 8 The hogs were fed chopped corn and garbage. The
## # ... with 712 more rows
```

---

Return all the sentences that contain the colors "red", "blue" or "green". Use the `|` disjunction.

--


```r
filter(sentence_df, str_detect(sentence, ("red|blue|green")))
```

```
## # A tibble: 54 x 1
## sentence
## <chr>
## 1 Glue the sheet to the dark blue background.
## 2 Two blue fish swam in the tank.
## 3 The colt reared and threw the tall rider.
## 4 The wide road shimmered in the hot sun.
## 5 See the cat glaring at the scared mouse.
## 6 A wisp of cloud hung in the blue air.
## 7 He ordered peach pie with ice cream.
## 8 Pure bred poodles have curls.
## # ... with 46 more rows
```

---

Extract the first word ending in "s". Use a capture group `()`, `str_match()` and the everything-but-space `[^ ]` character class.

--


```r
sentence_df = sentence_df %>%
  mutate(ends_in_s = str_match(sentence, "([^ ]*s)\\b")[,2])
sentence_df
```

```
## # A tibble: 720 x 2
## sentence ends_in_s
## <chr> <chr>
## 1 The birch canoe slid on the smooth planks. planks
## 2 Glue the sheet to the dark blue background. <NA>
## 3 It's easy to tell the depth of a well. It's
## 4 These days a chicken leg is a rare dish. days
## 5 Rice is often served in round bowls. is
## 6 The juice of lemons makes fine punch. lemons
## 7 The box was thrown beside the parked truck. was
## 8 The hogs were fed chopped corn and garbage. hogs
## # ... with 712 more rows
```

---

(Stretch goal) Notice that two questions ago, we also matched the sentence

> The colt reared and threw the tall rider.

because "reared" contains "red". Fix the regular expression so that it matches only complete words, not fragments, using the `\b` word boundary marker. Hint: use "\\\\b" to keep the backslash from being used as an escape character.

--


```r
filter(sentence_df, str_detect(sentence, ("\\bred\\b|\\bblue\\b|\\bgreen\\b")))
```

```
## # A tibble: 26 x 2
## sentence ends_in_s
## <chr> <chr>
## 1 Glue the sheet to the dark blue background. <NA>
## 2 Two blue fish swam in the tank. <NA>
## 3 A wisp of cloud hung in the blue air. <NA>
## 4 The spot on the blotter was made by green ink. was
## 5 The sofa cushion is red and of light weight. is
## 6 The sky that morning was clear and bright blue. was
## 7 A blue crane is a tall wading bird. is
## 8 It is hard to erase blue or red ink. is
## # ... with 18 more rows
```

<!-- --- -->
<!-- class: middle -->
<!--  -->
<!-- Text mining using `tidytext` -->
<!-- --- -->
<!-- ## Text mining using `tidytext` -->
<!-- Text is inherently high-dimensional and noisy data. We could spend weeks on this. Instead, we'll have to be content to know what we don't know: -->
<!-- * Sampling text data and its potential ascertainment biases -->
<!-- * Handling non-Roman (ASCII) characters -->
<!-- * Parsing into tokens -->
<!-- * Filtering low-content words -->
<!-- * Dimension reduction, e.g., latent Dirichlet allocation or non-negative matrix factorization -->
<!-- * Embeddings using pre-trained neural networks -->
<!-- Julia Silge has [one book on classical text mining](https://www.tidytextmining.com/) and [another on machine learning on text](https://smltar.com/). -->
<!-- --- -->
<!-- ## Most important functionality -->
<!-- * `unnest_tokens()` split a string into tokens (words, bi-grams, etc) as a data frame -->
<!-- * `bind_tf_idf` calculate term and inverse-document frequencies. -->
<!-- * `cast_sparse` convert to a (sparse) document-term matrix. -->
<!-- --- -->
<!-- ## Austen vs Kafka -->
<!-- ```{r} -->
<!-- library(tidytext) -->
<!-- book_names = tibble(gutenberg_id = c(158, 1342, 5200, 7849), -->
<!--                     title = c('Emma', 'Pride and prejudice', -->
<!--                               'Metamorphosis', 'The Trial')) -->
<!-- books = gutenbergr::gutenberg_download(book_names$gutenberg_id) %>% left_join(book_names) -->
<!-- ``` -->
<!-- .scroll-box-10[ -->
<!-- ```{r, output.lines = 24} -->
<!-- books %>% group_by(title) %>% slice_head(n=6) -->
<!-- ``` -->
<!-- ] -->
<!-- --- -->
<!-- ## Get words -->
<!-- ```{r} -->
<!-- book_words = unnest_tokens(books, text, output = 'word', drop = TRUE) -->
<!-- book_words -->
<!-- ``` -->
<!-- --- -->
<!-- ## Count words by book -->
<!-- ```{r} -->
<!-- word_counts = book_words %>% -->
<!--   group_by(title) %>% count(title, word) %>% -->
<!--   arrange(desc(n)) -->
<!-- word_counts %>% slice_head(n = 3) -->
<!-- ``` -->
<!-- --- -->
<!-- ## Remove "stop" words -->
<!-- Stop words are common, low-semantic value words. Sometimes useful to remove. -->
<!-- ```{r} -->
<!-- word_counts %>% anti_join(get_stopwords()) %>% slice_head(n = 3) -->
<!-- ``` -->
<!-- --- -->
<!-- ## Term frequency in Kafka vs Austen -->
<!-- ```{r} -->
<!-- total_words = word_counts %>% -->
<!--   group_by(title) %>% -->
<!--   summarize(total = sum(n)) -->
<!-- word_counts = left_join(word_counts, total_words) -->
<!-- word_counts -->
<!-- ``` -->
<!-- --- -->
<!-- ## Term frequency in Kafka vs Austen -->
<!-- ```{r plottf, warning = FALSE} -->
<!-- ggplot(word_counts, aes(n/total)) + -->
<!--   geom_histogram(show.legend = FALSE) + -->
<!--   xlim(NA, 0.0009) + -->
<!--   facet_wrap(~title, ncol = 2, scales = "free_y") + theme_minimal() -->
<!-- ``` -->
<!-- --- -->
<!-- ## Zipf's law -->
<!-- Distributions like those on the previous slide are typical in language. A classic version of this relationship is called Zipf's law. -->
<!-- > Zipf's law states that the frequency that a word appears is inversely proportional to its rank. -->
<!-- --- -->
<!-- ## Zipf's law -->
<!-- .panelset[ -->
<!-- .panel[.panel-name[Code] -->
<!-- ```{r freq_by_rank, fig.show='hide'} -->
<!-- freq_by_rank = word_counts %>% -->
<!--   group_by(title) %>% -->
<!--   mutate(rank = row_number(), -->
<!--          `term frequency` = n/total) %>% -->
<!--   ungroup() -->
<!-- freq_by_rank %>% -->
<!--   ggplot(aes(x = rank, y = `term frequency`, color = title)) + -->
<!--   geom_abline(intercept = -0.62, slope = -1, -->
<!--               color = "gray50", linetype = 2) + -->
<!--   geom_line(size = 1.1, alpha = 0.8, show.legend = FALSE) + -->
<!--   scale_x_log10() + -->
<!--   scale_y_log10() + -->
<!--   theme_minimal() -->
<!-- ``` -->
<!-- ] -->
<!-- .panel[.panel-name[Plot] -->
<!-- ```{r ref.label = 'freq_by_rank', echo = FALSE} -->
<!-- ``` -->
<!-- ] -->
<!-- ] -->
<!-- --- -->
<!-- ## Sentiment analysis -->
<!-- ```{r} -->
<!-- word_sentiments = word_counts %>% -->
<!--   left_join(sentiments) %>% #<< -->
<!--   filter(!is.na(sentiment)) %>% -->
<!--   group_by(title) %>% -->
<!--   mutate(word_collapse = fct_lump_n(word, n = 10, w = n), -->
<!--          word_collapse = fct_reorder(word_collapse, n, sum)) %>% -->
<!--   select(title, word_collapse, sentiment, n) -->
<!-- word_sentiments -->
<!-- ``` -->
<!-- --- -->
<!-- ## Which is more happy? -->
<!-- ```{r} -->
<!-- ggplot(word_sentiments, aes(y = fct_reorder(word_collapse, n, .fun = sum), x = n, fill = sentiment)) + geom_col() + facet_wrap(~title, scales = 'free_x') + ylab("Word") + xlab("Occurrence") + theme_minimal() -->
<!-- ``` -->
<!-- --- -->
<!-- ## Term frequency and inverse document frequency -->
<!-- The inverse document frequency is -->
<!-- `$$\text{idf}(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}$$` -->
<!-- The IDF thus ranges from 0 for words that appear in every document up to `\(log(n)\)` for a word unique across documents. -->
<!-- The term frequency is just the word counts, normalized to the number of words per text, so the popular TF-IDF<sup>1</sup> metric is just -->
<!-- `$$\text{tf-idf}(\text{term}) = \text{idf}(\text{term}) \times \text{tf}(\text{term})$$` -->
<!-- .footnote[[1] Popular, and curiously devoid of an obvious statistical model. [Some attempts to link to information theory](https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Justification_of_idf) have been made.] -->
<!-- --- -->
<!-- ## Calculate TF-IDF -->
<!-- ```{r calc-tf, output.lines = 12} -->
<!-- word_counts = word_counts %>% bind_tf_idf(word, title, n) -->
<!-- word_counts -->
<!-- ``` -->
<!-- --- -->
<!-- ## TF-IDF of Kafka and Austen -->
<!-- This works relatively well to identify signature words -- some represent content, some represent author style (e.g. contractions used by Kafka) -->
<!-- ```{r, echo = FALSE} -->
<!-- word_counts %>% group_by(title) %>% slice_max(tf_idf, n = 15) %>% ungroup() %>% -->
<!--   mutate(word = reorder(word, tf_idf)) %>% -->
<!--   ggplot(aes(tf_idf, word)) + -->
<!--   geom_col(show.legend = FALSE) + -->
<!--   labs(x = "tf-idf", y = NULL) + -->
<!--   facet_wrap(~title, ncol = 2, scales = "free") + -->
<!--   theme_minimal() -->
<!-- ``` -->
<!-- --- -->
<!-- ## Occurrence matrix -->
<!-- Lastly, we might want to convert our counts to an occurrence matrix `\(\mathbf X = [x_{ij}]\)` where `\(x_{ij}\)` is the number of times document `\(i\)` contains term `\(j\)`. -->
<!-- Most `\(x_{ij}\)` will be zero, reflecting Zipf's law. We will almost always want to store it in a special format called a .alert[sparse matrix], that only stores the non-zero entries and their index in the matrix. -->
<!-- --- -->
<!-- ## `cast_sparse()` -->
<!-- ```{r} -->
<!-- X = cast_sparse(word_counts, title, word, n) -->
<!-- class(X) -->
<!-- dim(X) -->
<!-- sum(X>0) -->
<!-- ``` -->
<!-- This is useful for downstream modeling, such as latent Dirichlet allocation. -->
<!-- --- -->
<!-- # Resources -->
<!-- Julia Silge has [one book on classical text mining](https://www.tidytextmining.com/) and [another on machine learning on text](https://smltar.com/). -->