class: ur-title, center, middle, title-slide

.title[
# BST430 Lecture 10
]
.subtitle[
## Text Processing
]
.author[
### Tanzy Love, based on the course by Andrew McDavid
]
.institute[
### U of Rochester
]
.date[
### 2021-09-26 (updated: 2022-10-13 by TL)
]

---

Here's the [R code in this lecture](l10/l10-text.R)

Here's the [cardiac-dx](l10/data/cardiac-dx.csv) and [sample_sheet2](l10/ae/data/sample_sheet2.csv) datasets.

*Make sure the file path to each dataset is correct on your machine.*

---

## Cardiac diagnoses


```r
diagnoses = read_csv('l10/data/cardiac-dx.csv')
diagnoses
```

```
## # A tibble: 100 x 3
## id gender diagnoses
## <dbl> <chr> <chr>
## 1 26108 F {a,d,d} | abdominal situs ambiguous (abdominal he~
## 2 10949 M {s,d,l} | aortic stenosis - valvar | atrial septa~
## 3 8090 F {s,l,l} | aortic valve position relative to the p~
## 4 19800 M {s,l,l} | crisscross atrioventricular valves | de~
## 5 2708 M aberrant left subclavian artery | hypoplastic mai~
## 6 14031 F absence of the suprarenal inferior vena cava with~
## 7 4185 M aortic arch hypoplasia | aortic atresia | coronar~
## 8 313 F aortic arch hypoplasia | aortic atresia | hypopla~
## # ... with 92 more rows
```

.question[How can we get all the diagnoses of hypoplastic left heart?]

---

## Attempt 1


```r
diagnoses %>%
  count(diagnoses) %>%
  arrange(desc(n))
```

```
## # A tibble: 67 x 2
## diagnoses n
## <chr> <int>
## 1 screener diagnosis: none 26
## 2 hypoplastic left heart syndrome | screener diagnosis: no~ 5
## 3 atrial septal defect, secundum | screener diagnosis: atr~ 3
## 4 aortic atresia | hypoplastic left heart syndrome | mitra~ 2
## 5 screener diagnosis: none | tetralogy of fallot 2
## 6 {a,d,d} | abdominal situs ambiguous (abdominal heterotax~ 1
## 7 {s,d,l} | aortic stenosis - valvar | atrial septal defec~ 1
## 8 {s,l,l} | aortic valve position relative to the pulmonar~ 1
## # ... with 59 more rows
```

--

Too many unique diagnoses to make headway with this.

---

## Using string matching


```r
filter(diagnoses, str_detect(diagnoses, 'hypoplastic'))
```

```
## # A tibble: 50 x 3
## id gender diagnoses
## <dbl> <chr> <chr>
## 1 26108 F {a,d,d} | abdominal situs ambiguous (abdominal he~
## 2 10949 M {s,d,l} | aortic stenosis - valvar | atrial septa~
## 3 19800 M {s,l,l} | crisscross atrioventricular valves | de~
## 4 2708 M aberrant left subclavian artery | hypoplastic mai~
## 5 14031 F absence of the suprarenal inferior vena cava with~
## 6 4185 M aortic arch hypoplasia | aortic atresia | coronar~
## 7 313 F aortic arch hypoplasia | aortic atresia | hypopla~
## 8 10792 M aortic arch hypoplasia | aortic atresia | hypopla~
## # ... with 42 more rows
```

.question[What different sorts of hypoplasia are there?]

---

## Split into pieces


```r
diagnoses_row = diagnoses %>%
  tidyr::separate_rows(diagnoses,
* sep = " \\| ") #WTH? (sep is a regex, so | must be escaped)
diagnoses_row
```

```
## # A tibble: 412 x 3
## id gender diagnoses
## <dbl> <chr> <chr>
## 1 26108 F {a,d,d}
## 2 26108 F abdominal situs ambiguous (abdominal heterotaxy)
## 3 26108 F aortic atresia
## 4 26108 F double outlet right ventricle
## 5 26108 F heterotaxy syndrome
## 6 26108 F inferior vena cava, left sided
## 7 26108 F screener diagnosis: other: cavc, aa, hypoplastic ~
...
```

---

## Split into pieces


```r
diagnoses_row %>%
  filter(str_detect(diagnoses, 'hypoplastic')) %>%
  count(diagnoses) %>%
  arrange(desc(n))
```

```
## # A tibble: 15 x 2
## diagnoses n
## <chr> <int>
## 1 hypoplastic left heart syndrome 21
## 2 hypoplastic right ventricle (subnormal cavity volume) 8
## 3 hypoplastic left ventricle (subnormal cavity volume) 7
## 4 hypoplastic mitral valve 7
## 5 hypoplastic left pulmonary artery 6
## 6 hypoplastic main pulmonary artery 6
## 7 hypoplastic right pulmonary artery 5
## 8 hypoplastic tricuspid valve 3
## # ... with 7 more rows
```

---

## Plot co-occurrence

.panelset[
.panel[.panel-name[Code]


```r
diagnoses_row %>%
  filter(str_detect(diagnoses, 'hypoplastic')) %>%
  ggplot(aes(y = diagnoses, x = as.factor(id))) +
  geom_tile() +
  scale_x_discrete(breaks = NULL) +
  labs(y = "Diagnosis", x = 'Patient',
       title = 'Co-occurrence of hypoplastic heart disorders')
```
]
.panel[.panel-name[Plot]
<img src="l10-text_files/figure-html/unnamed-chunk-8-1.png" width="75%" style="display: block; margin: auto;" />
]
]

---

### Sensible factor orders

.panelset[
.panel[.panel-name[Code]


```r
diagnoses_row %>%
  filter(str_detect(diagnoses, 'hypoplastic')) %>%
* ggplot(aes(y = fct_infreq(diagnoses), x = fct_infreq(as.factor(id)))) +
  geom_tile() +
  scale_x_discrete(breaks = NULL) +
  labs(y = "Diagnosis", x = 'Patient',
       title = 'Co-occurrence of hypoplastic heart disorders')
```
]
.panel[.panel-name[Plot]
<img src="l10-text_files/figure-html/unnamed-chunk-9-1.png" width="75%" style="display: block; margin: auto;" />
]
]

---

### Wrap text width

.panelset[
.panel[.panel-name[Code]


```r
diagnoses_row %>%
  filter(str_detect(diagnoses, 'hypoplastic')) %>%
* mutate(diagnoses = str_wrap(diagnoses, width = 40)) %>%
  ggplot(aes(y = fct_infreq(diagnoses), x = fct_infreq(as.factor(id)))) +
  geom_tile() +
  scale_x_discrete(breaks = NULL) +
  labs(y = "Diagnosis", x = 'Patient',
       title = 'Co-occurrence of hypoplastic heart disorders')
```
]
.panel[.panel-name[Plot]
<img src="l10-text_files/figure-html/unnamed-chunk-10-1.png" width="75%" style="display: block; margin: auto;" />
]
]

---

### Adjust text size and justification

.panelset[
.panel[.panel-name[Code]


```r
diagnoses_row %>%
  filter(str_detect(diagnoses, 'hypoplastic')) %>%
  mutate(diagnoses = str_wrap(diagnoses, width = 40)) %>%
  ggplot(aes(y = fct_infreq(diagnoses), x = fct_infreq(as.factor(id)))) +
  geom_tile() +
* theme(axis.text.y = element_text(hjust = 0, vjust = 0, size = 8)) +
  scale_x_discrete(breaks = NULL) +
  labs(y = "Diagnosis", x = 'Patient',
       title = 'Co-occurrence of hypoplastic heart disorders')
```
]
.panel[.panel-name[Plot]
<img src="l10-text_files/figure-html/unnamed-chunk-11-1.png" width="75%" style="display: block; margin: auto;" />
]
]

---

# Topics

* Why text processing
* Low level processing:
  * concatenate, count characters, substring, split strings.
* Regular expressions (aka regex):
  * detect, extract, replace
* Text mining
  * Tokenizing, filtering, analysis

---

## Low level text processing

* concatenate with `stringr::str_c()` and `glue::glue()`
* count characters with `base::nchar()`
* extract and replace substrings with `stringr::str_sub()`
* split with `str_split_fixed()` (generally) or `str_split()` (less often)

---

## Packages

.pull-left[
`stringr` and `glue` rationalize much of text processing, which is otherwise a bit of a thicket in R.
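For instance, the two packages handle missing values more predictably than base R's `paste0()`. A quick sketch (expected output shown as comments; this assumes `stringr` is loaded, as it is throughout these slides):


```r
x = c("cat", NA)
# base R quietly turns NA into the text "NA"
paste0(x, "!")
# [1] "cat!" "NA!"
# str_c() propagates the missing value instead
str_c(x, "!")
# [1] "cat!" NA
```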
]
.pull-right[


]

---

## Concatenate strings

.pull-left[

```r
names = c("Jeff B.", "Larry E.", "Warren B.")
favorite_food = c("caviar", "cake", "Pappy Van Winkle")
str_c(names, " likes ", #note additional spaces
      favorite_food, ".")
```

```
## [1] "Jeff B. likes caviar."
## [2] "Larry E. likes cake."
## [3] "Warren B. likes Pappy Van Winkle."
```
]
.pull-right[

```r
dinner = glue::glue("{names} likes {favorite_food}.")
dinner
```

```
## Jeff B. likes caviar.
## Larry E. likes cake.
## Warren B. likes Pappy Van Winkle.
```
]

---

## Some special characters

* \n newline
* \r carriage return
* \t tab
* \f form feed
* \Unnnnnnnn Unicode character with given code
* \\\ literal backslash (.alert[this one will prove to be especially annoying...])
* \" literal quote

Others are listed in `?'"'` (the help page on quotes).

---

## Glue with newlines and Unicode


```r
glue::glue("{names} \n {favorite_food} \U1F643.")
```

```
## Jeff B.
## caviar <U+0001F643>.
## Larry E.
## cake <U+0001F643>.
## Warren B.
## Pappy Van Winkle <U+0001F643>.
```

---

## Count characters


```r
names
```

```
## [1] "Jeff B." "Larry E." "Warren B."
```

```r
nchar(names)
```

```
## [1] 7 8 9
```

---

## Extract substrings

.pull-left[
Extract

```r
str_sub(dinner, 1, 11)
```

```
## [1] "Jeff B. lik" "Larry E. li" "Warren B. l"
```
]
.pull-right[
Replace

```r
str_sub(dinner,
        #space + l-
        nchar(names) + 2,
        #space + l-i-k-e
        nchar(names) + 6
        ) = "demands"
dinner
```

```
## [1] "Jeff B. demands caviar."
## [2] "Larry E. demands cake."
## [3] "Warren B. demands Pappy Van Winkle."
```
]

---

## Split strings

Get a character matrix, padding / collapsing excess fields.


```r
str_split_fixed(dinner, " ", 4)
```

```
## [,1] [,2] [,3] [,4]
## [1,] "Jeff" "B." "demands" "caviar."
## [2,] "Larry" "E." "demands" "cake."
## [3,] "Warren" "B." "demands" "Pappy Van Winkle."
```

```r
str_split_fixed(dinner, " ", 6)
```

```
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] "Jeff" "B." "demands" "caviar." "" ""
## [2,] "Larry" "E." "demands" "cake." "" ""
## [3,] "Warren" "B." "demands" "Pappy" "Van" "Winkle."
```

---

## Split strings

Get exactly what you ask for.


```r
str_split(dinner, " ")
```

```
## [[1]]
## [1] "Jeff" "B." "demands" "caviar."
## 
## [[2]]
## [1] "Larry" "E." "demands" "cake."
## 
## [[3]]
## [1] "Warren" "B." "demands" "Pappy" "Van" "Winkle."
```

Also recall `tidyr::separate` and `tidyr::separate_rows`.

---

## Other handy low-level string manipulations

* Change case with `str_to_lower()`, `str_to_upper()`, `str_to_title()`
* Remove leading/trailing whitespace with `str_trim()`, or collapse repeated internal whitespace with `str_squish()`
* Wrap long lines with `stringr::str_wrap()`
* Truncate long strings with `str_trunc()`, or abbreviate them with `base::abbreviate()`

---

## Application Exercise

.hand[Let's try.]

Read some data


```r
data = read_csv("l10/ae/data/sample_sheet2.csv")
data
```

```
## # A tibble: 24 x 6
## library_id folder_name treat~1 sampl~2 tissu~3 dataset
## <dbl> <chr> <dbl> <chr> <chr> <chr>
## 1 1 Sample_300_0150_PBL~ 300 300_01~ PBL 300_01~
## 2 2 Sample_300_0150_Syn~ 300 300_01~ Syn 300_01~
## 3 4 Sample_300_0171_PBL~ 300 300_01~ PBL 300_01~
## 4 5 Sample_300_0171_Syn~ 300 300_01~ Syn 300_01~
## 5 6 Sample_300_0173_PBL~ 300 300_01~ PBL 300_01~
## 6 7 Sample_300_0173_Syn~ 300 300_01~ Syn 300_01~
## 7 8 Sample_300_0174_PBL~ 300 300_01~ PBL 300_01~
## 8 9 Sample_300_0174_Syn~ 300 300_01~ Syn 300_01~
## # ... with 16 more rows, and abbreviated variable names
## # 1: treatment, 2: sample_ID, 3: tissue_source
```

---

Split the `folder_name` field by the '_' character.
This will give you a character `matrix`.

--


```r
char_matrix = str_split_fixed(data$folder_name, pattern = "_", n = 6)
char_matrix
```

```
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] "Sample" "300" "0150" "PBL" "BT" "5"
## [2,] "Sample" "300" "0150" "Syn" "BT" "5"
## [3,] "Sample" "300" "0171" "PBL" "BT" "5"
## [4,] "Sample" "300" "0171" "Syn" "BT" "5"
## [5,] "Sample" "300" "0173" "PBL" "BT" "5"
## [6,] "Sample" "300" "0173" "Syn" "BT" "5"
## [7,] "Sample" "300" "0174" "PBL" "BT" "5"
## [8,] "Sample" "300" "0174" "Syn" "BT" "5"
## [9,] "Sample" "300" "0392" "PBL" "BT" "5"
## [10,] "Sample" "300" "0392" "Syn" "BT" "5"
## [11,] "Sample" "300" "0410" "PBL" "BT" "5"
## [12,] "Sample" "300" "0410" "Syn" "BT" "5"
## [13,] "Sample" "300" "0414" "PBL" "BT" "5"
## [14,] "Sample" "300" "0414" "Syn" "BT" "5"
## [15,] "Sample" "300" "0415" "Syn" "BT" "5"
## [16,] "Sample" "300" "0416" "Syn" "BT" "5"
## [17,] "Sample" "300" "1883" "Syn" "BT" "5"
## [18,] "Sample" "300" "1930" "PBL" "BT" "5"
## [19,] "Sample" "300" "1930" "Syn" "BT" "5"
## [20,] "Sample" "301" "0174" "PBL" "BT" "5"
## [21,] "Sample" "301" "0174" "Syn" "BT" "5"
## [22,] "Sample" "301" "0174" "Syn" "PC" "5"
## [23,] "Sample" "301" "0270" "PBL" "BT" "5"
## [24,] "Sample" "301" "0270" "Syn" "BT" "5"
```

---

Using this and `str_c`, recreate the fields `treatment`, `sample_ID`, `tissue_source`. You can extract various columns using `char_matrix[,integer_of_the_column_you_want]`.

--


```r
data %>%
  select(folder_name) %>%
  mutate(treatment = char_matrix[,2],
         sample_ID = str_c(char_matrix[,2], "_", char_matrix[,3]),
         tissue_source = char_matrix[,4])
```

```
## # A tibble: 24 x 4
## folder_name treatment sample_ID tissue_source
## <chr> <chr> <chr> <chr>
## 1 Sample_300_0150_PBL_BT_5 300 300_0150 PBL
## 2 Sample_300_0150_Syn_BT_5 300 300_0150 Syn
## 3 Sample_300_0171_PBL_BT_5 300 300_0171 PBL
## 4 Sample_300_0171_Syn_BT_5 300 300_0171 Syn
## 5 Sample_300_0173_PBL_BT_5 300 300_0173 PBL
## 6 Sample_300_0173_Syn_BT_5 300 300_0173 Syn
## 7 Sample_300_0174_PBL_BT_5 300 300_0174 PBL
## 8 Sample_300_0174_Syn_BT_5 300 300_0174 Syn
## # ... with 16 more rows
```

---

Convert the `dataset` field to be entirely lowercase with `str_to_lower`.

--


```r
data %>%
  select(dataset) %>%
  mutate(new_data = str_to_lower(dataset))
```

```
## # A tibble: 24 x 2
## dataset new_data
## <chr> <chr>
## 1 300_0150_PBL_1 300_0150_pbl_1
## 2 300_0150_Syn_2 300_0150_syn_2
## 3 300_0171_PBL_4 300_0171_pbl_4
## 4 300_0171_Syn_5 300_0171_syn_5
## 5 300_0173_PBL_6 300_0173_pbl_6
## 6 300_0173_Syn_7 300_0173_syn_7
## 7 300_0174_PBL_8 300_0174_pbl_8
## 8 300_0174_Syn_9 300_0174_syn_9
## # ... with 16 more rows
```

---

Collapse the `sample_ID` field into a single character vector (length 1) separated by semicolons ";" using `str_c(..., collapse = ...)`.

--


```r
str_c(data$sample_ID, collapse=";")
```

```
## [1] "300_0150;300_0150;300_0171;300_0171;300_0173;300_0173;300_0174;300_0174;300_0392;300_0392;300_0410;300_0410;300_0414;300_0414;300_0415;300_0416;300_1883;300_1930;300_1930;301_0174;301_0174;301_0174;301_0270;301_0270"
```

---

class: middle

.hand[Regular Expressions]

---

## Regular expressions

> Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

-- Jamie Zawinski (who coined the name Mozilla)

* Are like *find-replace*, *wildcards* \*
* Are found in nearly every computer language
* Can be just the ticket to solving *some* problems

---

class: bg-green

## Syntax

Write what you want to match (if it's alphanumeric).

```r
lunch = c("one app", "two appetizers", "three apples")
str_view_all(lunch, 'apple')
```
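The interactive `str_view_all()` widget doesn't render in these notes, so here is a plain-console check of the same pattern (a sketch; expected output shown as comments):


```r
# which elements contain the literal string 'apple'?
str_detect(lunch, 'apple')
# [1] FALSE FALSE  TRUE    (only "three apples" matches)
```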
---

class: bg-green

## Match multiple things: wildcard

`.` is a generic wildcard: it matches any single character.


```r
str_view_all(lunch, 'app.')
```
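Again, the widget output isn't shown here; a `str_detect()` sketch of the same wildcard pattern:


```r
# 'app.' needs one more character after "app", so the bare
# "app" at the end of "one app" does not match
str_detect(lunch, 'app.')
# [1] FALSE  TRUE  TRUE
```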
---

class: bg-green

## Match multiple things: character class

`[<set>]` is a character class: it matches any single character in `<set>`. Specify a range of characters with `[<char>-<char>]`. Invert a class with `[^<set>]`.


```r
str_view_all(lunch, 'app[le]')
```
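A console sketch of the same class, using `str_extract()` to show exactly what matched:


```r
# 'app[le]' requires an 'l' or an 'e' immediately after "app"
str_extract(lunch, 'app[le]')
# [1] NA     "appe" "appl"
```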
---

class: bg-green

## Match multiple things: disjunction

`(<x>|<y>)` is a disjunction: it matches `<x>` or `<y>`.


```r
str_view_all(lunch, 'app(le|etizer)s')
```
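A sketch of what the disjunction captures:


```r
# the parentheses group the alternatives, so the trailing 's'
# applies to both "appetizers" and "apples"
str_extract(lunch, 'app(le|etizer)s')
# [1] NA           "appetizers" "apples"
```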
---

class: bg-green

## Quantifiers modify matches

1. `*` zero or more matches
2. `?` zero or one matches
3. `+` one or more matches
4. `{min,max}` matches between min and max times

Compare back to `"app."`, which didn't match the first string.


```r
str_view_all(lunch, 'app.*')
```
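A sketch showing how the quantifier rescues the first string:


```r
# '.*' means zero or more characters, so "one app" now matches too
str_extract(lunch, 'app.*')
# [1] "app"        "appetizers" "apples"
```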
---

## Match without consuming with zero-width identifiers

* `^` matches a zero-width "character" present at the start of all lines.
* `$` is the analogous character at the end.
* `\b` sits between "words".

For example, the string `red tired` can be thought of as

.darkgreen[^\b]red.darkgreen[\b] .darkgreen[\b]tired.darkgreen[\b$]

---

class: bg-green

## Require word boundary

We must double the `\` to keep R from interpreting it as an escape character.


```r
str_view_all("red tired", "\\bred\\b")
```
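A sketch of what actually matches with the boundaries in place:


```r
# only the standalone word matches; the "red" inside "tired" is skipped
str_extract_all("red tired", "\\bred\\b")
# [[1]]
# [1] "red"
```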
---

class: bg-green

## Match unconditionally


```r
str_view_all("red tired", "red")
```
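Compare the match counts with and without the boundaries (a sketch):


```r
# without \\b both occurrences match, including the one inside "tired"
str_count("red tired", "red")
# [1] 2
str_count("red tired", "\\bred\\b")
# [1] 1
```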
---

## Using regular expressions

* Test for a match with `str_detect()`.
* Return the first `str_extract()` or all `str_extract_all()` matching portions of a string.
* Return the first `str_match()` or all `str_match_all()` matching portions of a string **and capture groups**.
* Replace the first `str_replace()` or all `str_replace_all()` matching portions of a string and capture groups.

---

## `str_detect()`


```r
str_detect(string = c("A", "AA", "AB", "B"), pattern = "A")
```

```
## [1] TRUE TRUE TRUE FALSE
```

```r
str_detect(string = lunch, pattern = 'app.')
```

```
## [1] FALSE TRUE TRUE
```

---

## `str_extract()`


```r
feline = c("The fur of cats goes by many names.",
           "Infimum (the cat) is a cat with a most baleful meow.",
           "Dog.")
str_extract(string = feline, pattern = "cat")
```

```
## [1] "cat" "cat" NA
```

```r
str_extract_all(string = feline, pattern = "cat")
```

```
## [[1]]
## [1] "cat"
## 
## [[2]]
## [1] "cat" "cat"
## 
## [[3]]
## character(0)
```

---

## `str_match()`

For simple queries, behaves like `str_extract()`


```r
feline = c("The fur of cats goes by many names.",
           "Infimum (the cat) is a cat with a most baleful meow.",
           "Dog.")
str_match(feline, "cat")
```

```
## [,1]
## [1,] "cat"
## [2,] "cat"
## [3,] NA
```

---

## `str_match()`

But returns **capture groups** `(<expression>)` separately.


```r
feline = c("The fur of cats goes by many names.",
           "Infimum (the cat) is a cat with a most baleful meow.",
           "Dog.")
str_match(feline, "(\\w*) cat.? (\\w*)")
```

```
## [,1] [,2] [,3]
## [1,] "of cats goes" "of" "goes"
## [2,] "the cat) is" "the" "is"
## [3,] NA NA NA
```

`\w = [A-Za-z0-9_]`. We must double the `\` to keep R from interpreting it as an escape character.

---

## `str_match_all()`


```r
feline = c("The fur of cats goes by many names.",
           "Infimum (the cat) is a cat with a most baleful meow.",
           "Dog.")
str_match_all(feline, "(\\w*) cat.? (\\w*)")
```

```
## [[1]]
## [,1] [,2] [,3]
## [1,] "of cats goes" "of" "goes"
## 
## [[2]]
## [,1] [,2] [,3]
## [1,] "the cat) is" "the" "is"
## [2,] "a cat with" "a" "with"
## 
## [[3]]
## [,1] [,2] [,3]
```

---

## `str_replace()`


```r
feline = c("The fur of cats goes by many names.",
           "Infimum (the cat) is a cat with a most baleful meow.",
           "Dog.")
str_replace(feline, "cat", "murder machine")
```

```
## [1] "The fur of murder machines goes by many names."
## [2] "Infimum (the murder machine) is a cat with a most baleful meow."
## [3] "Dog."
```

---

## `str_replace_all()`


```r
feline = c("The fur of cats goes by many names.",
           "Infimum (the cat) is a cat with a most baleful meow.",
           "Dog.")
str_replace_all(feline, "cat", "murder machine")
```

```
## [1] "The fur of murder machines goes by many names."
## [2] "Infimum (the murder machine) is a murder machine with a most baleful meow."
## [3] "Dog."
```

---

## `str_replace()` can also use capture groups

Use `\1` to refer to the first capture group, `\2` for the second, etc. Note the `\\` because `\` must be escaped in R.


```r
str_replace_all(feline, "(\\w*)", "\\1\\1")
```

```
## [1] "TheThe furfur ofof catscats goesgoes byby manymany namesnames."
## [2] "InfimumInfimum (thethe catcat) isis aa catcat withwith aa mostmost balefulbaleful meowmeow."
## [3] "DogDog."
```

Actually, any regular expression can use a capture group, both for matching and replacing!

---

## Application exercise

.hand[Let's try].

If devtools works for you, you might also experiment with installing `devtools::install_github("gadenbuie/regexplain")`. See https://www.garrickadenbuie.com/project/regexplain/ for example usage.
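As a warm-up, here's the claim above that capture groups also work during matching: a backreference `\\1` inside the pattern requires the captured text to repeat (a sketch; the example vector is invented for illustration):


```r
# "(\\w\\w)\\1" captures two word characters, then demands the
# same two characters again via the backreference \\1
str_detect(c("haha", "abab", "abcd"), "(\\w\\w)\\1")
# [1]  TRUE  TRUE FALSE
```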
We'll use the dataset `words`, which is loaded automatically when you load the `stringr` package. It contains 980 common English words.


```r
word_df = tibble(word = words)
```

---

How many words contain an "x" anywhere in them? List them.

--


```r
filter(word_df, str_detect(word, "x"))
```

```
## # A tibble: 17 x 1
## word
## <chr>
## 1 box
## 2 exact
## 3 example
## 4 except
## 5 excuse
## 6 exercise
## 7 exist
## 8 expect
## # ... with 9 more rows
```

---

How many words end in "x"? List them. Use `$` to match the end of the string.

--


```r
filter(word_df, str_detect(word, "x$"))
```

```
## # A tibble: 4 x 1
## word
## <chr>
## 1 box
## 2 sex
## 3 six
## 4 tax
```

---

Do any words start with "x"? Use `^` to match the start of the string.

--


```r
filter(word_df, str_detect(word, "^x"))
```

```
## # A tibble: 0 x 1
## # ... with 1 variable: word <chr>
```

---

Using wildcards `.` and quantifiers `+` (rather than the results of the previous exercises), find all the words that contain "x" in the interior (but not at the start or end). Check that the numbers of results from the previous exercises are consistent with each other.

--


```r
filter(word_df, str_detect(word, '.+x.+'))
```

```
## # A tibble: 13 x 1
## word
## <chr>
## 1 exact
## 2 example
## 3 except
## 4 excuse
## 5 exercise
## 6 exist
## 7 expect
## 8 expense
## # ... with 5 more rows
```

---

On average, how many vowels are there per word? (Hint: use `str_count` and `[]` to define a character class.) What is the average number of vowels per letter (the vowel count normalized by word length)?

--


```r
word_df %>%
  mutate(vowels = str_count(word, "[aeiouAEIOU]"),
         vowels_per_letter = vowels/nchar(word)) %>%
  summarise(mean(vowels), mean(vowels_per_letter))
```

```
## # A tibble: 1 x 2
## `mean(vowels)` `mean(vowels_per_letter)`
## <dbl> <dbl>
## 1 1.99 0.380
```

---

List all the words with three or more vowels in a row. Use `{min_matches,max_matches}` as a quantifier.

--


```r
filter(word_df, str_detect(word, '[aeiou]{3,10}'))
```

```
## # A tibble: 6 x 1
## word
## <chr>
## 1 beauty
## 2 obvious
## 3 previous
## 4 quiet
## 5 serious
## 6 various
```

---

## Sentences

Now consider the `sentences` data set:


```r
sentence_df = tibble(sentence = sentences)
sentence_df
```

```
## # A tibble: 720 x 1
## sentence
## <chr>
## 1 The birch canoe slid on the smooth planks.
## 2 Glue the sheet to the dark blue background.
## 3 It's easy to tell the depth of a well.
## 4 These days a chicken leg is a rare dish.
## 5 Rice is often served in round bowls.
## 6 The juice of lemons makes fine punch.
## 7 The box was thrown beside the parked truck.
## 8 The hogs were fed chopped corn and garbage.
## # ... with 712 more rows
```

---

Extract the first word from each sentence. Hint: negate the space character class "[ ]" to match everything except a space.

--


```r
sentence_df %>%
  mutate(first_word = str_extract(sentence, '\\w*')) %>%
  head(3) # '\\w' doesn't include apostrophes
```

```
## # A tibble: 3 x 2
## sentence first_word
## <chr> <chr>
## 1 The birch canoe slid on the smooth planks. The
## 2 Glue the sheet to the dark blue background. Glue
## 3 It's easy to tell the depth of a well. It
```

```r
sentence_df %>%
  mutate(first_word = str_extract(sentence, '([^ ])*'))
```

```
## # A tibble: 720 x 2
## sentence first_word
## <chr> <chr>
## 1 The birch canoe slid on the smooth planks. The
## 2 Glue the sheet to the dark blue background. Glue
## 3 It's easy to tell the depth of a well. It's
## 4 These days a chicken leg is a rare dish. These
## 5 Rice is often served in round bowls. Rice
## 6 The juice of lemons makes fine punch. The
## 7 The box was thrown beside the parked truck. The
## 8 The hogs were fed chopped corn and garbage. The
## # ... with 712 more rows
```

---

Return all the sentences that contain the colors "red", "blue" or "green". Use the `|` disjunction.

--


```r
filter(sentence_df, str_detect(sentence, ("red|blue|green")))
```

```
## # A tibble: 54 x 1
## sentence
## <chr>
## 1 Glue the sheet to the dark blue background.
## 2 Two blue fish swam in the tank.
## 3 The colt reared and threw the tall rider.
## 4 The wide road shimmered in the hot sun.
## 5 See the cat glaring at the scared mouse.
## 6 A wisp of cloud hung in the blue air.
## 7 He ordered peach pie with ice cream.
## 8 Pure bred poodles have curls.
## # ... with 46 more rows
```

---

Extract the first word ending in "s". Use a capture group `()`, `str_match()` and the everything-but-space `[^ ]` character class.

--


```r
sentence_df = sentence_df %>%
  mutate(ends_in_s = str_match(sentence, "([^ ]*s)\\b")[,2])
sentence_df
```

```
## # A tibble: 720 x 2
## sentence ends_in_s
## <chr> <chr>
## 1 The birch canoe slid on the smooth planks. planks
## 2 Glue the sheet to the dark blue background. <NA>
## 3 It's easy to tell the depth of a well. It's
## 4 These days a chicken leg is a rare dish. days
## 5 Rice is often served in round bowls. is
## 6 The juice of lemons makes fine punch. lemons
## 7 The box was thrown beside the parked truck. was
## 8 The hogs were fed chopped corn and garbage. hogs
## # ... with 712 more rows
```

---

(Stretch goal) Notice that two questions ago, we also matched the sentence

> The colt reared and threw the tall rider.

because "reared" contains "red". Fix the regular expression so that it matches only complete words, not fragments, using the `\b` word boundary marker. Hint: use "\\\\b" to keep the backslash from being used as an escape character.

--


```r
filter(sentence_df, str_detect(sentence, ("\\bred\\b|\\bblue\\b|\\bgreen\\b")))
```

```
## # A tibble: 26 x 2
## sentence ends_in_s
## <chr> <chr>
## 1 Glue the sheet to the dark blue background. <NA>
## 2 Two blue fish swam in the tank. <NA>
## 3 A wisp of cloud hung in the blue air. <NA>
## 4 The spot on the blotter was made by green ink. was
## 5 The sofa cushion is red and of light weight. is
## 6 The sky that morning was clear and bright blue. was
## 7 A blue crane is a tall wading bird. is
## 8 It is hard to erase blue or red ink. is
## # ... with 18 more rows
```

<!-- --- -->
<!-- class: middle -->
<!--  -->
<!-- Text mining using `tidytext` -->
<!-- --- -->
<!-- ## Text mining using `tidytext` -->
<!-- Text is inherently high-dimensional and noisy data. We could spend weeks on this. Instead, we'll have to be content to know what we don't know: -->
<!-- * Sampling text data and its potential ascertainment biases -->
<!-- * Handling non-Roman (ASCII) characters -->
<!-- * Parsing into tokens -->
<!-- * Filtering low-content words -->
<!-- * Dimension reduction, e.g., latent Dirichlet allocation or non-negative matrix factorization -->
<!-- * Embeddings using pre-trained neural networks -->
<!-- Julia Silge has [one book on classical text mining](https://www.tidytextmining.com/) and [another on machine learning on text](https://smltar.com/). -->
<!-- --- -->
<!-- ## Most important functionality -->
<!-- * `unnest_tokens()` split a string into tokens (words, bi-grams, etc) as a data frame -->
<!-- * `bind_tf_idf` calculate term and inverse-document frequencies. -->
<!-- * `cast_sparse` convert to a (sparse) document-term matrix. -->
<!-- --- -->
<!-- ## Austen vs Kafka -->
<!-- ```{r} -->
<!-- library(tidytext) -->
<!-- book_names = tibble(gutenberg_id = c(158, 1342, 5200, 7849), -->
<!--                     title = c('Emma', 'Pride and prejudice', -->
<!--                               'Metamorphosis', 'The Trial')) -->
<!-- books = gutenbergr::gutenberg_download(book_names$gutenberg_id) %>% left_join(book_names) -->
<!-- ``` -->
<!-- .scroll-box-10[ -->
<!-- ```{r, output.lines = 24} -->
<!-- books %>% group_by(title) %>% slice_head(n=6) -->
<!-- ``` -->
<!-- ] -->
<!-- --- -->
<!-- ## Get words -->
<!-- ```{r} -->
<!-- book_words = unnest_tokens(books, text, output = 'word', drop = TRUE) -->
<!-- book_words -->
<!-- ``` -->
<!-- --- -->
<!-- ## Count words by book -->
<!-- ```{r} -->
<!-- word_counts = book_words %>% -->
<!--   group_by(title) %>% count(title, word) %>% -->
<!--   arrange(desc(n)) -->
<!-- word_counts %>% slice_head(n = 3) -->
<!-- ``` -->
<!-- --- -->
<!-- ## Remove "stop" words -->
<!-- Stop words are common, low-semantic value words. Sometimes useful to remove. -->
<!-- ```{r} -->
<!-- word_counts %>% anti_join(get_stopwords()) %>% slice_head(n = 3) -->
<!-- ``` -->
<!-- --- -->
<!-- ## Term frequency in Kafka vs Austen -->
<!-- ```{r} -->
<!-- total_words = word_counts %>% -->
<!--   group_by(title) %>% -->
<!--   summarize(total = sum(n)) -->
<!-- word_counts = left_join(word_counts, total_words) -->
<!-- word_counts -->
<!-- ``` -->
<!-- --- -->
<!-- ## Term frequency in Kafka vs Austen -->
<!-- ```{r plottf, warning = FALSE} -->
<!-- ggplot(word_counts, aes(n/total)) + -->
<!--   geom_histogram(show.legend = FALSE) + -->
<!--   xlim(NA, 0.0009) + -->
<!--   facet_wrap(~title, ncol = 2, scales = "free_y") + theme_minimal() -->
<!-- ``` -->
<!-- --- -->
<!-- ## Zipf's law -->
<!-- Distributions like those on the previous slide are typical in language. A classic version of this relationship is called Zipf's law. -->
<!-- > Zipf's law states that the frequency that a word appears is inversely proportional to its rank. -->
<!-- --- -->
<!-- ## Zipf's law -->
<!-- .panelset[ -->
<!-- .panel[.panel-name[Code] -->
<!-- ```{r freq_by_rank, fig.show='hide'} -->
<!-- freq_by_rank = word_counts %>% -->
<!--   group_by(title) %>% -->
<!--   mutate(rank = row_number(), -->
<!--          `term frequency` = n/total) %>% -->
<!--   ungroup() -->
<!-- freq_by_rank %>% -->
<!--   ggplot(aes(x = rank, y = `term frequency`, color = title)) + -->
<!--   geom_abline(intercept = -0.62, slope = -1, -->
<!--               color = "gray50", linetype = 2) + -->
<!--   geom_line(size = 1.1, alpha = 0.8, show.legend = FALSE) + -->
<!--   scale_x_log10() + -->
<!--   scale_y_log10() + -->
<!--   theme_minimal() -->
<!-- ``` -->
<!-- ] -->
<!-- .panel[.panel-name[Plot] -->
<!-- ```{r ref.label = 'freq_by_rank', echo = FALSE} -->
<!-- ``` -->
<!-- ] -->
<!-- ] -->
<!-- --- -->
<!-- ## Sentiment analysis -->
<!-- ```{r} -->
<!-- word_sentiments = word_counts %>% -->
<!--   left_join(sentiments) %>% #<< -->
<!--   filter(!is.na(sentiment)) %>% -->
<!--   group_by(title) %>% -->
<!--   mutate(word_collapse = fct_lump_n(word, n = 10, w = n), -->
<!--          word_collapse = fct_reorder(word_collapse, n, sum)) %>% -->
<!--   select(title, word_collapse, sentiment, n) -->
<!-- word_sentiments -->
<!-- ``` -->
<!-- --- -->
<!-- ## Which is more happy? -->
<!-- ```{r} -->
<!-- ggplot(word_sentiments, aes(y = fct_reorder(word_collapse, n, .fun = sum), x = n, fill = sentiment)) + geom_col() + facet_wrap(~title, scales = 'free_x') + ylab("Word") + xlab("Occurrence") + theme_minimal() -->
<!-- ``` -->
<!-- --- -->
<!-- ## Term frequency and inverse document frequency -->
<!-- The inverse document frequency is -->
<!-- `$$\text{idf}(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}$$` -->
<!-- The IDF thus ranges from 0 for words that appear in every document up to `\(log(n)\)` for a word unique across documents. -->
<!-- The term frequency is just the word counts, normalized to the number of words per text, so the popular TF-IDF<sup>1</sup> metric is just -->
<!-- `$$\text{tf-idf}(\text{term}) = \text{idf}(\text{term}) \times \text{tf}(\text{term})$$` -->
<!-- .footnote[[1] Popular, and curiously devoid of an obvious statistical model. [Some attempts to link to information theory](https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Justification_of_idf) have been made.] -->
<!-- --- -->
<!-- ## Calculate TF-IDF -->
<!-- ```{r calc-tf, output.lines = 12} -->
<!-- word_counts = word_counts %>% bind_tf_idf(word, title, n) -->
<!-- word_counts -->
<!-- ``` -->
<!-- --- -->
<!-- ## TF-IDF of Kafka and Austen -->
<!-- This works relatively well to identify signature words -- some represent content, some represent author style (e.g. contractions used by Kafka) -->
<!-- ```{r, echo = FALSE} -->
<!-- word_counts %>% group_by(title) %>% slice_max(tf_idf, n = 15) %>% ungroup() %>% -->
<!--   mutate(word = reorder(word, tf_idf)) %>% -->
<!--   ggplot(aes(tf_idf, word)) + -->
<!--   geom_col(show.legend = FALSE) + -->
<!--   labs(x = "tf-idf", y = NULL) + -->
<!--   facet_wrap(~title, ncol = 2, scales = "free") + -->
<!--   theme_minimal() -->
<!-- ``` -->
<!-- --- -->
<!-- ## Occurrence matrix -->
<!-- Lastly, we might want to convert our counts to an occurrence matrix `\(\mathbf X = [x_{ij}]\)` where `\(x_{ij}\)` is the number of times document `\(i\)` contains term `\(j\)`. -->
<!-- Most `\(x_{ij}\)` will be zero, reflecting Zipf's law. We will almost always want to store it in a special format called a .alert[sparse matrix], that only stores the non-zero entries and their index in the matrix. -->
<!-- --- -->
<!-- ## `cast_sparse()` -->
<!-- ```{r} -->
<!-- X = cast_sparse(word_counts, title, word, n) -->
<!-- class(X) -->
<!-- dim(X) -->
<!-- sum(X>0) -->
<!-- ``` -->
<!-- This is useful for downstream modeling, such as latent Dirichlet allocation. -->
<!-- --- -->
<!-- # Resources -->
<!-- Julia Silge has [one book on classical text mining](https://www.tidytextmining.com/) and [another on machine learning on text](https://smltar.com/). -->