class: ur-title, center, middle, title-slide

.title[
# BST430 Lecture 10
]

.subtitle[
## Text Analysis
]

.author[
### Tanzy Love, based on the course by Andrew McDavid
]

.institute[
### U of Rochester
]

.date[
### 2021-09-26 (updated: 2024-10-10 by TL)
]

---

Here's the [R code in this lecture](l10/l10-text-ii.R)

Here's the [books](l10/data/books.csv)

*You have to have a good file path to each dataset.*

---
class: middle

Text mining using `tidytext`

---

## Text mining using `tidytext`

Text is inherently high-dimensional and noisy data. We could spend weeks on this; instead, we'll have to be content to know what we don't know:

* Sampling text data and its potential ascertainment biases
* Handling non-Roman (non-ASCII) characters
* Parsing into tokens
* Filtering low-content words
* Dimension reduction, e.g., latent Dirichlet allocation or non-negative matrix factorization
* Embeddings using pre-trained neural networks

Julia Silge has [one book on classical text mining](https://www.tidytextmining.com/) and [another on machine learning on text](https://smltar.com/).

---

## Most important functionality

* `unnest_tokens()`: split a string into tokens (words, bi-grams, etc.) as a data frame
* `bind_tf_idf()`: calculate term and inverse-document frequencies
* `cast_sparse()`: convert to a (sparse) document-term matrix

---

## Austen vs Kafka

``` r
library(tidyverse) # dplyr, ggplot2, and forcats are used throughout
library(tidytext); library(stopwords); library(gutenbergr)
book_names = tibble(gutenberg_id = c(158, 105, 5200, 7849),
                    title = c('Emma', 'Persuasion', 'Metamorphosis', 'The Trial'))
books = gutenberg_download(book_names$gutenberg_id) %>% left_join(book_names)
table(books$title)
```

```
## 
##          Emma Metamorphosis    Persuasion     The Trial 
##         16488          1872          8357          6677
```

---

## Austen vs Kafka - text

.scroll-box-15[

``` r
books %>% group_by(title) %>% slice_head(n = 6)
```

```
## # A tibble: 24 × 3
## # Groups:   title [4]
##    gutenberg_id text             title        
##           <dbl> <chr>            <chr>        
##  1          158 "Emma"           Emma         
##  2          158 ""               Emma         
##  3          158 "by Jane Austen" Emma         
##  4          158 ""               Emma         
##  5          158 ""               Emma         
##  6          158 "Contents"       Emma         
##  7         5200 "Metamorphosis"  Metamorphosis
##  8         5200 ""               Metamorphosis
## # ℹ 16 more rows
```
]

---

## Get words

``` r
book_words = unnest_tokens(books, text, output = 'word', drop = TRUE) # drop = TRUE removes the original text column
book_words
```

```
## # A tibble: 351,095 × 3
##    gutenberg_id title      word      
##           <dbl> <chr>      <chr>     
##  1          105 Persuasion persuasion
##  2          105 Persuasion by        
##  3          105 Persuasion jane      
##  4          105 Persuasion austen    
##  5          105 Persuasion 1818      
##  6          105 Persuasion contents  
##  7          105 Persuasion chapter   
##  8          105 Persuasion i         
## # ℹ 351,087 more rows
```

---

## Count words by book

``` r
word_counts = book_words %>%
  group_by(title) %>%
  count(title, word) %>%
  arrange(desc(n))
word_counts %>% slice_head(n = 4)
```

```
## # A tibble: 16 × 3
## # Groups:   title [4]
##    title         word      n
##    <chr>         <chr> <int>
##  1 Emma          to     5238
##  2 Emma          the    5201
##  3 Emma          and    4896
##  4 Emma          of     4291
##  5 Metamorphosis the    1148
##  6 Metamorphosis to      753
##  7 Metamorphosis and     642
##  8 Metamorphosis he      577
## # ℹ 8 more rows
```

---

## Remove "stop" words

Stop words are common, low-semantic-value words. It is sometimes useful to remove them.

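To see what is being removed, here is a small sketch (the English Snowball lexicon shown here is `get_stopwords()`'s default; other sources such as `"smart"` are available through the stopwords package):

``` r
# get_stopwords() returns a tibble with columns `word` and `lexicon`
get_stopwords(language = "en", source = "snowball") %>%
  slice_head(n = 5)
```

`anti_join()` then drops every token that appears in that list:
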
``` r
word_counts %>%
  anti_join(get_stopwords()) %>%
  slice_head(n = 4)
```

```
## # A tibble: 16 × 3
## # Groups:   title [4]
##    title         word         n
##    <chr>         <chr>    <int>
##  1 Emma          mr        1153
##  2 Emma          emma       786
##  3 Emma          mrs        699
##  4 Emma          miss       599
##  5 Metamorphosis gregor     199
##  6 Metamorphosis room       131
##  7 Metamorphosis gregor’s    99
##  8 Metamorphosis father      96
## # ℹ 8 more rows
```

---

## Term frequency in Kafka vs Austen

``` r
total_words = word_counts %>%
  group_by(title) %>%
  summarize(total = sum(n))
word_counts = left_join(word_counts, total_words)
word_counts
```

```
## # A tibble: 20,760 × 4
## # Groups:   title [4]
##    title      word      n  total
##    <chr>      <chr> <int>  <int>
##  1 Emma       to     5238 161113
##  2 Emma       the    5201 161113
##  3 Emma       and    4896 161113
##  4 The Trial  the    4726  84219
##  5 Emma       of     4291 161113
##  6 Persuasion the    3327  83707
##  7 Emma       i      3181 161113
##  8 Emma       a      3125 161113
## # ℹ 20,752 more rows
```

---

## Term frequency in Kafka vs Austen

``` r
ggplot(word_counts, aes(x = n/total)) +
  geom_histogram(show.legend = FALSE) +
  xlim(NA, 0.0009) +
  facet_wrap(~title, scales = "free_y") +
  theme_minimal()
```

<img src="l10-text-ii_files/figure-html/plottf-1.png" width="60%" style="display: block; margin: auto;" />

---

## Zipf's law

Distributions like those on the previous slide are typical in language. A classic version of this relationship is called Zipf's law.

> Zipf's law states that the frequency with which a word appears is inversely proportional to its rank.

---

## Zipf's law

.panelset[
.panel[.panel-name[Code]

``` r
freq_by_rank = word_counts %>%
  group_by(title) %>%
  mutate(rank = row_number(),
         `term frequency` = n/total) %>%
  ungroup()

freq_by_rank %>%
  ggplot(aes(x = rank, y = `term frequency`, color = title)) +
  geom_abline(intercept = -0.62, slope = -1, color = "gray50", linetype = 2) +
  geom_line(size = 1.1, alpha = 0.8, show.legend = FALSE) +
  scale_x_log10() +
  scale_y_log10() +
  theme_minimal()
```
]

.panel[.panel-name[Plot]

<img src="l10-text-ii_files/figure-html/unnamed-chunk-9-1.png" width="60%" style="display: block; margin: auto;" />
]
]

---

## Sentiment analysis

``` r
sentiments
```

```
## # A tibble: 6,786 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
## # ℹ 6,778 more rows
```

---

## Add sentiments to words

``` r
word_sentiments = word_counts %>%
  mutate(counts = n) %>%
  left_join(sentiments, relationship = "many-to-many") %>% # a word can match more than one sentiment
  filter(!is.na(sentiment)) %>% # word_counts is still grouped by title
  mutate(word_collapse = fct_lump_n(word, n = 10, w = counts)) %>% # lump all but the 10 most frequent words into "Other"; `w = counts` supplies the weights
  select(title, word_collapse, sentiment, counts)
word_sentiments
```

```
## # A tibble: 4,269 × 4
## # Groups:   title [4]
##    title      word_collapse sentiment counts
##    <chr>      <fct>         <chr>      <int>
##  1 Emma       miss          negative     599
##  2 Emma       well          positive     401
##  3 Emma       good          positive     359
##  4 Emma       great         positive     264
##  5 The Trial  like          positive     212
##  6 Emma       like          positive     200
##  7 Persuasion good          positive     187
##  8 Emma       better        positive     173
## # ℹ 4,261 more rows
```

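---

## Aside: a single sentiment score (sketch)

One way to put a number on "happiness" is to difference the positive and negative counts for each book. This slide is an illustrative sketch, not part of the original analysis; dividing by `total` (the per-book word count computed earlier) is an assumption made so that books of different lengths are comparable.

``` r
# Net sentiment per book: (positive - negative) occurrences, scaled by book length
word_sentiments %>%
  group_by(title, sentiment) %>%
  summarize(n = sum(counts), .groups = "drop") %>%
  pivot_wider(names_from = sentiment, values_from = n) %>%
  left_join(total_words) %>%
  mutate(net_sentiment = (positive - negative) / total)
```

A positive `net_sentiment` means positive words outnumber negative ones; the next slide shows the same comparison word by word.
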
---

## Which is more happy?

``` r
ggplot(word_sentiments,
       aes(y = fct_reorder(word_collapse, counts, .fun = sum), x = counts, fill = sentiment)) +
  geom_col() +
  facet_wrap(~title, scales = 'free_x') +
  ylab("Word") + xlab("Occurrence") +
  theme_minimal()
```

<img src="l10-text-ii_files/figure-html/sentiment_comparison-1.png" width="60%" style="display: block; margin: auto;" />

---

## Term frequency and inverse document frequency

The inverse document frequency is

`$$\text{idf}(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}$$`

The IDF thus ranges from 0, for words that appear in every document, up to `\(\ln(n_{\text{documents}})\)`, for a word that appears in only one document.

The term frequency is just the word count, normalized by the number of words in each text, so the popular TF-IDF<sup>1</sup> metric is

`$$\text{tf-idf}(\text{term}) = \text{idf}(\text{term}) \times \text{tf}(\text{term})$$`

.footnote[[1] Popular, and curiously devoid of an obvious statistical model. [Some attempts to link it to information theory](https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Justification_of_idf) have been made.]

---

## Calculate TF-IDF

``` r
word_counts = word_counts %>%
  bind_tf_idf(word, title, n)
word_counts
```

```
## # A tibble: 20,760 × 7
## # Groups:   title [4]
##    title      word      n  total     tf   idf tf_idf
##    <chr>      <chr> <int>  <int>  <dbl> <dbl>  <dbl>
##  1 Emma       to     5238 161113 0.0325     0      0
##  2 Emma       the    5201 161113 0.0323     0      0
##  3 Emma       and    4896 161113 0.0304     0      0
##  4 The Trial  the    4726  84219 0.0561     0      0
##  5 Emma       of     4291 161113 0.0266     0      0
##  6 Persuasion the    3327  83707 0.0397     0      0
##  7 Emma       i      3181 161113 0.0197     0      0
##  8 Emma       a      3125 161113 0.0194     0      0
## ...
```

---

## TF-IDF of Kafka and Austen

This works relatively well for identifying signature words -- some represent content, some represent author style (e.g., contractions used by Kafka).

<img src="l10-text-ii_files/figure-html/signature_words-1.png" width="60%" style="display: block; margin: auto;" />

---

## Occurrence matrix

Lastly, we might want to convert our counts to an occurrence matrix `\(\mathbf X = [x_{ij}]\)`, where `\(x_{ij}\)` is the number of times document `\(i\)` contains term `\(j\)`. Most `\(x_{ij}\)` will be zero, reflecting Zipf's law.

We will almost always want to store it in a special format called a .alert[sparse matrix], which stores only the non-zero entries and their indices in the matrix.

---

## `cast_sparse()`

``` r
X = cast_sparse(word_counts, title, word, n)
class(X)
```

```
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
```

``` r
dim(X)
```

```
## [1]     4 11476
```

``` r
sum(X > 0)
```

```
## [1] 20760
```

This is useful for downstream modeling, such as latent Dirichlet allocation.

---

## What does `cast_sparse()` produce?

``` r
X[, 10001:10004]
```

```
## 4 x 4 sparse Matrix of class "dgCMatrix"
##               nosegay noted nothings notoriety
## Emma                .     .        .         .
## The Trial           .     1        .         .
## Persuasion          1     1        1         1
## Metamorphosis       .     .        .         .
```

``` r
rowSums(as.matrix(X))
```

```
##          Emma     The Trial    Persuasion Metamorphosis 
##        161113         84219         83707         22056
```

<!-- --- -->

<!-- # Resources -->

<!-- Julia Silge has [one book on classical text mining](https://www.tidytextmining.com/) and [another on machine learning on text](https://smltar.com/). -->

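---

## A sketch of downstream modeling

As a pointer toward that downstream modeling, here is a minimal sketch of fitting a two-topic LDA model. It is not part of the original lecture: it assumes the `topicmodels` package is installed, builds a `DocumentTermMatrix` with `cast_dtm()` (which `topicmodels` expects, rather than a `dgCMatrix`), and the choice of `k = 2` topics is arbitrary.

``` r
library(topicmodels)

# Build a DocumentTermMatrix from the same (title, word, n) counts
dtm = cast_dtm(word_counts, title, word, n)

# Fit a two-topic LDA model; the seed just makes the fit reproducible
lda_fit = LDA(dtm, k = 2, control = list(seed = 430))

# tidy() extracts the per-topic word probabilities ("beta");
# keep the 5 most probable words in each topic
tidy(lda_fit, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 5)
```

In practice you would remove stop words first (as above) so the topics are not dominated by common function words.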