class: ur-title, center, middle, title-slide

.title[
# BST430 Lecture 20
]

.subtitle[
## Trees and Forests
]

.author[
### Tanzy Love
]

.institute[
### U of Rochester
]

.date[
### 2024-11-10 (updated: 2024-11-14)
]

---

# Overview

* Tree learners (CART)
* Bootstrap Aggregation (Bagging)
* Random Subspaces
* Random Forests
* Boosting
* Bayesian ensembles of trees (BART)
* Grafted and Vanishing Random Subspaces

Here's the [R code in this lecture](l20/l20-trees-and-forests.R)

---

## Classification and Regression Trees (CART)

.pull-left[
Proposed by Leo Breiman and colleagues in 1984

_Problem_: Linear regression models often perform poorly with complex real-world data

_Idea_: Fit simple regression models to different partitions of the covariate space to achieve a better overall fit

_Solution_: Partition the covariate space with a binary tree and fit a model within each subspace
]

.pull-right[
<img src="l20/Leo_Breiman.jpg" width="100%" style="display: block; margin: auto 0 auto auto;" />
]

---

## Growing a Regression Tree

* The data consists of `\(p\)` inputs and a response for each of `\(N\)` observations, that is, `\((x_i,y_i)\)` for `\(i = 1,\dots, N\)`, with `\(x_i = (x_{i1}, \dots, x_{ip})\)`
* The algorithm sequentially identifies a variable on which to make a split/partition as well as the corresponding split point/value
* CART partitions the covariate space into `\(M\)` distinct, non-overlapping regions `\(R_1, \dots, R_M\)` and models the response as a constant `\(\gamma_m\)` within each region:

$$ T(x) = \sum_{m=1}^M \gamma_m I(x \in R_m) $$

---

## Boston Housing Data Example

<img src="l20/fig_boston_partitions.png" width="100%" style="display: block; margin: auto 0 auto auto;" />

---

``` r
library(mlbench) #you probably need to install this library
data("BostonHousing")
names(BostonHousing)
```

```
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age" "dis" "rad"
## [10] "tax" "ptratio" "b" "lstat" "medv"
```

``` r
library(ggplot2) # if not already loaded
ggplot(BostonHousing, aes(x=lstat, y=rm, color=medv)) + geom_point()
```

<img src="l20-trees-and-forests_files/figure-html/unnamed-chunk-5-1.png" width="60%" style="display: block; margin: auto;" />

---

## Boston Housing Data Tree

.pull-left[
* The first split `\((\# \text{rooms} < 6.941)\)` partitions the space into `\(R_1\)` and `\(R_2\)`
* The second split `\((\% \text{lower status})\)` further divides the `\(R_2\)` subspace
]

.pull-right[
<img src="l20/fig_boston_cart_tree.png" width="100%" style="display: block; margin: auto 0 auto auto;" />
]

---

## Boston Housing Data Tree

.pull-left[
* The solution that minimizes the sum of squared errors uses the average of the `\(y_i\)` in region `\(R_m\)` as the estimate of `\(\gamma_m\)`

$$ \hat{\gamma}_m = \text{ave}(y_i|x_i \in R_m) $$
]

.pull-right[
<img src="l20/fig_boston_skeleton_partitions.png" width="100%" style="display: block; margin: auto 0 auto auto;" />
]

---

## How do we pick the splits?

* Consider a split variable `\(k\)` and a split value `\(s\)`, and define the pair of half-planes `\(R_1(k,s) = \{x : x_{k} \le s\}\)` and `\(R_2(k,s) = \{x : x_{k} > s\}\)`. We then seek the `\(k\)` and `\(s\)` solving

`\begin{equation} \underset{k,s}{\operatorname{min}} \Bigg[ \underset{\gamma_1}{\operatorname{min}} \sum_{x_i \in R_1(k,s)} (y_i-\gamma_1)^2 + \underset{\gamma_2}{\operatorname{min}} \sum_{x_i \in R_2(k,s)} (y_i-\gamma_2)^2\Bigg] \end{equation}`

* The inner minimizations are solved by `\(\hat{\gamma}_1 = \text{ave}(y_i|x_i \in R_1(k,s))\)` and `\(\hat{\gamma}_2 = \text{ave}(y_i|x_i \in R_2(k,s))\)`
* The partition is made on the best available split (greedy) identified by the criterion above, then the process is repeated within each of the resulting regions (a small sketch of this search is on the next slide)
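
---

## Greedy split search: a sketch

The chunk below is a minimal illustration of the split criterion above for a single predictor (it is *not* how `rpart` is implemented): scan candidate cut points and keep the one minimizing the two-region sum of squared errors. The helper name `best_split_1d` is ours, not from any package.

``` r
# Exhaustive search over midpoints of consecutive observed values of x,
# minimizing SSE(left region) + SSE(right region)
best_split_1d = function(x, y) {
  u = sort(unique(x))
  cand = (head(u, -1) + tail(u, -1)) / 2     # candidate cut points
  sse = sapply(cand, function(s) {
    left = y[x <= s]; right = y[x > s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  c(split = cand[which.min(sse)], sse = min(sse))
}
# best_split_1d(BostonHousing$rm, BostonHousing$medv)
# should recover a cut point very close to the rm < 6.941 split in the tree above
```

Repeating this search over every predictor at every node, and splitting on the overall winner, is the greedy tree-growing step.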
---

## How/when do we stop? Cost-Complexity Pruning

* Add up all the squared deviations from the leaf-level estimates
* Add a penalty `\(\alpha M\)` that grows with the number of partitions `\(M\)`
* Minimize the penalized sum of squared deviations

`\begin{equation} \sum_{m=1}^{M} \sum_{x_i \in R_m} (y_i - \hat{\gamma}_m)^2 + \alpha M \end{equation}`

---

``` r
library(rpart) #you probably need to install this library
#rpart(medv ~., data = BostonHousing, control = rpart.control(maxdepth = 2))
rpart(medv ~., data = BostonHousing, control = rpart.control(cp = .1))
```

```
## n= 506 
## 
## node), split, n, deviance, yval
##    * denotes terminal node
## 
## 1) root 506 42716.300 22.53281
##   2) rm< 6.941 430 17317.320 19.93372
##     4) lstat>=14.4 175 3373.251 14.95600 *
##     5) lstat< 14.4 255 6632.217 23.34980 *
##   3) rm>=6.941 76 6059.419 37.23816 *
```

``` r
library(rsample) # for initial_split(), training(), testing(), if not already loaded
set.seed(1234)
BostonSplit = initial_split(BostonHousing, prop=0.80)
trained1 = rpart(medv ~., data = training(BostonSplit), control = rpart.control(cp = .1))
cat("\n Sum Squared Error for one Tree", sum((predict(trained1, testing(BostonSplit)) - testing(BostonSplit)$medv)^2),"\n")
```

```
## 
##  Sum Squared Error for one Tree 3333.595
```

---

## Ensemble Methods

_Ensemble learning_: methods that join together *simple* models or *weak* learners to form a committee or ensemble

* Ensembles leverage the combined strength of their base models to achieve predictive performance greater than that of any individual learner
* Generally speaking, ensembles are stronger when there is disagreement and very little correlation among the learners
* "Diversity and independence are important because the best collective decisions are the product of disagreement and contest, not consensus or compromise." -James Surowiecki, The Wisdom of Crowds

---

## Bootstrap Aggregation (Bagging)

* Ensemble procedure that reduces variance in the estimate `\(\hat{f}(x)\)` by averaging over predictions from individual trees (bias unchanged)
* Bagging with trees:
  * Draw `\(B\)` bootstrapped samples from the data `\(\boldsymbol{Z} = \{(x_1,y_1), \dots, (x_n,y_n)\}\)`
  * For each bootstrap sample `\(\boldsymbol{Z}_{b}\)`, `\(b = 1,\dots,B\)`, fit a tree `\(T(x; \theta_b)\)` and obtain predictions, then average predictions across trees:

`\begin{equation} \hat{f}_\text{Bagged}(x) = \frac{1}{B} \sum_{b=1}^B T(x; \theta_b), \end{equation}`

where `\(\theta_b\)` characterizes the `\(b^\text{th}\)` tree (split variables, cut points, terminal node values)

---

``` r
library(ipred) #you probably need to install this library
bagging(medv ~., data = BostonHousing, nbagg=100)
```

```
## 
## Bagging regression trees with 100 bootstrap replications 
## 
## Call: bagging.data.frame(formula = medv ~ ., data = BostonHousing, 
##     nbagg = 100)
```

``` r
#bagging(medv ~., data = BostonHousing, nbagg=100)$mtrees[[25]]$btree
trained2 = bagging(medv ~., data = training(BostonSplit), nbagg=100)
cat("\n Sum Squared Error for Bagging", sum((predict(trained2, testing(BostonSplit)) - testing(BostonSplit)$medv)^2),"\n")
```

```
## 
##  Sum Squared Error for Bagging 1660.073
```

---

## Random Subspaces

* Proposed by Tin Kam Ho in 1998, also known as "attribute bagging"
* Random subspaces with trees:
  * Draw `\(B\)` randomly chosen subsets of the predictor variables (referred to as "feature subsets") from the data, each of size `\(r < p\)`
  * For each feature subset `\(X^{b}_{\{n \times r\}}\)`, `\(b = 1,\dots,B\)`, fit a tree `\(T(x; \theta_b)\)` and obtain the predictions, then average predictions over trees:

`\begin{equation} \small \hat{f}_\text{RSM}(x) = \frac{1}{B} \sum_{b=1}^B T(x; \theta_b) \end{equation}`

* Building trees on different feature subsets can reduce correlation between trees, making for a stronger ensemble (a do-it-yourself sketch follows on the next slide)
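
---

## Random Subspaces: a do-it-yourself sketch

None of the packages used in this lecture fit the random subspace method directly, so here is a minimal, illustrative version built from `rpart`. The helper names (`rsm_fit`, `rsm_predict`) and the choices `B = 100`, `r = 4` are ours, not from any package.

``` r
rsm_fit = function(train, B = 100, r = 4) {
  xvars = setdiff(names(train), "medv")
  lapply(seq_len(B), function(b) {
    feats = sample(xvars, r)                      # one random feature subset of size r < p
    rpart(reformulate(feats, response = "medv"), data = train)
  })
}
rsm_predict = function(trees, newdata) {
  rowMeans(sapply(trees, predict, newdata = newdata))   # average the B tree predictions
}
trees = rsm_fit(training(BostonSplit))
# sum((rsm_predict(trees, testing(BostonSplit)) - testing(BostonSplit)$medv)^2)
```

Each tree sees every observation but only `\(r\)` of the `\(p\)` predictors; compare with bagging, where each tree sees every predictor but only a bootstrap sample of the observations.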
---

## Random Forests

* Proposed by Leo Breiman in 2001
* RF bootstraps the observations *and* randomly samples the attributes offered to each split
* Improves the variance reduction of bagging by reducing correlation among trees in the ensemble
* Draw `\(B\)` bootstrapped samples from the data `\(\boldsymbol{Z} = \{(x_1,y_1), \dots, (x_n,y_n)\}\)`
* For each bootstrap sample `\(\boldsymbol{Z}_{b}\)`, `\(b = 1,\dots,B\)`,
  * select `mtry` `\(= r\)` of the variables at random as candidate split variables at each split
  * fit a tree `\(T(x; \theta_b)\)` using only those `mtry` candidates at each split
  * obtain the predictions, then average predictions over trees:

`\begin{equation} \small \hat{f}_\text{RF}(x) = \frac{1}{B} \sum_{b=1}^B T(x; \theta_b) \end{equation}`

---

``` r
library(randomForest) #you probably need to install this library
randomForest(medv ~., data = BostonHousing, mtry=4)
```

```
## 
## Call:
##  randomForest(formula = medv ~ ., data = BostonHousing, mtry = 4) 
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
## 
## Mean of squared residuals: 10.014
## % Var explained: 88.14
```

``` r
trained3 = randomForest(medv ~., data = training(BostonSplit), mtry=3)
cat("\n Sum Squared Error for randomForest", sum((predict(trained3, testing(BostonSplit)) - testing(BostonSplit)$medv)^2),"\n")
```

```
## 
##  Sum Squared Error for randomForest 1295.292
```

---

## Boosting

* Proposed by Robert Schapire in 1990; a sum-of-trees ensemble
* Fit tree `\(T(x, \theta_b)\)` to the residuals from the ensemble of all trees that came before it (instead of to `\(y\)`):

`\begin{equation} \small \hat{\theta_b} = \underset{\theta_b}{\operatorname{argmin}} \Bigg[ \sum_{i = 1}^n L(y_i, f_{b-1}(x_i) + T(x_i, \theta_b)) \Bigg], \end{equation}`

* Fitting each tree to ensemble residuals allows the ensemble to improve in areas where it performs poorly
* The boosted tree model is the sum over these trees

`\begin{equation} \hat{f}_\text{Boost}(x) = \sum_{b=1}^B T(x; \theta_b) \end{equation}`

---

## Boosting

* A shrunken version of the new tree is added to the ensemble

`\begin{equation} f_b(x) = f_{b-1}(x) + \omega T(x; \theta_b) \end{equation}`

`\begin{equation} \hat{f}_\text{Boost}(x) = \sum_{b=1}^B \omega T(x; \theta_b) \quad 0 \le \omega \le 1 \end{equation}`

* Shrinkage prevents any one tree from being overly influential
* The shrinkage parameter `\(\omega\)` controls the rate at which boosting learns (smaller `\(\omega\)` values with larger numbers of trees tend to work well; a small sweep is sketched a few slides ahead)

---

``` r
library(mboost) #you probably need to install this library
blackboost(medv ~., data = BostonHousing, control = boost_control(mstop = 50))
```

```
## 
## Model-based Boosting
## 
## Call:
## blackboost(formula = medv ~ ., data = BostonHousing, control = boost_control(mstop = 50))
## 
## 
## Squared Error (Regression) 
## 
## Loss function: (y - f)^2 
## 
## 
## Number of boosting iterations: mstop = 50 
## Step size: 0.1 
## Offset: 22.53281 
## Number of baselearners: 1
```

``` r
trained4 = blackboost(medv ~., data = training(BostonSplit))
cat("Sum Squared Error for Boosting", sum((predict(trained4, testing(BostonSplit)) - testing(BostonSplit)$medv)^2),"\n")
```

```
## Sum Squared Error for Boosting 1231.961
```

---

## Combining Ensemble Methods

* Much work has been done on combining data-partitioning methods with feature-partitioning methods, and also with boosting

<img src="l20/Algorithm_Map.PNG" width="100%" style="display: block; margin: auto 0 auto auto;" />
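
---

## Boosting: effect of shrinkage (a sketch)

A hedged sketch of the `\(\omega\)` trade-off mentioned on the boosting slides, using `mboost`'s names for the tuning parameters (`nu` is the step size `\(\omega\)`, `mstop` the number of boosting iterations); the specific values are illustrative only.

``` r
sse_boost = function(nu, mstop) {
  fit = blackboost(medv ~ ., data = training(BostonSplit),
                   control = boost_control(mstop = mstop, nu = nu))
  sum((predict(fit, testing(BostonSplit)) - testing(BostonSplit)$medv)^2)
}
# Smaller steps generally need more boosting iterations to reach a comparable fit:
# sse_boost(nu = 0.1,  mstop = 100)
# sse_boost(nu = 0.01, mstop = 1000)
```

In practice `\(\omega\)` and the number of trees are tuned together, e.g. by cross-validation, rather than compared on a single train/test split.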
---

## Bayesian Additive Regression Trees (BART)

* Instead of growing a tree with a greedy optimizer, we can put a prior distribution on trees and obtain a posterior distribution over trees
* If we build an ensemble of Bayesian trees in the right way, we get a BART model
* Details omitted

---

``` r
library(dbarts) #you probably need to install this library
set.seed(99)
bartFit <- dbarts::bart(BostonHousing[-14], BostonHousing$medv)
```

```
## 
## Running BART with numeric y
## 
## number of trees: 200
## number of chains: 1, number of threads 1
## tree thinning rate: 1
## Prior:
## k prior fixed to 2.000000
## degrees of freedom in sigma prior: 3.000000
## quantile in sigma prior: 0.900000
## scale in sigma prior: 0.002166
## power and base for tree prior: 2.000000 0.950000
## use quantiles for rule cut points: false
## proposal probabilities: birth/death 0.50, swap 0.10, change 0.40; birth 0.50
## data:
## number of training observations: 506
## number of test observations: 0
## number of explanatory variables: 13
## init sigma: 4.745298, curr sigma: 4.745298
## 
## Cutoff rules c in x<=c vs x>c
## Number of cutoffs: (var: number of possible c):
## (1: 100) (2: 100) (3: 100) (4: 100) (5: 100)
## (6: 100) (7: 100) (8: 100) (9: 100) (10: 100)
## (11: 100) (12: 100) (13: 100)
## Running mcmc loop:
## iteration: 100 (of 1000)
## iteration: 200 (of 1000)
## iteration: 300 (of 1000)
## iteration: 400 (of 1000)
## iteration: 500 (of 1000)
## iteration: 600 (of 1000)
## iteration: 700 (of 1000)
## iteration: 800 (of 1000)
## iteration: 900 (of 1000)
## iteration: 1000 (of 1000)
## total seconds in loop: 0.764487
## 
## Tree sizes, last iteration:
## [1] 4 3 3 3 3 3 2 2 2 2 3 2 2 2 2 2 1 2
## 3 3 2 3 4 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2
## 3 2 2 4 2 2 2 4 2 2 3 1 3 2 2 2 2 2 3 2
## 2 3 3 2 4 3 2 2 3 2 4 5 2 3 3 3 2 3 2 3
## 3 3 3 3 2 4 2 2 2 2 2 2 1 1 4 3 2 3 3 3
## 3 2 2 4 3 2 2 3 2 2 4 2 2 1 2 3 2 2 2 3
## 3 2 2 2 3 3 3 3 3 2 2 2 4 2 2 3 1 2 2 4
## 2 2 3 2 2 2 2 3 2 2 2 2 2 2 4 3 3 3 4 2
## 3 2 3 2 1 2 2 2 2 3 3 3 4 3 2 2 3 4 2 3
## 3 2 2 2 2 4 2 4 3 2 4 2 2 4 2 2 2 2 5 3
## 2 3
## 
## Variable Usage, last iteration (var:count):
## (1: 20) (2: 23) (3: 18) (4: 14) (5: 19)
## (6: 39) (7: 15) (8: 23) (9: 26) (10: 34)
## (11: 24) (12: 20) (13: 24)
## DONE BART
```
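
---

## BART gives posterior draws, not just point predictions

A hedged sketch of what you can do with the fit above: we assume, following the `dbarts`/`BayesTree` convention, that `bartFit$yhat.train` is a matrix of posterior draws of `\(f(x)\)` with one row per MCMC iteration and one column per training observation, so fitted values and pointwise credible intervals are just column summaries.

``` r
post_mean = colMeans(bartFit$yhat.train)                                   # posterior mean fit per house
post_ci   = apply(bartFit$yhat.train, 2, quantile, probs = c(.025, .975))  # 95% pointwise intervals

plot(BostonHousing$medv, post_mean,
     xlab = "observed medv", ylab = "posterior mean fit", pch = 16, cex = 0.6)
segments(BostonHousing$medv, post_ci[1, ], BostonHousing$medv, post_ci[2, ], col = "grey70")
abline(0, 1, lty = 2)
```

The same idea is used on the test set below, where `colMeans(predict(trained5, ...))` averages the posterior draws to get point predictions.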
---

``` r
plot(bartFit)
```

<img src="l20-trees-and-forests_files/figure-html/unnamed-chunk-14-1.png" width="60%" style="display: block; margin: auto;" />

``` r
trained5 = dbarts::bart(training(BostonSplit)[-14], training(BostonSplit)$medv,
                        keeptrees = TRUE) # must keep trees to do predictions
```

```
## 
## Running BART with numeric y
## 
## number of trees: 200
## number of chains: 1, number of threads 1
## tree thinning rate: 1
## Prior:
## k prior fixed to 2.000000
## degrees of freedom in sigma prior: 3.000000
## quantile in sigma prior: 0.900000
## scale in sigma prior: 0.002051
## power and base for tree prior: 2.000000 0.950000
## use quantiles for rule cut points: false
## proposal probabilities: birth/death 0.50, swap 0.10, change 0.40; birth 0.50
## data:
## number of training observations: 404
## number of test observations: 0
## number of explanatory variables: 13
## init sigma: 4.617533, curr sigma: 4.617533
## 
## Cutoff rules c in x<=c vs x>c
## Number of cutoffs: (var: number of possible c):
## (1: 100) (2: 100) (3: 100) (4: 100) (5: 100)
## (6: 100) (7: 100) (8: 100) (9: 100) (10: 100)
## (11: 100) (12: 100) (13: 100)
## Running mcmc loop:
## iteration: 100 (of 1000)
## iteration: 200 (of 1000)
## iteration: 300 (of 1000)
## iteration: 400 (of 1000)
## iteration: 500 (of 1000)
## iteration: 600 (of 1000)
## iteration: 700 (of 1000)
## iteration: 800 (of 1000)
## iteration: 900 (of 1000)
## iteration: 1000 (of 1000)
## total seconds in loop: 0.906150
## 
## Tree sizes, last iteration:
## [1] 2 2 3 3 4 2 2 1 3 2 2 2 2 2 3 2 3 2
## 3 2 4 2 3 2 2 2 3 2 2 4 2 3 2 2 1 3 2 3
## 3 2 2 3 2 3 2 3 4 2 3 2 2 2 2 3 2 2 2 3
## 3 3 4 2 2 2 3 3 2 2 2 2 2 2 4 5 3 4 2 3
## 2 2 2 1 4 2 3 2 2 2 2 4 4 2 2 3 3 2 3 2
## 4 2 2 2 2 4 2 2 2 2 2 1 3 2 2 3 3 3 2 3
## 5 3 2 2 2 2 2 2 2 2 2 2 3 2 2 5 2 5 2 2
## 2 3 2 3 2 2 2 2 2 2 2 2 2 3 1 1 2 2 2 2
## 2 3 3 3 3 3 2 3 2 2 2 2 3 2 1 5 2 2 4 4
## 3 3 2 2 4 2 2 2 2 3 2 2 2 4 3 2 3 2 2 2
## 6 2
## 
## Variable Usage, last iteration (var:count):
## (1: 14) (2: 11) (3: 18) (4: 16) (5: 31)
## (6: 32) (7: 23) (8: 24) (9: 18) (10: 25)
## (11: 21) (12: 28) (13: 33)
## DONE BART
```

``` r
cat("\n Sum Squared Error for BART", sum((colMeans(predict(trained5, testing(BostonSplit))) - testing(BostonSplit)$medv)^2),"\n")
```

```
## 
##  Sum Squared Error for BART 1100.773
```

---

## Test Set Comparison

```
## 
##  Sum Squared Error for one Tree 3333.595
```

```
## 
##  Sum Squared Error for Bagging 1660.073
```

```
## 
##  Sum Squared Error for randomForest 1295.292
```

```
## 
##  Sum Squared Error for Boosting 1231.961
```

```
## 
##  Sum Squared Error for BART 1100.773
```

---

## Grafted and Vanishing Random Subspaces

_Problem_:

* Procedures that use random sampling of the input variables for split candidates (e.g. Random Forests, RSM) suffer when the number of truly informative features `\(s\)` is small relative to `\(p\)`
* Feature (input variable) subsets are likely to contain many non-informative features
* Learners built on these subsets can be harmful to the ensemble

_Solution_:

* Allow each tree to share information regarding variable importance in its feature subset with the trees that come after it in the ensemble

---

# Acknowledgments

Adapted from Matthew Corsetti's [Thesis](https://www.urmc.rochester.edu/education/graduate/statistics-and-facts/2021-2022-thesis-titles.aspx)

Simplilearn [Lesson 8](https://www.simplilearn.com/tutorials/data-science-tutorial/random-forest-in-r)

## Resources

- [Hands-On Machine Learning with R](https://bradleyboehmke.github.io/HOML/random-forest.html) by Bradley Boehmke & Brandon Greenwell

---

## References

* Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Boca Raton, FL: CRC Press.
* Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
* Ho, T. K. (1995). Random Decision Forests. Proceedings of the 3rd International Conference on Document Analysis and Recognition, 278-282.
* Ho, T. K. (1998). The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), 832-844.
* Hastie, T., Tibshirani, R., & Friedman, J. H. (2004). The elements of statistical learning: Data mining, inference, and prediction. New York: Springer.
* Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5(2), 197-227.
* Corsetti, M. A., & Love, T. M. (2022). Grafted and Vanishing Random Subspaces. Pattern Analysis and Applications, 25(1), 89-124.