9. Tree-Based Methods#
Tree-based methods involve stratifying or segmenting the predictor space into a number of simple regions. They are popular because they are easy to interpret and mimic human decision-making.
9.1. Decision Trees#
9.1.1. Ideally Simple#
Think of a game of “20 Questions”. To identify an object, you ask yes/no questions that best split the possibilities. A Decision Tree does exactly this.
9.1.2. Terminology#
- Terminal Nodes (Leaves): The final regions/buckets where we assign a prediction.
- Internal Nodes: The points where the predictor space is split (e.g., \(X_1 < 5\)).
- Branches: The segments connecting nodes.
9.1.3. How to build a tree?#
We use Recursive Binary Splitting, a top-down, greedy approach.
- Top-down: Start at the top of the tree with all observations in a single region.
- Greedy: At each step, choose the split that is best at that particular step, rather than looking ahead and picking a split that might lead to a better tree later. A minimal sketch of one such greedy step follows.
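To make the greedy step concrete, here is a minimal sketch (not how scikit-learn implements it): for a single feature, scan every candidate threshold and keep the one with the lowest weighted impurity. It uses the Gini impurity defined in the next subsection, and the toy data is invented for illustration.

import numpy as np

def gini(y):
    # Gini impurity of a label vector: sum_k p_k * (1 - p_k)
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * (1 - p))

def best_split(x, y):
    # Greedy step: try the midpoint between each pair of adjacent sorted
    # feature values; keep the threshold with the lowest weighted child impurity.
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_t, best_score = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold can separate identical values
        t = (x_sorted[i] + x_sorted[i - 1]) / 2
        left, right = y_sorted[:i], y_sorted[i:]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Toy data: the classes separate cleanly around x = 3.5
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split(x, y))  # -> (3.5, 0.0)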
9.1.4. Classification Trees#
For classification, the RSS used for regression trees no longer makes sense as a splitting criterion. Instead, we aim for Node Purity (we want leaves to contain mostly one class). Measures include:
- Gini Index: a measure of total variance across the \(K\) classes.
- Entropy: another measure of impurity.
A small Gini/Entropy value indicates a pure node.
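For a node \(m\) with class proportions \(\hat{p}_{mk}\) (the fraction of training observations in the node belonging to class \(k\)), the standard definitions are:

\[
G = \sum_{k=1}^{K} \hat{p}_{mk}\,(1 - \hat{p}_{mk}),
\qquad
D = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}.
\]

Both are close to zero when every \(\hat{p}_{mk}\) is near 0 or 1, i.e. when the node is pure.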
9.1.5. Pros & Cons#
Pros: Highly interpretable, can handle qualitative predictors without dummy variables.
Cons: High Variance. A small change in data can cause a completely different tree.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# 1. Generate synthetic data
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42,
                           n_clusters_per_class=1)
# 2. Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# 3. Fit a Single Decision Tree
# max_depth limit helps prevent overfitting
tree_clf = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_clf.fit(X_train, y_train)
# 4. Predict
y_pred = tree_clf.predict(X_test)
print(f"Single Tree Accuracy: {accuracy_score(y_test, y_pred):.2f}")
# 5. Visualize the Tree
plt.figure(figsize=(10,6))
plot_tree(tree_clf, filled=True, feature_names=['Feature 1', 'Feature 2'],
          class_names=['Class 0', 'Class 1'])
plt.title("Decision Tree Visualization")
plt.show()
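To see the high-variance point in action, one quick check (a sketch, not part of the standard workflow) is to refit the same tree on a bootstrap resample of the training data and compare the root split; the exact numbers will vary with the resample.

from sklearn.utils import resample

# Refit the same model on a bootstrap resample of the training data;
# with a high-variance learner, even this small perturbation can change
# the fitted structure (compare the root split of the two trees).
X_boot, y_boot = resample(X_train, y_train, random_state=0)
tree_boot = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_boot.fit(X_boot, y_boot)
print("Original root: feature", tree_clf.tree_.feature[0],
      "threshold", round(float(tree_clf.tree_.threshold[0]), 3))
print("Bootstrap root: feature", tree_boot.tree_.feature[0],
      "threshold", round(float(tree_boot.tree_.threshold[0]), 3))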
9.2. Ensemble Methods: Bagging & Random Forests#
To fix the High Variance problem of single trees, we use Ensemble Methods (combining many trees).
9.2.1. Bagging (Bootstrap Aggregation)#
Imagine asking 100 people to guess the number of jellybeans in a jar. The average of their guesses is usually better than any single guess.
1. Take \(B\) bootstrap samples from the training data.
2. Train a deep tree on each.
3. Average the predictions (regression) or take a Majority Vote (classification).
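A minimal sketch with scikit-learn's BaggingClassifier, reusing the train/test split from above (100 bagged trees, majority vote at prediction time):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# 100 deep trees, each fit on a bootstrap sample of the training data;
# predict() combines them by majority vote.
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            random_state=42)
bag_clf.fit(X_train, y_train)
bag_pred = bag_clf.predict(X_test)
print(f"Bagging Accuracy: {accuracy_score(y_test, bag_pred):.2f}")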
9.2.2. Random Forests#
Bagging has a flaw: if there is one very strong predictor, all trees will use it as the top split, making the trees highly correlated. Averaging correlated trees doesn’t reduce variance much.
Random Forest Solution: At each split, we are ONLY allowed to consider a random subset of predictors (\(m \approx \sqrt{p}\)).
This forces different trees to rely on different predictors, which decorrelates them.
Result: Much more robust performance.
from sklearn.ensemble import RandomForestClassifier
# Fit a Random Forest
# n_estimators = number of trees; for classification, max_features
# defaults to 'sqrt', i.e. the m ≈ √p rule above
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
rf_pred = rf_clf.predict(X_test)
print(f"Random Forest Accuracy: {accuracy_score(y_test, rf_pred):.2f}")
9.3. Boosting#
Boosting is a different approach. Instead of building independent trees in parallel (like Bagging/RF), we build trees sequentially.
1. Train a small tree.
2. Look at where it made mistakes (the residuals).
3. Train the next tree specifically to fix those mistakes.
4. Add this new tree to the model, scaled by a learning rate.
Boosting usually has lower Bias than Bagging, but can overfit if not tuned carefully.
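To make the sequential idea concrete, here is a minimal hand-rolled sketch for regression with squared-error loss, using depth-1 trees (stumps) as the weak learners; the toy data is invented for illustration, and real libraries handle this far more generally.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data: noisy sine wave
rng = np.random.default_rng(0)
X_r = rng.uniform(-3, 3, size=(200, 1))
y_r = np.sin(X_r[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
pred = np.zeros_like(y_r)  # start from a zero prediction
for _ in range(100):
    residuals = y_r - pred                      # where we are still wrong
    stump = DecisionTreeRegressor(max_depth=1)  # small tree
    stump.fit(X_r, residuals)                   # fit the mistakes
    pred += learning_rate * stump.predict(X_r)  # shrunken update

print(f"Final training MSE: {np.mean((y_r - pred)**2):.4f}")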
from sklearn.ensemble import GradientBoostingClassifier
# Fit Gradient Boosting
gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gb_clf.fit(X_train, y_train)
gb_pred = gb_clf.predict(X_test)
print(f"Gradient Boosting Accuracy: {accuracy_score(y_test, gb_pred):.2f}")
9.4. Quiz#
Q1. Why do we use Random Forests instead of just Bagging?
A) To increase the bias of the model.
B) To decorrelate the trees and further reduce variance.
C) To make the model faster to train.
Q2. Which method builds trees sequentially?
A) Decision Trees
B) Random Forests
C) Boosting
Q3. In a classification tree, what does a Gini Index of 0 mean?
A) Maximum impurity (the node has an equal mix of classes).
B) The node is pure (all observations belong to one class).
C) The tree has no branches.
9.4.1. Sample Answers#
Q1: B). By restricting the features considered at each split, RF ensures the trees are diverse, making their average more stable.
Q2: C). Boosting builds trees sequentially, with each tree learning from the errors of the previous ones.
Q3: B). The Gini Index measures impurity; a value of 0 means perfect purity.