Chapter 6: Model Selection
Why We Need Model Selection¶
- Model selection is essential because it helps us choose the most appropriate model among a set of candidate models.
- In the case of linear models, such as linear regression, there are often multiple ways to specify the model, including
- Different combinations of predictors, polynomial terms, and interaction terms
- Selecting the best model ensures that our model performs well on unseen data and avoids overfitting or underfitting.
Three classes of methods¶
Subset Selection¶
Subset selection involves identifying and selecting a subset of predictors from a larger pool of variables.
- Model selection should use $C_p$, $AIC$, $BIC$, or deviance. Using $RSS$ or $R^2$ is inappropriate, since they always improve as more predictors are added and therefore measure training fit rather than test error.
- Deviance: negative two times the maximized log-likelihood
There are two main approaches to subset selection:
- Best Subset selection
- Computationally expensive, since it considers all $2^p$ possible models
- Forward and Backward Stepwise Selection
- Forward and backward stepwise selection are iterative approaches that start with either no predictors (forward selection) or all predictors (backward selection) and sequentially add or remove predictors based on certain criteria until the optimal subset is found.
- Greedy approximations; they are not guaranteed to find the best model (see the sketch after this list)
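As a concrete illustration of the stepwise approach, here is a minimal sketch of forward stepwise selection using scikit-learn's `SequentialFeatureSelector`. The synthetic data, the choice of `LinearRegression` as the base model, and the target of four selected predictors are assumptions made for the example.

```python
# Forward stepwise selection sketch (assumed data and settings, for illustration only).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: 10 candidate predictors, only 4 of which are informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=5.0, random_state=0)

# Forward selection: start with no predictors and greedily add the predictor
# that most improves cross-validated performance at each step.
sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=4,
                                direction="forward",
                                cv=5)
sfs.fit(X, y)
print("Selected predictor indices:", np.flatnonzero(sfs.get_support()))
```

Switching `direction="backward"` gives backward stepwise selection, starting from all predictors and removing one at a time.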
Estimating Test Error¶
- We can indirectly estimate test error by making an adjustment to the training error
- $C_p$, $AIC$, $BIC$, Adjusted $R^2$
- Adjusted $R^2$ = $1-\frac{RSS/(n-d-1)}{TSS/(n-1)}$
- $d$ is the number of predictors
- Adjusted $R^2$ does not generalize to other models such as logistic regression, and it has weaker theoretical support than $C_p$, $AIC$, and $BIC$
- We can directly estimate the test error using either a validation set approach or a cross-validation approach (see the sketch after this list).
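Both routes can be sketched in a few lines. The synthetic data, the plain `LinearRegression` model, and the 5-fold split below are illustrative assumptions rather than anything prescribed in the notes.

```python
# Indirect (adjusted R^2) and direct (cross-validation) test-error estimates.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=1)
n, d = X.shape

# Indirect estimate: adjusted R^2 = 1 - (RSS/(n-d-1)) / (TSS/(n-1))
model = LinearRegression().fit(X, y)
rss = np.sum((y - model.predict(X)) ** 2)
tss = np.sum((y - y.mean()) ** 2)
adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))

# Direct estimate: 5-fold cross-validated MSE
cv_mse = -cross_val_score(LinearRegression(), X, y,
                          scoring="neg_mean_squared_error", cv=5).mean()
print(f"Adjusted R^2: {adj_r2:.3f}, CV estimate of test MSE: {cv_mse:.1f}")
```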
One-standard-error rule¶
- Calculate the mean and the standard error of the estimated test MSE for each model size
- Select the model with the smallest $k$ whose estimated test MSE lies within one standard error of the lowest point on the curve (see the sketch below)
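A minimal numpy sketch of the rule, assuming a hypothetical `cv_errors` array of cross-validated test-MSE estimates (rows are folds, columns are model sizes $k = 1, \dots, 8$); the numbers are made up for illustration.

```python
# One-standard-error rule on an assumed fold-by-model-size matrix of CV errors.
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical example: 10 folds x 8 candidate model sizes of estimated test MSE.
cv_errors = rng.normal(loc=[9, 7, 5, 4.2, 4.0, 4.1, 4.3, 4.5],
                       scale=0.8, size=(10, 8))

mean_mse = cv_errors.mean(axis=0)                     # mean estimated test MSE per model size
se_mse = cv_errors.std(axis=0, ddof=1) / np.sqrt(10)  # standard error of that mean

best = np.argmin(mean_mse)                            # model size with the lowest mean CV error
threshold = mean_mse[best] + se_mse[best]             # one standard error above the minimum
k_one_se = np.flatnonzero(mean_mse <= threshold)[0] + 1  # smallest k within one standard error

print("k chosen by minimum CV error:", best + 1)
print("k chosen by the one-standard-error rule:", k_one_se)
```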
Shrinkage Methods¶
Shrinkage methods, also known as regularization techniques, penalize the coefficients of predictors to discourage overfitting and improve the generalization ability of the model.
- Shrinkage methods are computationally efficient and can handle a large number of predictors.
- Lasso regression, in particular, performs variable selection by setting some coefficients to zero, leading to sparse models.
Ridge Regression¶
- Ridge regression adds a penalty on the coefficients to the residual sum of squares $$ RSS = \sum_{i=1}^{n}\left(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\right)^2 $$
- Ridge regression seeks to minimize the quantity $RSS + \lambda \sum_{j=1}^{p}\beta_j^2$, where
- $\lambda \geq 0$ is a tuning parameter
- $\lambda \sum_{j=1}^{p}\beta_j^2$ is called a shrinkage penalty
- It is small when the coefficients are small, so it shrinks the estimates of $\beta_j$ towards zero
- Since the penalty is based on the squared coefficients, the ridge regression estimates are sensitive to the scale of the predictors
- It's best to apply ridge regression after standardizing the predictors
- Ridge regression pushes coefficients toward 0, but never sets them exactly to 0 (see the sketch below) $$ RSS + \lambda \sum_{j=1}^{p}\beta_j^2 = \sum_{i=1}^{n}\left(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p}\beta_j^2 $$
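A short sketch of ridge regression with standardized predictors, using scikit-learn's `Ridge` inside a pipeline; the synthetic data and the value of `alpha` (scikit-learn's name for $\lambda$) are arbitrary assumptions.

```python
# Ridge regression with standardization (assumed data and alpha, for illustration).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=2)

# Standardize the predictors first so the squared penalty treats them on a common scale.
ridge = make_pipeline(StandardScaler(), Ridge(alpha=10.0))
ridge.fit(X, y)

coefs = ridge.named_steps["ridge"].coef_
# Coefficients are shrunk toward zero, but none of them are exactly zero.
print("Number of exactly-zero coefficients:", int(np.sum(coefs == 0)))
```

Putting the scaler inside the pipeline keeps the standardization tied to the model, so it is re-fit on the training portion of each fold if the pipeline is later cross-validated.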
Lasso Regression¶
- Similarly, the lasso adds a penalty on the coefficients to the $RSS$, but uses absolute values of the coefficients rather than their squares
- In statistical parlance, the lasso uses an $l_1$ penalty instead of an $l_2$ penalty.
- The $l_1$ penalty has the effect of forcing some of the coefficient estimates to be exactly 0. Therefore, the lasso yields sparse models, that is, models that involve only a subset of the variables.
- The lasso seeks to minimize (see the sketch below) $$ RSS + \lambda \sum_{j=1}^{p}|\beta_j| = \sum_{i=1}^{n}\left(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p}|\beta_j| $$
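A companion sketch for the lasso, again with assumed data and an assumed `alpha`, showing that with a sufficiently large penalty some coefficient estimates come out exactly zero.

```python
# Lasso regression sketch (assumed data and alpha): note the exactly-zero coefficients.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=10.0, random_state=3)

lasso = make_pipeline(StandardScaler(), Lasso(alpha=5.0))
lasso.fit(X, y)

coefs = lasso.named_steps["lasso"].coef_
print("Lasso coefficients:", np.round(coefs, 2))
print("Number of exactly-zero coefficients:", int(np.sum(coefs == 0)))
```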
Why the lasso gives exactly-zero estimates¶
- Geometrically, the $l_1$ constraint region is a diamond with corners on the coordinate axes, so the contours of the RSS often first touch the constraint region at a corner, where one or more coefficients equal exactly zero. The $l_2$ constraint region is a circle (a sphere in higher dimensions) with no corners, which is why ridge estimates are shrunk but rarely exactly zero.
Parameter Tuning of Lasso and Ridge¶
- Since we push coefficients to 0 to produce a sparse model, using $C_p$, $AIC$, or $BIC$ may not be a good idea, because the effective number of predictors $d$ is hard to pin down
- Cross-validation provides a simple way to tackle this unknown-number-of-predictors problem (see the sketch below)
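A sketch of choosing $\lambda$ by cross-validation using scikit-learn's built-in `LassoCV` and `RidgeCV`; the data and the grid of candidate `alpha` values are assumptions made for illustration.

```python
# Selecting lambda (alpha) by cross-validation for lasso and ridge (assumed data/grid).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=15.0, random_state=4)

alphas = np.logspace(-3, 2, 50)  # candidate values of the tuning parameter

lasso_cv = make_pipeline(StandardScaler(), LassoCV(alphas=alphas, cv=10))
ridge_cv = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=10))
lasso_cv.fit(X, y)
ridge_cv.fit(X, y)

print("Lasso lambda chosen by CV:", lasso_cv.named_steps["lassocv"].alpha_)
print("Ridge lambda chosen by CV:", ridge_cv.named_steps["ridgecv"].alpha_)
```

No penalty on the effective number of predictors is needed here: the cross-validated error itself serves as the estimate of test error used to pick $\lambda$.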