4. Sampling#

  • Sometimes we need to assess model performance, but we cannot use the same data for both training and testing; therefore we need sampling strategies.

  • Training error (blue line): Training error, also known as in-sample error, is the error that occurs when a model is applied to the data it was trained on.

  • Testing error (red line): Testing error, or out-of-sample error, is the error that occurs when the model is applied to a new dataset that it hasn’t seen before, often a validation or test set.

image-3.png

4.1. Estimate Testing error#

  • When we use the training error to estimate the testing error, we usually get an underestimate. Some methods therefore make a mathematical adjustment to the training error rate in order to estimate the test error rate.

    • $-2\ln(L)$ plays the same role as RSS: in linear regression with Gaussian errors, $-2\log(L) = \text{RSS}/\sigma^2$ (up to an additive constant)
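
As a quick check of this relationship, here is a short derivation assuming a linear model with i.i.d. Gaussian errors of variance $\sigma^2$:

$$\ln L = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \;\Longrightarrow\; -2\ln L = n\ln(2\pi\sigma^2) + \frac{\text{RSS}}{\sigma^2}$$

The first term does not depend on the fitted coefficients, so ranking models by $-2\ln L$ is equivalent to ranking them by $\text{RSS}/\sigma^2$.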

4.1.1. AIC (Akaike Information Criterion)#

$$AIC = 2k - 2\ln(L)$$

  • k is the number of parameters in the model

  • L is the maximum value of the likelihood function for the model

    • Normally, we assume the errors (the offsets between observed and expected values) are normally distributed

4.1.2. BIC (Bayesian Information Criterion)#

$$BIC = \ln(n)\,k - 2\ln(L)$$

  • n is the number of observations

  • The other terms are as defined for AIC

  • When $n > 7$ (so that $\ln(n) > 2$), BIC puts a heavier penalty on the number of parameters than AIC

4.1.3. Example#

  • We want to compare a linear fit against higher-order polynomial terms in a linear regression, using AIC and BIC to select the degree

image-3.png

import numpy as np
import statsmodels.api as sm
# Generate some data
np.random.seed(0)
X = np.random.uniform(-10, 10, 100)
Y = 3 - 2*X + X**2 + np.random.normal(0, 2, 100)

# Fit models of varying complexity
results = {}
for degree in range(1, 6):
    features = np.column_stack([X**i for i in range(1, degree + 1)])
    features = sm.add_constant(features)
    model = sm.OLS(Y, features).fit()
    results[degree] = {'model': model, 'AIC': model.aic, 'BIC': model.bic}

# Find the best model according to AIC and BIC
best_by_aic = min(results, key=lambda k: results[k]['AIC'])
best_by_bic = min(results, key=lambda k: results[k]['BIC'])

print(f'Best model by AIC is of degree {best_by_aic} with AIC = {results[best_by_aic]["AIC"]}')
print(f'Best model by BIC is of degree {best_by_bic} with BIC = {results[best_by_bic]["BIC"]}')
Best model by AIC is of degree 2 with AIC = 425.7373213865948
Best model by BIC is of degree 2 with BIC = 433.55283194455905

4.2. Cross Validation#

  • In the standard validation-set approach, only a subset of the observations is used to fit the model (see the sketch after this list)

    • The validation error may tend to overestimate the test error for the model fit on the entire data set

    • For shrinkage methods (e.g. the Lasso and ridge regression), it is hard to determine the effective number of predictors $k$, which makes adjustments such as AIC and BIC difficult to apply

    • Cross-validation provides a direct estimate of the test error, instead of relying on an estimate of the error variance $\sigma^2$
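
For illustration, here is a minimal sketch of the validation-set approach (assuming scikit-learn is available; the simulated data and variable names are illustrative, mirroring the AIC/BIC example above). It shows how much the validation error of a straight-line fit can change from one random split to the next:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# Simulated quadratic data, as in the AIC/BIC example
np.random.seed(0)
X_sim = np.random.uniform(-10, 10, 100).reshape(-1, 1)
y_sim = 3 - 2*X_sim[:, 0] + X_sim[:, 0]**2 + np.random.normal(0, 2, 100)

# Repeat the 50/50 split with different random states to see how much
# the validation MSE of a straight-line fit fluctuates across splits
for seed in range(3):
    X_train, X_val, y_train, y_val = train_test_split(
        X_sim, y_sim, test_size=0.5, random_state=seed)
    fit = LinearRegression().fit(X_train, y_train)
    print(f'split {seed}: validation MSE = '
          f'{mean_squared_error(y_val, fit.predict(X_val)):.1f}')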

4.2.1. K-fold Cross Validation#

  • Let the $K$ parts be $C_1, C_2, \ldots, C_K$, where $C_k$ denotes the indices of the observations in part $k$. There are $n_k$ observations in part $k$; when $n$ is a multiple of $K$, $n_k = n/K$

  • Compute (regression)

    $$CV_{(K)} = \sum_{k=1}^{K} \frac{n_k}{n} \text{MSE}_k$$
  • For classification, compute

    $$CV_{(K)} = \sum_{k=1}^{K} \frac{n_k}{n} \text{Err}_k, \quad \text{where } \text{Err}_k = \frac{1}{n_k}\sum_{i \in C_k} I(y_i \ne \hat{y}_i)$$

  • $CV_{(K)}$ is basically the weighted average of the per-fold $\text{MSE}_k$ (or $\text{Err}_k$) values, weighted by the number of observations in each fold

  • Since each training set contains only $(K-1)/K$ of the observations, the estimates of prediction error will typically be biased upward.

  • K=5 or 10 provides a good compromise for bias-variance tradeoff.

image-3.png

4.2.2. Example using Cross Validation#

image-3.png
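
As a concrete illustration, here is a minimal sketch (assuming scikit-learn; it reuses the simulated polynomial data from the AIC/BIC example above) that computes $CV_{(K)}$ for polynomial fits of increasing degree, as the weighted average of the per-fold MSEs defined in Section 4.2.1:

from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

np.random.seed(0)
X = np.random.uniform(-10, 10, 100)
Y = 3 - 2*X + X**2 + np.random.normal(0, 2, 100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for degree in range(1, 6):
    features = np.column_stack([X**i for i in range(1, degree + 1)])
    fold_mse, fold_sizes = [], []
    for train_idx, test_idx in kf.split(features):
        fit = LinearRegression().fit(features[train_idx], Y[train_idx])
        fold_mse.append(mean_squared_error(Y[test_idx], fit.predict(features[test_idx])))
        fold_sizes.append(len(test_idx))
    # CV_(K) = sum_k (n_k / n) * MSE_k: weighted average of the per-fold MSEs
    cv_k = np.average(fold_mse, weights=fold_sizes)
    print(f'degree {degree}: CV estimate = {cv_k:.2f}')

As with AIC and BIC, the quadratic model should give the smallest CV estimate on this simulated data.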

4.2.3. Common mistakes in Cross Validation#

image-3.png

4.2.3.1. Feature selection before Cross validation#

  • Performing feature selection before cross-validation can lead to data leakage and overfitting, as you’re peeking at the entire dataset to select features based on their correlation with the target variable.

  • This can result in overly optimistic performance estimates.

4.2.3.2. Example#

  • We randomly generate 50 predictors from normal distributions with different means and variances, together with an independent random binary label Y

  • We select the 5 predictors having the largest absolute correlation with the random labels

  • Then we fit a logistic regression on these 5 predictors and compute the CV score

  • Theoretically, the CV score should be around 50%, since the predictors and the labels are simulated independently, but the result below gives us a much higher score

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from scipy.stats import pearsonr
# Set random seed for reproducibility
np.random.seed(5)

# Generate X with 50 random arrays following normal distribution with different mean and variance
X = np.random.normal(loc=np.random.uniform(-10, 10, size=50), scale=np.random.uniform(1, 5, size=50), size=(100, 50))

# Generate y as random labels from uniform distribution
y = np.random.randint(2, size=100)

# Step 2: Select the 5 features most correlated with the outcome
## This step is wrong: the feature selection has already seen
## all of the labels when choosing the most correlated predictors
correlations = np.array([pearsonr(X[:, i], y)[0] for i in range(X.shape[1])])
top_five_indices = np.argsort(np.abs(correlations))[-5:]

# Step 3: Perform cross-validation using only the selected features
selected_X = X[:, top_five_indices]
model = LogisticRegression()

# Use cross-validation to estimate the test error
cv_scores = cross_val_score(model, selected_X, y, cv=5)

# The mean cross-validated score can be misleadingly high due to feature selection bias
mean_cv_score = np.mean(cv_scores)
print("Cross-validation scores:", cv_scores)
print(f'Mean CV score: {mean_cv_score}')
Cross-validation scores: [0.55 0.7  0.45 0.75 0.75]
Mean CV score: 0.64
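
The correct approach is to perform the feature selection inside each training fold, so that the selection never sees the held-out labels. Below is a minimal sketch of one way to do this with scikit-learn's Pipeline and SelectKBest (reusing the X and y generated above; the univariate F-test selector stands in for the correlation-based selection):

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
import numpy as np

# The selector is refit on the training folds only inside cross_val_score,
# so the held-out fold's labels are never used to pick the features
pipeline = Pipeline([
    ('select', SelectKBest(score_func=f_classif, k=5)),
    ('logreg', LogisticRegression()),
])

correct_cv_scores = cross_val_score(pipeline, X, y, cv=5)
print("Cross-validation scores:", correct_cv_scores)
print(f'Mean CV score: {np.mean(correct_cv_scores)}')

Because the labels carry no signal, the mean CV score should now fall back to roughly chance level (around 0.5).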

4.3. Bootstrap#

  • The bootstrap is a flexible and powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical learning method.

    • For example, it provides the estimate of standard error of a coefficient, or a confidence interval for that coefficient.

4.3.1. Example for Bootstrap#

  • We want to invest our money in two financial assets that yield returns of X and Y

  • We will invest a fraction $\alpha$ of our money in X, and $1-\alpha$ in Y

  • We wish to choose $\alpha$ to minimize the total risk, or variance, of our investment, $f(\alpha) = \mathrm{Var}(\alpha X + (1-\alpha)Y)$:

    $$f(\alpha) = \alpha^2\,\mathrm{Var}(X) + (1-\alpha)^2\,\mathrm{Var}(Y) + 2\alpha(1-\alpha)\,\mathrm{Cov}(X, Y)$$

    $$\frac{\partial f}{\partial \alpha} = 2\alpha\,\mathrm{Var}(X) - 2(1-\alpha)\,\mathrm{Var}(Y) + 2(1-\alpha)\,\mathrm{Cov}(X, Y) - 2\alpha\,\mathrm{Cov}(X, Y)$$

  • Setting $\frac{\partial f}{\partial \alpha} = 0$, we get

    $$\alpha = \frac{\mathrm{Var}(Y) - \mathrm{Cov}(X, Y)}{\mathrm{Var}(X) + \mathrm{Var}(Y) - 2\,\mathrm{Cov}(X, Y)}$$

  • We can estimate $\mathrm{Var}(X)$, $\mathrm{Var}(Y)$ and $\mathrm{Cov}(X, Y)$ using sampling / simulation; a bootstrap sketch follows the figures below

image-3.png

image-3.png
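
A minimal sketch of how the bootstrap estimate of $\alpha$ and its standard error could be computed (the simulated returns and their distribution parameters below are illustrative, not taken from the original figures):

import numpy as np

np.random.seed(0)
n = 100
# Illustrative correlated returns for the two assets
cov = [[1.0, 0.5], [0.5, 1.25]]
returns = np.random.multivariate_normal([0, 0], cov, size=n)
x_ret, y_ret = returns[:, 0], returns[:, 1]

def alpha_hat(x, y):
    # alpha = (Var(Y) - Cov(X, Y)) / (Var(X) + Var(Y) - 2 Cov(X, Y))
    c = np.cov(x, y)
    return (c[1, 1] - c[0, 1]) / (c[0, 0] + c[1, 1] - 2 * c[0, 1])

# Each bootstrap data set resamples (X, Y) pairs with replacement,
# and has the same size as the original sample
B = 1000
boot_alphas = np.empty(B)
for b in range(B):
    idx = np.random.choice(n, size=n, replace=True)
    boot_alphas[b] = alpha_hat(x_ret[idx], y_ret[idx])

print(f'alpha estimate:           {alpha_hat(x_ret, y_ret):.3f}')
print(f'bootstrap standard error: {boot_alphas.std():.3f}')

The standard deviation of the bootstrap estimates serves as the bootstrap standard error of $\hat{\alpha}$.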

4.3.2. Why we need Bootstrap#

  1. In real-world analysis, we cannot generate new samples from the original population.

    • The bootstrap allows us to mimic the process of obtaining new data sets, so we can estimate the variability of our estimate without generating additional samples

    • Each bootstrap data set is created by sampling with replacement and is the same size as the original data set

    • A bootstrap sample overlaps with about two-thirds of the original observations, since the probability that a given observation appears at least once is $1 - (1 - 1/n)^n \approx 1 - 1/e \approx 0.632$

    image-3.png

  2. The bootstrap requires the observations to be independent of each other, so data with serial dependence, such as time series, cannot be bootstrapped directly

    • For time series, we can divide the data into contiguous blocks and then resample the blocks with replacement (the block bootstrap); a minimal sketch is given after the figure below

      image-3.png
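
A minimal sketch of the block bootstrap idea, using a fixed block length (the simulated series and the block length are illustrative):

import numpy as np

np.random.seed(0)
# Illustrative AR(1)-style series with serial dependence
series = np.zeros(120)
for t in range(1, len(series)):
    series[t] = 0.8 * series[t - 1] + np.random.normal(0, 1)

block_len = 10
# Split the series into contiguous, non-overlapping blocks
blocks = series.reshape(-1, block_len)

def block_bootstrap_sample(blocks):
    # Resample whole blocks with replacement and stitch them together,
    # preserving the dependence structure within each block
    idx = np.random.choice(len(blocks), size=len(blocks), replace=True)
    return np.concatenate(blocks[idx])

# Bootstrap distribution of the series mean
boot_means = [block_bootstrap_sample(blocks).mean() for _ in range(1000)]
print(f'estimated SE of the mean: {np.std(boot_means):.3f}')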