2. Regression#

2.1. Simple Linear Regression#

$Y = \beta_0 + \beta_1 x + e$

  • Y represents the dependent variable or the variable we are trying to predict or explain.

  • x represents the independent variable or the predictor variable.

  • β0 is the intercept of the regression line, which is the predicted value of Y when x is zero.

  • β1 is the slope of the regression line, representing the average change in Y for a one-unit change in x.

  • e stands for the error term (also known as the residual), which is the difference between the observed values and the values predicted by the model.


# Import necessary libraries
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
# Generate some random data for demonstration
np.random.seed(0) # Seed for reproducibility
x = np.random.rand(100, 1) # 100 random numbers for independent variable
y = 2 + 3 * x + np.random.randn(100, 1) # Dependent variable with some noise

# Create a linear regression model
model = LinearRegression()

# Fit the model with our data (x - independent, y - dependent)
model.fit(x, y)

# Print the coefficients
print("Intercept (beta_0):", model.intercept_)
print("Slope (beta_1):", model.coef_)
Intercept (beta_0): [2.22215108]
Slope (beta_1): [[2.93693502]]
# Use the model to make predictions
y_pred = model.predict(x)

# Plotting
plt.scatter(x, y, color='blue') # actual data points
plt.plot(x, y_pred, color='red') # our model's predictions
plt.title('Simple Linear Regression')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

2.1.1. Finding the Best Estimator of β1#

2.1.1.1. Ordinary Least Squares#

  • The goal is to find the values of β0 and β1 that minimize the sum of the squared differences (residuals) between the observed values and the values predicted by the linear model.

  • Minimize $e = \sum_i \big(y_i - (\beta_0 + \beta_1 x_i)\big)^2$, where $y_i$ and $x_i$ are the observed values.

  • Steps to derive the estimators

  1. Take the partial derivative of e with respect to the intercept β0 and set it equal to 0

    • $\frac{\partial e}{\partial \beta_0} = \sum_i 2\big(y_i - \beta_0 - \beta_1 x_i\big)(-1) = 0$

    • $\sum_i \beta_1 x_i + n\beta_0 - \sum_i y_i = 0$

    • $\sum_i \beta_1 x_i + n\beta_0 - \sum_i y_i = 0 \;\Rightarrow\; n\beta_1\bar{x} + n\beta_0 - n\bar{y} = 0$

    • $n\beta_1\bar{x} + n\beta_0 - n\bar{y} = 0 \;\Rightarrow\; \beta_1\bar{x} + \beta_0 - \bar{y} = 0$

    • $\beta_1\bar{x} + \beta_0 - \bar{y} = 0 \;\Rightarrow\; \beta_0 = \bar{y} - \beta_1\bar{x}$

  2. Take the partial derivative of e with respect to the slope β1 and set it equal to 0

    • $\frac{\partial e}{\partial \beta_1} = \sum_i 2\big(y_i - \beta_1 x_i - \beta_0\big)(-x_i) = 0$

    • $\sum_i 2(y_i - \beta_1 x_i - \beta_0)(-x_i) = 0 \;\Rightarrow\; \sum_i \big(\beta_1 x_i^2 + \beta_0 x_i - x_i y_i\big) = 0$

    • Replace β0 with $(\bar{y} - \beta_1\bar{x})$: $\sum_i \big(\beta_1 x_i^2 + (\bar{y} - \beta_1\bar{x})x_i - x_i y_i\big) = 0$

    • $\beta_1\big(\sum_i x_i^2 - \bar{x}\sum_i x_i\big) = \sum_i x_i y_i - \bar{y}\sum_i x_i \;\Rightarrow\; \beta_1 = \dfrac{\sum_i x_i y_i - \bar{y}\sum_i x_i}{\sum_i x_i^2 - \bar{x}\sum_i x_i}$

    • Using the summation identities $\sum_i x_i y_i - \bar{y}\sum_i x_i = \sum_i (x_i - \bar{x})(y_i - \bar{y})$ and $\sum_i x_i^2 - \bar{x}\sum_i x_i = \sum_i (x_i - \bar{x})^2$:

    • We will have $\beta_1 = \dfrac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)}$
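
As a quick numerical check of this result, here is a small sketch (it reuses the simulated x, y and the fitted model from the first code cell; np.cov and np.var are just one way to compute the sample covariance and variance):

# Sanity check: beta_1 should equal Cov(X, Y) / Var(X)
cov_xy = np.cov(x.ravel(), y.ravel(), ddof=1)[0, 1]  # sample covariance of x and y
var_x = np.var(x, ddof=1)                            # sample variance of x

print("beta_1 from Cov/Var:", cov_xy / var_x)
print("beta_1 from sklearn:", model.coef_[0][0])     # should agree with the value above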

2.1.2. Assessing the Accuracy of Coefficient Estimates#

  • $SE(\beta_1)^2 = \dfrac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}$

  • $SE(\beta_0)^2 = \sigma^2\left[\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right]$

    • Where $\sigma^2 = \mathrm{Var}(e)$

  • These two standard errors can be used to compute confidence intervals; for example, a 95% confidence interval for β1 has the form $[\beta_1 - 2\,SE(\beta_1),\; \beta_1 + 2\,SE(\beta_1)]$
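
As an illustration, here is a rough sketch (it reuses the simulated x, y and the fitted model from the earlier cell; estimating σ² from the residuals as RSS/(n − 2) is an assumption of this sketch):

# Sketch: standard error of beta_1 and an approximate 95% confidence interval
n = len(x)
residuals = y - model.predict(x)

sigma2_hat = np.sum(residuals**2) / (n - 2)                  # estimate of sigma^2 = Var(e)
se_beta_1 = np.sqrt(sigma2_hat / np.sum((x - x.mean())**2))  # SE(beta_1)

beta_1_hat = model.coef_[0][0]
print("SE(beta_1):", se_beta_1)
print("Approx. 95% CI:", (beta_1_hat - 2 * se_beta_1, beta_1_hat + 2 * se_beta_1))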

2.1.3. Hypothesis Testing#

  • Standard errors can be used to perform hypothesis tests on coefficients.

  • To test the null hypothesis, we compute a t-statistic, given by
    $t = \dfrac{\beta_1 - 0}{SE(\beta_1)}$

    • This value follows a t-distribution with n − 2 degrees of freedom

    • $H_0$ assumes $\beta_1 = 0$

    • To reject $H_0: \beta_1 = 0$ at the 5% level, the interval $[\beta_1 - 2\,SE(\beta_1),\; \beta_1 + 2\,SE(\beta_1)]$ should not contain 0
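
Continuing the sketch above (it reuses beta_1_hat, se_beta_1, and n, which are names introduced there, not in the text), the t-statistic and its two-sided p-value could be computed as:

# Hypothesis test H0: beta_1 = 0, reusing beta_1_hat, se_beta_1, and n from the sketch above
from scipy import stats

t_stat = (beta_1_hat - 0) / se_beta_1
p_value = 2 * stats.t.sf(np.abs(t_stat), df=n - 2)   # two-sided p-value, n - 2 degrees of freedom
print("t-statistic:", t_stat, "p-value:", p_value)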


2.1.4. Assessing the Overall Accuracy of the Model#

  • We compute the Residual Standard Error

    $RSE = \sqrt{\dfrac{1}{n-2}RSS} = \sqrt{\dfrac{1}{n-2}\sum_{i=1}^n (y_i - \hat{y}_i)^2}$

    • Where RSS is the residual sum-of-squares

  • We can also use R-squared (fraction of variance explained):

    $R^2 = \dfrac{TSS - RSS}{TSS} = 1 - \dfrac{RSS}{TSS}$

    • Where $TSS = \sum_{i=1}^n (y_i - \bar{y})^2$ is the total sum of squares

    • Also, in the simple linear regression setting, $R^2 = r^2$, where r is the correlation between X and Y:

    $r = \dfrac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}$

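
As a quick numerical check, here is a sketch (reusing x, y, y_pred, and model from the first fitted model above) that computes RSE and R² directly from these formulas:

# Sketch: RSE and R^2 for the first fitted model
rss = np.sum((y - y_pred)**2)        # residual sum of squares
tss = np.sum((y - y.mean())**2)      # total sum of squares

rse = np.sqrt(rss / (len(y) - 2))    # residual standard error
r_squared = 1 - rss / tss            # fraction of variance explained

print("RSE:", rse)
print("R^2:", r_squared)
print("R^2 from sklearn:", model.score(x, y))  # should match r_squared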

# Example data
x = np.array([1., 2., 3., 4., 5.]).reshape(-1, 1)
y = np.array([2., 4., 5., 8., 7.])

# Calculating means
x_mean = np.mean(x)
y_mean = np.mean(y)

# Calculating Beta_1 using the OLS formula: Cov(x, y) / Var(x)
numerator = np.sum((x.ravel() - x_mean) * (y - y_mean))   # sum of (x_i - x_bar)(y_i - y_bar)
denominator = np.sum((x.ravel() - x_mean)**2)             # sum of (x_i - x_bar)^2
beta_1 = numerator / denominator

print("Beta_1 (slope) using OLS:", beta_1)
Beta_1 (slope) using OLS: 1.4

2.1.5. Maximum likelihood estimation#

  • In the context of linear regression, MLE assumes that the residuals (differences between observed and predicted values) are normally distributed.

  • The method finds the parameter values that maximize the likelihood of observing the given data.
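
To see that MLE under the normality assumption recovers the same coefficients as OLS, here is a minimal sketch (it reuses the small example x and y defined above; scipy.optimize and the neg_log_likelihood helper are assumptions of this sketch, not part of the notes):

# Sketch: maximum likelihood estimation for simple linear regression
from scipy.optimize import minimize
from scipy.stats import norm

def neg_log_likelihood(params, x, y):
    # Negative log-likelihood of y under y = beta_0 + beta_1 * x + e, with e ~ N(0, sigma^2)
    beta_0, beta_1, sigma = params
    y_hat = beta_0 + beta_1 * x
    return -np.sum(norm.logpdf(y, loc=y_hat, scale=sigma))

# Initial guesses for beta_0, beta_1, sigma; sigma is kept positive via a bound
result = minimize(
    neg_log_likelihood,
    x0=[0.0, 1.0, 1.0],
    args=(x.ravel(), y),
    bounds=[(None, None), (None, None), (1e-6, None)],
)
print("MLE estimates (beta_0, beta_1, sigma):", result.x)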

2.2. Multiple Linear Regression#

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + e$

  • Correlations amongst predictors cause problems (multicollinearity):

    • The variances of the coefficients tend to increase, sometimes dramatically (illustrated in the sketch after this list).

    • Since $t = \dfrac{\beta_1 - 0}{SE(\beta_1)}$, a larger $SE(\beta_1)$ pushes t toward 0, which leads to a larger p-value

    • The individual coefficients also become hard to interpret.

  • Claims of causality should be avoided!
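
To make the multicollinearity point concrete, here is a small simulation sketch (all names and numbers below are illustrative assumptions): the variance inflation factor VIF = 1 / (1 − R²), where R² comes from regressing one predictor on the others, blows up when predictors are strongly correlated.

# Sketch: correlated predictors inflate the variance of coefficient estimates
rng = np.random.default_rng(0)
m = 200

x1 = rng.normal(size=m)
x2_corr = x1 + 0.1 * rng.normal(size=m)   # nearly collinear with x1
x2_indep = rng.normal(size=m)             # unrelated to x1

def vif(target, other):
    # Variance inflation factor of `target` given one other predictor
    r2 = LinearRegression().fit(other.reshape(-1, 1), target).score(other.reshape(-1, 1), target)
    return 1 / (1 - r2)

print("VIF of x1 with a nearly collinear partner:", vif(x1, x2_corr))
print("VIF of x1 with an independent partner:   ", vif(x1, x2_indep))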


2.2.1. Important Questions (Hypothesis Testing)#

  1. Is at least one of the predictors X1,X2,...,Xp useful in predicting the response?

    • For this question, we use the F-statistic (see the statsmodels sketch after this list)

    • $F = \dfrac{(TSS - RSS)/p}{RSS/(n - p - 1)} \sim F_{p,\, n-p-1}$

      • Where n is the number of observations, p is the number of predictors

    • H0: None of these predictors are useful


    • If H0 is false, we expect F>1

  2. Do all the predictors help to explain Y, or is only a subset of the predictors useful?

    • Forward Selection

      • Begin with the null model

      • Fit p simple linear regressions and add to the null model the variable that results in the lowest RSS (a greedy code sketch follows after this list)

      • Add to that model the variable that results in the lowest RSS amongst all two-variable models.

      • Continue until a stopping rule is satisfied (e.g., the p-value is greater than 0.05 for all remaining variables)

    • Backward Selection

      • Begin with the full model containing all predictors, then iteratively remove the least useful variable (e.g., the one with the largest p-value), refitting after each removal until a stopping rule is reached.

    • Model Selection

      • Besides RSS, there are other criteria for choosing an “optimal” model in stepwise searching, including the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and adjusted R-squared

  3. How well does the model fit the data?

  4. Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
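
To connect these questions to code, here is a rough sketch using statsmodels (an assumed dependency; the data below are simulated and the variable names are illustrative). The fitted summary reports the F-statistic for question 1, per-coefficient t-tests for question 2, and R² for question 3.

# Sketch: multiple linear regression with statsmodels
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_obs = 100
X = rng.normal(size=(n_obs, 3))                                    # three candidate predictors
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n_obs)   # the third predictor is irrelevant

X_design = sm.add_constant(X)               # add the intercept column
results = sm.OLS(y, X_design).fit()

print(results.summary())                    # F-statistic, t-statistics, p-values, R^2
print("F-statistic:", results.fvalue, "p-value:", results.f_pvalue)

And a minimal forward-selection sketch, greedy by RSS (it reuses X and y from the block above; the helper name is an assumption):

# Sketch: forward selection by RSS
def forward_selection_by_rss(X, y, n_keep):
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(n_keep):
        rss_per_candidate = {}
        for j in remaining:
            cols = selected + [j]
            fit = LinearRegression().fit(X[:, cols], y)
            rss_per_candidate[j] = np.sum((y - fit.predict(X[:, cols]))**2)
        best = min(rss_per_candidate, key=rss_per_candidate.get)  # candidate with the lowest RSS
        selected.append(best)
        remaining.remove(best)
    return selected

print("Predictors chosen by forward selection:", forward_selection_by_rss(X, y, n_keep=2))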

2.2.2. Polynomial regression (non-linear effects)#

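Polynomial regression keeps the model linear in the coefficients while adding powers of a predictor as extra columns of the design matrix. A minimal sketch (simulated data; sklearn's PolynomialFeatures is one convenient way to build the powers):

# Sketch: quadratic regression, y modeled as beta_0 + beta_1*x + beta_2*x^2
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=(100, 1))
y = 1 + 2 * x - 0.5 * x**2 + rng.normal(scale=0.5, size=(100, 1))   # true relationship is quadratic

X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)   # columns: x, x^2
poly_model = LinearRegression().fit(X_poly, y)

print("Intercept:", poly_model.intercept_)
print("Coefficients for x and x^2:", poly_model.coef_)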

2.3. Interesting Quotes by famous Statisticians#

  • Essentially, all models are wrong, but some are useful

    • George Box

  • The only way to find out what will happen when a complex system is disturbed is to disturb the system, not merely to observe it passively

    • Fred Mosteller and John Tukey, paraphrasing George Box