C7 Beyond Linearity
- The truth is not always linear: many real relationships between the predictors and the response are non-linear, so we need models that relax the linearity assumption.
Polynomial Regression
$$ y_i=\beta_0+\beta_1x_i + \beta_2x_i^2+\beta_3x_i^3+\dots+\beta_dx_i^d+\epsilon_i $$
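As a minimal sketch (the simulated data and the degree $d=3$ are arbitrary choices for illustration), a polynomial regression is just OLS on powers of $x$:

import numpy as np
import statsmodels.api as sm

# Simulate data from a cubic trend with noise
rng = np.random.default_rng(0)
x_demo = rng.uniform(-3, 3, 200)
y_demo = 1 + 2 * x_demo - 0.5 * x_demo**3 + rng.normal(0, 1, 200)

# Design matrix with columns [1, x, x^2, x^3]
X_poly = np.vander(x_demo, N=4, increasing=True)
poly_fit = sm.OLS(y_demo, X_poly).fit()
print(poly_fit.params)  # estimates of beta_0 .. beta_3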
Step Functions
- Using polynomial functions of the features as predictors in a linear model imposes a global structure on the non-linear function of $X$.
- We can instead use step functions in order to avoid imposing such a global structure.
- This amounts to converting a continuous variable into an ordered categorical variable: divide the range of the predictor into discrete bins and fit a separate constant in each bin (see the sketch below).
- Breakpoint test statistics can help choose where to place the cut points.
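As a concrete illustration, here is a minimal sketch using pandas; the bin edges are arbitrary choices, not values suggested by a breakpoint test:

import numpy as np
import pandas as pd

# Convert a continuous predictor into an ordered categorical variable
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, 200)
bins = pd.cut(x_demo, bins=[0, 2.5, 5.0, 7.5, 10.0], include_lowest=True)

# One-hot encode the bins (dropping one level as the baseline) so they
# can be used as step-function predictors in a linear model
X_step = pd.get_dummies(bins, drop_first=True).astype(float)
print(X_step.head())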
Breakpoint Statistics
- Breakpoint checking is a statistical technique used to identify structural changes, or breakpoints, in a data series.
- These changes can indicate shifts in the underlying processes, trends, or relationships within the data.
Chow Test
- The Chow test statistic is calculated as follows: $$ F=\frac{(RSS_r-(RSS_1+RSS_2))/k}{(RSS_1+RSS_2)/(n_1+n_2-2k)} $$
  - $RSS_r$ is the residual sum of squares from the restricted model with all data combined.
  - $RSS_1$ and $RSS_2$ are the residual sums of squares from the separate models for each segment.
  - $k$ is the number of parameters estimated in each segment.
  - $n_1$ and $n_2$ are the numbers of observations in each segment.
In [3]:
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
In [6]:
# Generate two sections of data with different linear functions
np.random.seed(0)

# First section
n1 = 50
x1 = np.linspace(0, 5, n1)
y1 = 2 * x1 + np.random.normal(0, 1, n1)

# Second section
n2 = 50
x2 = np.linspace(5, 10, n2)
y2 = 3 * x2 + np.random.normal(0, 1, n2)

# Merge the sections
x = np.concatenate([x1, x2])
y = np.concatenate([y1, y2])

# Fit the restricted model once, with all data combined
model_combined = sm.OLS(y, sm.add_constant(x)).fit()
RSS_r = model_combined.ssr
k = 2  # number of parameters estimated in each segment (intercept, slope)

# Initialize variables to store the best breakpoint and test statistic
best_breakpoint = None
best_F = 0

# Iterate over potential breakpoints, keeping at least k observations
# in each segment so both segment fits are well defined
for i in range(k, len(x) - k):
    x_left, y_left = x[:i], y[:i]
    x_right, y_right = x[i:], y[i:]

    # Fit separate models for each segment
    model1 = sm.OLS(y_left, sm.add_constant(x_left)).fit()
    model2 = sm.OLS(y_right, sm.add_constant(x_right)).fit()
    RSS_1, RSS_2 = model1.ssr, model2.ssr
    n_left, n_right = len(x_left), len(x_right)

    # Calculate the Chow test statistic
    F_chow = ((RSS_r - (RSS_1 + RSS_2)) / k) / ((RSS_1 + RSS_2) / (n_left + n_right - 2 * k))

    # Keep the breakpoint with the largest F statistic
    if F_chow > best_F:
        best_F = F_chow
        best_breakpoint = i
In [7]:
# Visualize results
plt.figure(figsize=(10, 6))
plt.scatter(x, y, label='Data')
plt.axvline(x=x[best_breakpoint], color='r', linestyle='--', label=f'Best Breakpoint at {x[best_breakpoint]:.2f}')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Chow Test: Best Breakpoint')
plt.legend()
plt.grid(True)
plt.show()
print("Best Breakpoint:", x[best_breakpoint])
print("Chow Test Statistic:", best_F)
Best Breakpoint: 5.0
Chow Test Statistic: 157.58585701856043
Bai-Perron Test
- It uses a dynamic programming approach to efficiently identify the number and locations of breakpoints by minimizing the sum of squared residuals.
- Steps:
  - Let $y_t = X_t^T \beta_j + \epsilon_t$ be the linear regression model with $m$ breakpoints, where $\beta_j$ is the coefficient vector of segment $j$.
  - Compute $\sup F_T(\lambda_1, ..., \lambda_m)$, where the $\lambda_j = T_j/T$ are the breakpoint fractions and $F_T$ is a function of the sum of squared residuals for different partitions of the sample.
- A minimal sketch of the dynamic-programming search follows this list.
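statsmodels does not ship a full Bai-Perron implementation, so this is only a sketch of the dynamic-programming idea under stated assumptions: the helper names segment_ssr and bai_perron_dp are hypothetical, the minimum segment length h is a chosen constant, and the code reuses the simulated x and y from the cells above. Note it fits O(n^2) small OLS models, which is acceptable at this sample size.

def segment_ssr(x, y, i, j):
    # SSR of an OLS fit (intercept + slope) on observations i..j-1
    return sm.OLS(y[i:j], sm.add_constant(x[i:j])).fit().ssr

def bai_perron_dp(x, y, m, h=5):
    # Dynamic-programming search for the m breakpoints that minimize the
    # total sum of squared residuals, with minimum segment length h
    n = len(x)
    cost = {(i, j): segment_ssr(x, y, i, j)
            for i in range(n) for j in range(i + h, n + 1)}
    # best[k][j]: minimal SSR of splitting obs 0..j-1 using k breakpoints
    best = [dict() for _ in range(m + 1)]
    back = [dict() for _ in range(m + 1)]
    for j in range(h, n + 1):
        best[0][j] = cost[(0, j)]
    for k in range(1, m + 1):
        for j in range((k + 1) * h, n + 1):
            best[k][j], back[k][j] = min(
                (best[k - 1][t] + cost[(t, j)], t)
                for t in range(k * h, j - h + 1))
    # Backtrack to recover the breakpoint indices
    bps, j = [], n
    for k in range(m, 0, -1):
        j = back[k][j]
        bps.append(j)
    return sorted(bps)

bps = bai_perron_dp(x, y, m=1)
print("Estimated breakpoint(s) at x =", [float(x[b]) for b in bps])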
In [11]:
from statsmodels.stats.diagnostic import breaks_cusumolsresid
In [26]:
# Fit a single linear model to the full sample
X_combined = sm.add_constant(x)
model_combined = sm.OLS(y, X_combined).fit()
residuals = model_combined.resid

# Note: statsmodels does not implement the full Bai-Perron procedure.
# breaks_cusumolsresid runs a CUSUM test on the OLS residuals: it tests
# the null hypothesis of parameter stability (no structural break),
# but it does not estimate breakpoint locations.
sup_b, pval, crit = breaks_cusumolsresid(residuals, ddof=2)
print("CUSUM test statistic:", sup_b)
print("p-value:", pval)

# Visualize the residuals of the pooled fit; a systematic drift
# is evidence of a structural break
plt.figure(figsize=(10, 6))
plt.plot(x, residuals, label='Residuals')
plt.xlabel('x')
plt.ylabel('Residuals')
plt.title('CUSUM Test on OLS Residuals')
plt.legend()
plt.grid(True)
plt.show()
Linear Splines
- A linear spline with knots at $\xi_k$, $k=1,...,K$, is a piecewise linear polynomial that is continuous at each knot.
- It can be written as $$ y_i = \beta_0 + \beta_1b_1(x_i) + \beta_2b_2(x_i)+\dots+\beta_{K+1}b_{K+1}(x_i) + \epsilon_i $$ where the $b_k$ are basis functions, e.g. $b_1(x)=x$ and $b_{k+1}(x)=(x-\xi_k)_+$ for $k=1,...,K$. A minimal fitting sketch follows.
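A minimal sketch, reusing the simulated x and y from the cells above; the knot locations are hypothetical choices for illustration:

# Basis functions: b_1(x) = x, plus one hinge (x - knot)_+ per knot
knots = [2.5, 5.0, 7.5]
basis = [x] + [np.maximum(x - knot, 0) for knot in knots]
X_spline = sm.add_constant(np.column_stack(basis))

# OLS on the spline basis gives the continuous piecewise-linear fit
spline_fit = sm.OLS(y, X_spline).fit()
print(spline_fit.params)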
Smoothing Splines
- A smoothing spline is the function $g(x)$ that minimizes the following criterion: $$ \underset{g\in S}{\operatorname{minimize}} \sum_{i=1}^n(y_i-g(x_i))^2+\lambda \int g''(t)^2\,dt $$
- The first term is the RSS; it encourages $g(x)$ to match the data at each $x_i$.
- The second term penalizes the second derivative, which constrains the roughness of $g$.
- $\lambda$ is the tuning parameter: as $\lambda \to \infty$, the second derivative is forced to 0 everywhere, so $g$ becomes a linear function (no concavity or convexity). A sketch using scipy follows.
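A minimal sketch using scipy's make_smoothing_spline (available in scipy 1.10+, an assumption about the environment), whose lam argument plays the role of $\lambda$ above. Fresh data are generated because the function requires strictly increasing $x$:

import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import make_smoothing_spline

# Noisy sample from a smooth curve (x strictly increasing, as required)
rng = np.random.default_rng(0)
xs = np.linspace(0, 10, 200)
ys = np.sin(xs) + rng.normal(0, 0.3, xs.size)

# Larger lam penalizes curvature more heavily and pushes the fit
# toward a straight line
plt.figure(figsize=(10, 6))
for lam in [1e-3, 1.0, 1e3]:
    g = make_smoothing_spline(xs, ys, lam=lam)
    plt.plot(xs, g(xs), label=f'lambda = {lam:g}')
plt.scatter(xs, ys, s=8, alpha=0.4, label='data')
plt.legend()
plt.grid(True)
plt.show()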