8. Logistic Regression#
8.1. 1. Introduction to Classification#
In Linear Regression, we predict a continuous quantitative value (e.g., house price). However, in many real-world problems, we want to predict a category or class.
Examples:
Email: Is this email Spam or Not Spam?
Medical: Does this patient have Heart Disease or Not?
Finance: Will this customer Default on their loan or Not?
These are Classification problems. The response variable \(Y\) is qualitative (e.g., \(Y \in \{0, 1\}\)).
8.2. 2. Why not Linear Regression?#
You might be tempted to use Linear Regression for a binary outcome (0 or 1). However, this has major issues:
Unbounded Output: Linear regression can predict values like -0.5 or 1.2, which don’t make sense as probabilities.
Violates Assumptions: The errors are not normally distributed.
Instead, we model the Probability that \(Y\) belongs to a particular category.
8.3. 3. The Logistic Function#
To ensure our prediction falls between 0 and 1, we use the Logistic Function (Sigmoid function):
\[
p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}
\]
If \(\beta_0 + \beta_1 X\) is very large positive, \(p(X) \approx 1\).
If \(\beta_0 + \beta_1 X\) is very large negative, \(p(X) \approx 0\).
This creates an S-shaped curve rather than a straight line.
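The S-shaped behavior described above can be verified in a few lines of NumPy. The coefficients `beta0` and `beta1` below are made-up values for illustration, not fitted estimates:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients, chosen only to illustrate the shape
beta0, beta1 = -2.0, 1.5

for x in [-3.0, 0.0, 3.0]:
    p = sigmoid(beta0 + beta1 * x)
    print(f"x = {x:5.1f} -> p(X) = {p:.4f}")
```

No matter how extreme the input, the output stays strictly between 0 and 1, which is what makes it usable as a probability.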
8.3.1. Log-Odds (Logit)#
By rearranging the equation, we get a linear relationship with the log-odds:
\[
\log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1 X
\]
The quantity \(\frac{p(X)}{1-p(X)}\) is called the Odds (e.g., 4:1 odds means an 80% probability, since \(\frac{0.8}{0.2} = 4\)).
Increasing \(X\) by one unit changes the log-odds by \(\beta_1\) (equivalently, it multiplies the odds by \(e^{\beta_1}\)).
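As a quick numeric check of the odds arithmetic above (the probability 0.8 is just an example value):

```python
import numpy as np

p = 0.8                 # probability of class 1
odds = p / (1 - p)      # 0.8 / 0.2 = 4, i.e., 4:1 odds
log_odds = np.log(odds)

# Going back: probability recovered from the odds
p_back = odds / (1 + odds)

print(f"odds = {odds:.1f}, log-odds = {log_odds:.3f}, p recovered = {p_back:.1f}")
```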
8.4. 4. Estimating Coefficients (Maximum Likelihood)#
Unlike Linear Regression which uses Least Squares (minimizing error), Logistic Regression uses Maximum Likelihood Estimation (MLE).
Intuition: We search for \(\beta_0\) and \(\beta_1\) such that the predicted probabilities correspond as closely as possible to each individual's observed class.
If a person has the disease (\(Y=1\)), we want their \(p(X)\) to be close to 1.
If a person does not have the disease (\(Y=0\)), we want their \(p(X)\) to be close to 0.
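This intuition can be made concrete by evaluating the log-likelihood, which sums \(\log p(X)\) over the \(Y=1\) observations and \(\log(1-p(X))\) over the \(Y=0\) observations. The tiny dataset and candidate coefficients below are invented purely for illustration; MLE is the search for the coefficients that make this quantity as large as possible:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(beta0, beta1, X, y):
    """Sum of log p(X) where y=1 and log(1 - p(X)) where y=0."""
    p = sigmoid(beta0 + beta1 * X)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Tiny made-up dataset: positives tend to have larger X
X = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([0, 0, 1, 1, 1])

# Coefficients aligned with the data score higher than misaligned ones
print(log_likelihood(0.0, 2.0, X, y))   # slope in the right direction
print(log_likelihood(0.0, -2.0, X, y))  # slope in the wrong direction
```

The first (aligned) coefficient pair yields a higher log-likelihood; an optimizer exploits exactly this comparison to find the best fit.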
8.5. 5. Implementation Example#
We will use sklearn to implement Logistic Regression on a synthetic dataset.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# 1. Generate synthetic classification data
# 1000 samples, 1 feature for easy visualization
X, y = make_classification(n_samples=1000, n_features=1, n_informative=1,
                           n_redundant=0, n_clusters_per_class=1, random_state=42)

# 2. Split into Training and Testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Fit the Logistic Regression model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# 4. Make predictions
# predict_proba gives the probability of class 1 (e.g., 0.85)
y_prob = log_reg.predict_proba(X_test)[:, 1]
# predict gives the class label (e.g., 1)
y_pred = log_reg.predict(X_test)

# 5. Evaluation
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
```
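The fitted coefficients can also be read off the model to locate the decision boundary: since \(p(X) = 0.5\) exactly where \(\beta_0 + \beta_1 x = 0\), the boundary sits at \(x = -\beta_0 / \beta_1\). The sketch below refits the same synthetic data on its own so it runs standalone:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=1, n_informative=1,
                           n_redundant=0, n_clusters_per_class=1, random_state=42)
model = LogisticRegression().fit(X, y)

b0 = model.intercept_[0]
b1 = model.coef_[0, 0]

# p(X) = 0.5 exactly where beta0 + beta1 * x = 0
boundary = -b0 / b1
print(f"intercept = {b0:.3f}, coefficient = {b1:.3f}, boundary at x = {boundary:.3f}")

# Sanity check: predict_proba at the boundary should be ~0.5
p = model.predict_proba([[boundary]])[0, 1]
print(f"p(boundary) = {p:.3f}")
```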
8.5.1. Visualizing the S-Curve#
Below, we visualize how the logistic regression fits the data points (red) with a smooth sigmoid curve (blue). The decision boundary is at Probability = 0.5.
```python
plt.figure(figsize=(10, 6))

# Scatter plot of actual test data (0 or 1)
plt.scatter(X_test, y_test, color='red', alpha=0.3, label='Test Data Points')

# Generate a range of X values to plot the smooth curve
X_range = np.linspace(X.min(), X.max(), 300).reshape(-1, 1)
y_range_prob = log_reg.predict_proba(X_range)[:, 1]

# Plot the logistic function
plt.plot(X_range, y_range_prob, color='blue', linewidth=3, label='Logistic Sigmoid Curve')

# Decision boundary (0.5 probability)
plt.axhline(0.5, color='gray', linestyle='--', label='Decision Boundary (P=0.5)')

plt.xlabel('Feature Value')
plt.ylabel('Probability of Class 1')
plt.legend()
plt.title('Logistic Regression Fit')
plt.show()
```
8.6. 6. Quiz#
Test your understanding of Logistic Regression.
Q1. What is the range of output for the Logistic Function?
A) \((-\infty, +\infty)\)
B) \([0, 1]\)
C) \([-1, 1]\)
Q2. If the coefficient \(\beta_1\) is positive, what does it imply?
A) Increasing X increases the probability of \(Y=1\).
B) Increasing X decreases the probability of \(Y=1\).
C) X has no effect on Y.
Q3. Which method is used to estimate parameters in Logistic Regression?
A) Least Squares
B) Maximum Likelihood Estimation (MLE)
C) Gini Index
8.6.1. Sample Answers#
Q1: B) \([0, 1]\). Probabilities must always be between 0 and 1.
Q2: A). A positive coefficient means the log-odds (and thus the probability) increase as X increases.
Q3: B). MLE finds the parameters that maximize the probability of the observed data.