8. Logistic Regression

8.1. Introduction to Classification

In Linear Regression, we predict a continuous quantitative value (e.g., house price). However, in many real-world problems, we want to predict a category or class.

Examples:

  • Email: Is this email Spam or Not Spam?

  • Medical: Does this patient have Heart Disease or Not?

  • Finance: Will this customer Default on their loan or Not?

These are Classification problems. The response variable \(Y\) is qualitative (e.g., \(Y \in \{0, 1\}\)).

8.2. Why not Linear Regression?

You might be tempted to use Linear Regression for a binary outcome (0 or 1). However, this has major issues:

  1. Unbounded Output: Linear regression can predict values like -0.5 or 1.2, which don’t make sense as probabilities.

  2. Violates Assumptions: The errors are not normally distributed.

Instead, we model the Probability that \(Y\) belongs to a particular category.
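The unbounded-output problem is easy to demonstrate. Below is a minimal sketch (with a small made-up dataset) that fits ordinary linear regression to a 0/1 outcome and evaluates the line at extreme feature values, where its predictions escape the \([0, 1]\) range:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy 1-D feature with a binary outcome (illustrative data only)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

lin_reg = LinearRegression().fit(X, y)

# Evaluate the fitted straight line at extreme feature values:
# the predictions fall below 0 and above 1, so they cannot be probabilities
preds = lin_reg.predict(np.array([[0.0], [7.0]]))
print(preds)
```

On this toy data the fitted line predicts roughly -0.4 at \(X=0\) and 1.4 at \(X=7\), exactly the kind of nonsensical "probabilities" described above.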

8.3. The Logistic Function

To ensure our prediction falls between 0 and 1, we use the Logistic Function (Sigmoid function).

\[ p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \]
  • If \(\beta_0 + \beta_1 X\) is very large positive, \(p(X) \approx 1\).

  • If \(\beta_0 + \beta_1 X\) is very large negative, \(p(X) \approx 0\).

This creates an S-shaped curve rather than a straight line.
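The logistic function is short enough to code directly. This sketch uses the algebraically equivalent form \(1/(1 + e^{-z})\) with \(z = \beta_0 + \beta_1 X\); the coefficient values below are arbitrary choices for illustration, not estimates from data:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary illustrative coefficients
beta0, beta1 = -3.0, 1.5

# Large negative z gives p near 0; large positive z gives p near 1
for x in [-4, 0, 2, 8]:
    z = beta0 + beta1 * x
    print(f"x = {x:>2}  ->  p(X) = {sigmoid(z):.4f}")
```

Note that \(z = 0\) (here at \(x = 2\)) gives exactly \(p(X) = 0.5\), the midpoint of the S-curve.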

8.3.1. Log-Odds (Logit)

By rearranging the equation, we get a linear relationship with the log-odds:

\[ \log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1 X \]
  • The quantity \(\frac{p(X)}{1-p(X)}\) is called the Odds (e.g., 4:1 odds means 80% probability).

  • Increasing \(X\) by one unit changes the log-odds by \(\beta_1\), which is equivalent to multiplying the odds by \(e^{\beta_1}\).
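The probability-odds conversion above can be checked numerically. This sketch verifies that a probability of 0.8 corresponds to 4:1 odds, and that inverting the transformation recovers the probability:

```python
import numpy as np

p = 0.8
odds = p / (1 - p)        # 4:1 odds, as in the example above
log_odds = np.log(odds)   # the logit: this is what is linear in X

# Inverting the transformation recovers the original probability
p_back = odds / (1 + odds)

print(f"odds = {odds:.2f}, log-odds = {log_odds:.4f}, p recovered = {p_back:.2f}")
```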

8.4. Estimating Coefficients (Maximum Likelihood)

Unlike Linear Regression which uses Least Squares (minimizing error), Logistic Regression uses Maximum Likelihood Estimation (MLE).

Intuition: We search for \(\beta_0\) and \(\beta_1\) such that the predicted probabilities correspond as closely as possible to each individual's observed outcome.

  • If a person has the disease (\(Y=1\)), we want their \(p(X)\) to be close to 1.

  • If a person does not have the disease (\(Y=0\)), we want their \(p(X)\) to be close to 0.
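This intuition can be made concrete with the Bernoulli log-likelihood, the quantity MLE maximizes. Below is a minimal sketch on a made-up toy dataset: a slope pointing in the right direction scores a higher log-likelihood than a "coin-flip" model with both coefficients set to zero. (In practice the maximization is done numerically by the fitting routine, not by hand.)

```python
import numpy as np

def log_likelihood(beta0, beta1, X, y):
    """Bernoulli log-likelihood: sum of log p(X) for Y=1 cases
    and log(1 - p(X)) for Y=0 cases."""
    p = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * X)))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy data: larger X tends to go with y = 1
X = np.array([0.5, 1.0, 1.5, 3.0, 3.5, 4.0])
y = np.array([0, 0, 0, 1, 1, 1])

ll_fit = log_likelihood(-4.0, 2.0, X, y)   # coefficients aligned with the data
ll_flat = log_likelihood(0.0, 0.0, X, y)   # p(X) = 0.5 for everyone

print(f"log-likelihood (aligned slope): {ll_fit:.3f}")
print(f"log-likelihood (coin flip):     {ll_flat:.3f}")
```

The flat model's log-likelihood is \(6 \log(0.5) \approx -4.16\); the aligned coefficients do markedly better because each predicted probability sits close to its observed outcome.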

8.5. Implementation Example

We will use sklearn to implement Logistic Regression on a synthetic dataset.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# 1. Generate synthetic classification data
# 1000 samples, 1 feature for easy visualization
X, y = make_classification(n_samples=1000, n_features=1, n_informative=1, 
                           n_redundant=0, n_clusters_per_class=1, random_state=42)

# 2. Split into Training and Testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Fit the Logistic Regression Model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# 4. Making Predictions
# predict_proba gives the probability (e.g., 0.85)
y_prob = log_reg.predict_proba(X_test)[:, 1]
# predict gives the class label (e.g., 1)
y_pred = log_reg.predict(X_test)

# 5. Evaluation
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
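The fitted model's `intercept_` and `coef_` attributes are the estimated \(\beta_0\) and \(\beta_1\) of the log-odds equation. The sketch below refits the model on the same synthetic data (regenerated here so the snippet is self-contained) and reports \(e^{\beta_1}\), the factor by which a one-unit increase in the feature multiplies the odds:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Regenerate the same synthetic data for a self-contained example
X, y = make_classification(n_samples=1000, n_features=1, n_informative=1,
                           n_redundant=0, n_clusters_per_class=1, random_state=42)
log_reg = LogisticRegression().fit(X, y)

beta0 = log_reg.intercept_[0]   # estimated beta_0
beta1 = log_reg.coef_[0, 0]     # estimated beta_1

print(f"Intercept (beta0):   {beta0:.3f}")
print(f"Coefficient (beta1): {beta1:.3f}")
print(f"Odds ratio per unit of X: {np.exp(beta1):.3f}")
```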

8.5.1. Visualizing the S-Curve

Below, we visualize how the logistic regression fits the data points (red) with a smooth sigmoid curve (blue). The decision boundary is at Probability = 0.5.

plt.figure(figsize=(10, 6))

# Scatter plot of actual test data (0 or 1)
plt.scatter(X_test, y_test, color='red', alpha=0.3, label='Test Data Points')

# Generate a range of X values to plot the smooth curve
X_range = np.linspace(X.min(), X.max(), 300).reshape(-1, 1)
y_range_prob = log_reg.predict_proba(X_range)[:, 1]

# Plot the Logistic Function
plt.plot(X_range, y_range_prob, color='blue', linewidth=3, label='Logistic Sigmoid Curve')

# Decision Boundary (0.5 probability)
plt.axhline(0.5, color='gray', linestyle='--', label='Decision Boundary (P=0.5)')

plt.xlabel('Feature Value')
plt.ylabel('Probability of Class 1')
plt.legend()
plt.title('Logistic Regression Fit')
plt.show()

8.6. Quiz

Test your understanding of Logistic Regression.

Q1. What is the range of output for the Logistic Function?

  • A) \((-\infty, +\infty)\)

  • B) \([0, 1]\)

  • C) \([-1, 1]\)

Q2. If the coefficient \(\beta_1\) is positive, what does it imply?

  • A) Increasing X increases the probability of \(Y=1\).

  • B) Increasing X decreases the probability of \(Y=1\).

  • C) X has no effect on Y.

Q3. Which method is used to estimate parameters in Logistic Regression?

  • A) Least Squares

  • B) Maximum Likelihood Estimation (MLE)

  • C) Gini Index


8.6.1. Sample Answers

Q1: B) \([0, 1]\). Probabilities must always be between 0 and 1.

Q2: A). A positive coefficient means the log-odds (and thus probability) increase as X increases.

Q3: B). MLE is used to find the parameters that maximize the probability of the observed data.