Applying Multiple Linear Regression with Python
Linear regression is a statistical tool used in predictive analysis: it fits a linear equation that models the relationship between a dependent variable and one or more independent variables. When the model includes multiple independent variables, it is termed Multiple Linear Regression. This approach helps in understanding how several features jointly influence the outcome.
Process of Multiple Linear Regression
The methodology for conducting multiple linear regression mirrors that of simple linear regression, with the main distinction lying in the evaluation phase. This technique is instrumental in identifying the most influential factors on the predicted result and analyzing the interrelation of different variables. The equation representing multiple linear regression is:
[ y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n ]
Where:
- ( y ) represents the dependent variable
- ( X_1, X_2, \cdots, X_n ) denote the independent variables
- ( \beta_0 ) is the intercept
- ( \beta_1, \beta_2, \cdots, \beta_n ) are the slopes
The aim is to find the best-fit equation (a line with one predictor, a plane or hyperplane with several) that predicts values from the independent variables. A regression model learns from a dataset with known ( X ) and ( y ) values and applies this knowledge to predict ( y ) for new, unseen ( X ).
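The prediction equation above can be evaluated directly with NumPy. The intercept and slope values below are hypothetical, chosen only to illustrate how a row of features maps to a prediction:

```python
import numpy as np

# Hypothetical fitted parameters for a model with two features
beta_0 = 1.0                    # intercept
betas = np.array([0.5, -0.2])   # slopes for X1 and X2

# Two observations, each with values for X1 and X2
X = np.array([[2.0, 3.0],
              [4.0, 1.0]])

# y = beta_0 + beta_1*X1 + beta_2*X2, computed for every row at once
y_pred = beta_0 + X @ betas
print(y_pred)  # [1.4 2.8]
```

The matrix product `X @ betas` applies every slope to its feature simultaneously, which is exactly what `LinearRegression.predict` does internally.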
Handling Categorical Data with Dummy Variables
In multiple regression models, categorical data such as gender or location might be encountered. Since regression models require numerical inputs, categorical data must be converted into a usable format. Dummy variables, which are binary (0 or 1), are employed for this transformation. For example:
- Male: 1 if male, 0 otherwise
- Female: 1 if female, 0 otherwise
In situations involving multiple categories, a dummy variable is created for each category, a process known as one-hot encoding. One category is conventionally dropped so that the remaining columns are not perfectly collinear with the intercept, a pitfall known as the dummy variable trap.
Multicollinearity in Multiple Linear Regression
Multicollinearity occurs when two or more independent variables are highly correlated, complicating the determination of each variable's individual contribution to the dependent variable.
Methods to detect multicollinearity include:
- Correlation Matrix: This tool helps identify relationships between independent variables; strong correlations suggest multicollinearity.
- Variance Inflation Factor (VIF): VIF measures the increase in variance of a regression coefficient when predictors are correlated. A VIF above 10 often indicates multicollinearity.
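The VIF definition can be sketched directly with scikit-learn: for each predictor, regress it on the remaining predictors and compute VIF = 1 / (1 - R²). The `vif` helper and the synthetic columns below are illustrative, not a library API (statsmodels ships a ready-made `variance_inflation_factor` for the same purpose):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(scale=0.1, size=200)  # nearly collinear with x1
x3 = rng.normal(size=200)                        # independent of the others
X = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})

def vif(df, col):
    """VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j on all remaining columns."""
    others = df.drop(columns=[col])
    r2 = LinearRegression().fit(others, df[col]).score(others, df[col])
    return 1.0 / (1.0 - r2)

for col in X.columns:
    print(f"{col}: VIF = {vif(X, col):.1f}")
```

The collinear pair x1/x2 produces VIF values far above 10, while the independent x3 stays near 1, matching the rule of thumb above.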
Assumptions of Multiple Regression Model
Similar to simple linear regression, several assumptions apply to multiple linear regression:
- Linearity: The relationship between dependent and independent variables should be linear.
- Homoscedasticity: The variance of errors must be constant across all levels of independent variables.
- Multivariate Normality: Residuals should be normally distributed.
- No Multicollinearity: Independent variables should not be highly correlated.
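The first two assumptions are commonly checked by inspecting residuals after fitting. The snippet below is a minimal sketch on synthetic data (the coefficients 2.0, 1.5, and -0.7 are made up): with an intercept, ordinary least squares residuals average to zero, and their spread should stay roughly constant across fitted values if homoscedasticity holds.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(300, 2))
# Truly linear relationship with constant-variance noise
y = 2.0 + 1.5 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(scale=0.5, size=300)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Residuals should center on zero with spread close to the noise scale
print(f"mean residual: {residuals.mean():.4f}")
print(f"residual std:  {residuals.std():.4f}")
```

In practice one would also plot residuals against fitted values: a funnel shape suggests heteroscedasticity, and curvature suggests a violated linearity assumption.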
Implementing a Multiple Linear Regression Model
For demonstration, the California Housing dataset, which includes features like median income and average rooms, is used to predict house prices.
Step 1: Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing
Step 2: Loading Dataset
Load the California Housing data, storing features like median income and average rooms in X, and house prices in y.
california_housing = fetch_california_housing()
X = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)
y = pd.Series(california_housing.target)
Step 3: Selecting Features for Visualization
Choose two features, MedInc (median income) and AveRooms (average rooms), for simplified two-dimensional visualization.
X = X[['MedInc', 'AveRooms']]
Step 4: Train-Test Split
Utilize 80% of the data for training and 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Step 5: Initializing and Training Model
Create and train a multiple linear regression model using scikit-learn's LinearRegression.
model = LinearRegression()
model.fit(X_train, y_train)
Step 6: Finding Intercept and Slopes
After training, access the regression equation's intercept and coefficients.
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)
Output:
- Intercept: 0.5972677793933272
- Coefficients: [0.43626089, -0.04017161]
Step 7: Making Predictions
Use the trained model to predict house prices on the test data.
y_pred = model.predict(X_test)
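The evaluation phase mentioned earlier can be carried out with scikit-learn's metrics. The arrays below are small stand-ins for the tutorial's y_test and y_pred, used so the snippet runs on its own; in the tutorial you would pass the real test labels and predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Stand-ins for the tutorial's y_test and y_pred (illustrative values)
y_test = np.array([3.0, 1.5, 2.2, 4.1])
y_pred = np.array([2.8, 1.7, 2.0, 4.3])

mse = mean_squared_error(y_test, y_pred)  # average squared error
r2 = r2_score(y_test, y_pred)             # share of variance explained
print(f"MSE: {mse:.3f}  R^2: {r2:.3f}")
```

An R² near 1 means the features explain most of the variation in house prices; the MSE is in squared target units, so its square root is often easier to interpret.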
Step 8: Visualizing the Best-Fit Plane in 3D
Plot a 3D graph showing blue points for actual house prices based on MedInc and AveRooms, and a red surface for the best-fit plane predicted by the model.
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_test['MedInc'], X_test['AveRooms'], y_test, color='blue', label='Actual Data')
x1_range = np.linspace(X_test['MedInc'].min(), X_test['MedInc'].max(), 100)
x2_range = np.linspace(X_test['AveRooms'].min(), X_test['AveRooms'].max(), 100)
x1, x2 = np.meshgrid(x1_range, x2_range)
grid = pd.DataFrame(np.c_[x1.ravel(), x2.ravel()], columns=['MedInc', 'AveRooms'])  # match training feature names
z = model.predict(grid).reshape(x1.shape)
ax.plot_surface(x1, x2, z, color='red', alpha=0.5, rstride=100, cstride=100)
ax.set_xlabel('Median Income')
ax.set_ylabel('Average Rooms')
ax.set_zlabel('House Price')
ax.set_title('Multiple Linear Regression Best-Fit Plane (3D)')
plt.show()
Multiple Linear Regression effectively demonstrates how several factors collectively influence a target variable, offering a practical approach for predictive modeling in real-world applications.