Applying Multiple Linear Regression with Python
Linear regression is a statistical tool used in predictive analysis: it fits a linear equation that models the relationship between a dependent variable and one or more independent variables. When the model includes multiple independent variables, it is termed Multiple Linear Regression. This approach helps in understanding how several features jointly influence the outcome.
Process of Multiple Linear Regression
The methodology for conducting multiple linear regression mirrors that of simple linear regression, with the main distinction lying in the evaluation phase. This technique is instrumental in identifying the most influential factors on the predicted result and analyzing the interrelation of different variables. The equation representing multiple linear regression is:
[ y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n ]
Where:
- ( y ) represents the dependent variable
- ( X_1, X_2, \cdots, X_n ) denote the independent variables
- ( \beta_0 ) is the intercept
- ( \beta_1, \beta_2, \cdots, \beta_n ) are the slopes
The aim is to find the best-fit equation (a line with one predictor, a plane or hyperplane with several) that predicts values from the independent variables. A regression model learns from a dataset with known ( X ) and ( y ) values and applies this knowledge to predict ( y ) for new, unseen ( X ).
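The prediction equation above can be evaluated directly with NumPy. The intercept and slope values below are hypothetical, chosen only to illustrate how a row of features maps to a prediction:

```python
import numpy as np

# Hypothetical fitted parameters for a model with two features
beta_0 = 1.0                    # intercept
betas = np.array([0.5, -0.2])   # slopes for X1 and X2

# Two observations, each with values for X1 and X2
X = np.array([[2.0, 3.0],
              [4.0, 1.0]])

# y = beta_0 + beta_1*X1 + beta_2*X2, computed for every row at once
y_pred = beta_0 + X @ betas
print(y_pred)  # [1.4 2.8]
```

The matrix product `X @ betas` applies every slope to its feature simultaneously, which is exactly what `LinearRegression.predict` does internally.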
Handling Categorical Data with Dummy Variables
In multiple regression models, categorical data such as gender or location might be encountered. Since regression models require numerical inputs, categorical data must be converted into a usable format. Dummy variables, which are binary (0 or 1), are employed for this transformation. For example:
- Male: 1 if male, 0 otherwise
- Female: 1 if female, 0 otherwise
In situations involving multiple categories, a dummy variable is created for each category, a process known as one-hot encoding. One category is conventionally dropped so that the remaining columns are not perfectly collinear with the intercept, a pitfall known as the dummy variable trap.
Multicollinearity in Multiple Linear Regression
Multicollinearity occurs when two or more independent variables are highly correlated, complicating the determination of each variable's individual contribution to the dependent variable.
Methods to detect multicollinearity include:
- Correlation Matrix: This tool helps identify relationships between independent variables; strong correlations suggest multicollinearity.
- Variance Inflation Factor (VIF): VIF measures the increase in variance of a regression coefficient when predictors are correlated. A VIF above 10 often indicates multicollinearity.
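The VIF definition can be sketched directly with scikit-learn: for each predictor, regress it on the remaining predictors and compute VIF = 1 / (1 - R²). The `vif` helper and the synthetic columns below are illustrative, not a library API (statsmodels ships a ready-made `variance_inflation_factor` for the same purpose):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(scale=0.1, size=200)  # nearly collinear with x1
x3 = rng.normal(size=200)                        # independent of the others
X = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})

def vif(df, col):
    """VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j on all remaining columns."""
    others = df.drop(columns=[col])
    r2 = LinearRegression().fit(others, df[col]).score(others, df[col])
    return 1.0 / (1.0 - r2)

for col in X.columns:
    print(f"{col}: VIF = {vif(X, col):.1f}")
```

The collinear pair x1/x2 produces VIF values far above 10, while the independent x3 stays near 1, matching the rule of thumb above.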
Assumptions of Multiple Regression Model
Similar to simple linear regression, several assumptions apply to multiple linear regression:
- Linearity: The relationship between dependent and independent variables should be linear.
- Homoscedasticity: The variance of errors must be constant across all levels of independent variables.
- Multivariate Normality: Residuals should be normally distributed.
- No Multicollinearity: Independent variables should not be highly correlated.
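The first two assumptions are commonly checked by inspecting residuals after fitting. The snippet below is a minimal sketch on synthetic data (the coefficients 2.0, 1.5, and -0.7 are made up): with an intercept, ordinary least squares residuals average to zero, and their spread should stay roughly constant across fitted values if homoscedasticity holds.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(300, 2))
# Truly linear relationship with constant-variance noise
y = 2.0 + 1.5 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(scale=0.5, size=300)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Residuals should center on zero with spread close to the noise scale
print(f"mean residual: {residuals.mean():.4f}")
print(f"residual std:  {residuals.std():.4f}")
```

In practice one would also plot residuals against fitted values: a funnel shape suggests heteroscedasticity, and curvature suggests a violated linearity assumption.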
Implementing a Multiple Linear Regression Model
For demonstration, the California Housing dataset, which includes features like median income and average rooms, is used to predict house prices.
Step 1: Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing
Step 2: Loading Dataset
Load the California Housing data, storing features like median income and average rooms in X, and house prices in y.
california_housing = fetch_california_housing()
X = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)
y = pd.Series(california_housing.target)
Step 3: Selecting Features for Visualization
Choose two features, MedInc (median income) and AveRooms (average rooms), for simplified two-dimensional visualization.
X = X[['MedInc', 'AveRooms']]
Step 4: Train-Test Split
Utilize 80% of the data for training and 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Step 5: Initializing and Training Model
Create and train a multiple linear regression model using scikit-learn's LinearRegression.
model = LinearRegression()
model.fit(X_train, y_train)
Step 6: Finding Intercept and Slopes
After training, access the regression equation's intercept and coefficients.
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)
Output:
- Intercept: 0.5972677793933272
- Coefficients: [0.43626089, -0.04017161]
Step 7: Making Predictions
Use the trained model to predict house prices on the test data.
y_pred = model.predict(X_test)
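The evaluation phase mentioned earlier can be carried out with scikit-learn's metrics. The arrays below are small stand-ins for the tutorial's y_test and y_pred, used so the snippet runs on its own; in the tutorial you would pass the real test labels and predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Stand-ins for the tutorial's y_test and y_pred (illustrative values)
y_test = np.array([3.0, 1.5, 2.2, 4.1])
y_pred = np.array([2.8, 1.7, 2.0, 4.3])

mse = mean_squared_error(y_test, y_pred)  # average squared error
r2 = r2_score(y_test, y_pred)             # share of variance explained
print(f"MSE: {mse:.3f}  R^2: {r2:.3f}")
```

An R² near 1 means the features explain most of the variation in house prices; the MSE is in squared target units, so its square root is often easier to interpret.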
Step 8: Visualizing the Best-Fit Plane in 3D
Plot a 3D graph showing blue points for actual house prices based on MedInc and AveRooms, and a red surface for the best-fit plane predicted by the model.
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_test['MedInc'], X_test['AveRooms'], y_test, color='blue', label='Actual Data')
x1_range = np.linspace(X_test['MedInc'].min(), X_test['MedInc'].max(), 100)
x2_range = np.linspace(X_test['AveRooms'].min(), X_test['AveRooms'].max(), 100)
x1, x2 = np.meshgrid(x1_range, x2_range)
grid = pd.DataFrame(np.c_[x1.ravel(), x2.ravel()], columns=['MedInc', 'AveRooms'])  # match training feature names
z = model.predict(grid).reshape(x1.shape)
ax.plot_surface(x1, x2, z, color='red', alpha=0.5, rstride=100, cstride=100)
ax.set_xlabel('Median Income')
ax.set_ylabel('Average Rooms')
ax.set_zlabel('House Price')
ax.set_title('Multiple Linear Regression Best-Fit Plane (3D)')
plt.show()
Multiple Linear Regression effectively demonstrates how several factors collectively influence a target variable, offering a practical approach for predictive modeling in real-world applications.