Python Machine Learning – Linear Regression
Linear Regression is a simple yet powerful technique in machine learning and statistics that models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data. It’s primarily used for predicting continuous outcomes.
In linear regression:
- Dependent variable (target): The variable we want to predict.
- Independent variable (feature): The variable(s) we use to make the prediction.
1. Linear Regression Equation
The equation of a straight line in linear regression is represented as:
$$y = mx + b$$
Where:
- $y$ is the predicted value (dependent variable).
- $m$ is the slope of the line (how much $y$ changes for each unit increase in $x$).
- $x$ is the independent variable (feature).
- $b$ is the y-intercept (the value of $y$ when $x = 0$).
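As a worked illustration of where the slope m and intercept b come from, the sketch below computes them with the standard least-squares formulas in plain NumPy. It reuses the small dataset from the scikit-learn example later in this article; the variable names are illustrative only.

```python
import numpy as np

# Same values as the scikit-learn example in this article
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1, 2, 3, 3.5, 4.5])

# Least-squares slope: co-variation of x and y divided by the variation of x
x_mean, y_mean = x.mean(), y.mean()
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)

# The fitted line passes through the point of means, which fixes the intercept
b = y_mean - m * x_mean

print(f"slope m = {m:.2f}, intercept b = {b:.2f}")  # slope m = 0.85, intercept b = 0.25
```

Reading the result: each unit increase in x raises the predicted y by about 0.85, and the line crosses the y-axis at about 0.25.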
When there are multiple independent variables (features), the linear regression equation becomes:
$$y = b + m_1 x_1 + m_2 x_2 + \dots + m_n x_n$$
Where:
- $y$ is the dependent variable (target),
- $x_1, x_2, \dots, x_n$ are the independent variables (features),
- $b$ is the intercept,
- $m_1, m_2, \dots, m_n$ are the coefficients (slopes) for the respective features.
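With several features, the intercept and coefficients can be estimated together by solving one least-squares problem. The following is a minimal sketch using NumPy's `np.linalg.lstsq`, with the two-feature data from the multiple regression example later in this article; the names `A` and `params` are illustrative, and this is not necessarily how scikit-learn computes the fit internally.

```python
import numpy as np

# Two features per sample (same values as the multiple regression example)
X = np.array([[1, 1], [2, 1], [3, 2], [4, 3], [5, 3]], dtype=float)
y = np.array([1, 2, 3, 3.5, 4.5])

# Prepend a column of ones so the intercept is estimated like any other coefficient
A = np.column_stack([np.ones(len(X)), X])

# Solve min ||A @ params - y||^2; params holds [intercept, coef_1, coef_2]
params, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, coefs = params[0], params[1:]

# Predictions are the intercept plus the weighted sum of the features
y_pred = intercept + X @ coefs

print("Intercept:", intercept)
print("Coefficients:", coefs)
```

The column of ones is the usual trick for folding the intercept into the same matrix equation as the slopes.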
2. Simple Linear Regression in Python
The scikit-learn library is commonly used in Python for implementing linear regression models.
Step-by-Step Example:
Let’s create a simple linear regression model in Python using scikit-learn to predict a continuous outcome.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Example data: Feature (X) and target (Y)
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1) # sklearn expects a 2D array of shape (n_samples, n_features)
Y = np.array([1, 2, 3, 3.5, 4.5])
# Create a Linear Regression model
model = LinearRegression()
# Train the model
model.fit(X, Y)
# Make predictions
Y_pred = model.predict(X)
# Plot the data and the regression line
plt.scatter(X, Y, color='blue') # Scatter plot of actual data
plt.plot(X, Y_pred, color='red') # Line plot of predicted values
plt.title('Simple Linear Regression')
plt.xlabel('Feature X')
plt.ylabel('Target Y')
plt.show()
# Print the coefficients
print('Slope (m):', model.coef_[0])
print('Intercept (b):', model.intercept_)
In this example:
- We use a simple dataset with a single feature X and a target Y.
- model.fit(X, Y) trains the linear regression model on the data.
- model.predict(X) makes predictions using the trained model.
- The scatter plot shows the actual data points, and the red line represents the predicted values (regression line).
Output:
You will see a scatter plot of the data points with a straight line passing through them, representing the linear regression model. The slope and intercept of the line are also printed; for this data they are approximately 0.85 and 0.25.
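Once fitted, the model can also predict the target for inputs it has not seen. The sketch below refits the same data and predicts for two hypothetical new inputs (6 and 7.5, chosen here for illustration and not part of the original dataset). Both values lie outside the training range, so the predictions assume the linear trend continues.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
Y = np.array([1, 2, 3, 3.5, 4.5])
model = LinearRegression().fit(X, Y)

# Hypothetical unseen inputs; predict() expects shape (n_samples, n_features)
X_new = np.array([[6], [7.5]])
Y_new = model.predict(X_new)

# The same result from the line equation y = m*x + b
Y_manual = model.coef_[0] * X_new.ravel() + model.intercept_

print(Y_new)     # approximately [5.35, 6.625]
print(Y_manual)  # matches Y_new
```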
3. Evaluating the Model
To assess how well the linear regression model fits the data, we use evaluation metrics like:
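- Mean Squared Error (MSE): Measures the average squared difference between the actual and predicted values.
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of observations.
- R-squared (Coefficient of Determination): Represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s). For a least-squares fit with an intercept, evaluated on the data it was trained on, it ranges from 0 to 1, with 1 indicating a perfect fit. On new data it can be negative when the model predicts worse than simply using the mean.
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
Where $\bar{y}$ is the mean of the actual values.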
Example: Evaluating the Model
from sklearn.metrics import mean_squared_error, r2_score
# Calculate mean squared error
mse = mean_squared_error(Y, Y_pred)
# Calculate R-squared
r2 = r2_score(Y, Y_pred)
print('Mean Squared Error (MSE):', mse)
print('R-squared:', r2)
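To see what these two metrics compute, the sketch below reproduces them by hand with NumPy for the simple regression example. It assumes the fitted line for that data, which has slope 0.85 and intercept 0.25; the variable names are illustrative.

```python
import numpy as np

Y = np.array([1, 2, 3, 3.5, 4.5])

# Predictions from the fitted line y = 0.85x + 0.25 at x = 1..5
Y_pred = 0.85 * np.arange(1, 6) + 0.25

residuals = Y - Y_pred

# MSE: mean of the squared residuals
mse = np.mean(residuals ** 2)

# R-squared: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((Y - Y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MSE = {mse:.4f}, R-squared = {r2:.4f}")  # MSE = 0.0150, R-squared = 0.9897
```

These values should agree with what mean_squared_error and r2_score report for the same data.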
4. Multiple Linear Regression
Multiple linear regression is an extension of simple linear regression where there are two or more independent variables. The concept is the same, but we have multiple input features.
Example: Multiple Linear Regression
import numpy as np
from sklearn.linear_model import LinearRegression
# Example data: Two features and target
X = np.array([[1, 1], [2, 1], [3, 2], [4, 3], [5, 3]])
Y = np.array([1, 2, 3, 3.5, 4.5])
# Create a Linear Regression model
model = LinearRegression()
# Train the model
model.fit(X, Y)
# Make predictions
Y_pred = model.predict(X)
# Print the coefficients and intercept
print('Coefficients (m1, m2):', model.coef_)
print('Intercept (b):', model.intercept_)
In this example:
- X now contains two features.
- model.coef_ returns the slopes (coefficients) for each feature.
- The model still fits a line (or plane, in the case of two features) to the data.
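The fitted coefficients and intercept are all that predict() needs. As a small check, the sketch below predicts for one hypothetical new sample (x1 = 6, x2 = 4, invented here for illustration) and compares the result with the intercept plus the weighted sum of the features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 1], [2, 1], [3, 2], [4, 3], [5, 3]])
Y = np.array([1, 2, 3, 3.5, 4.5])
model = LinearRegression().fit(X, Y)

# Hypothetical new sample with x1 = 6 and x2 = 4 (not part of the original data)
X_new = np.array([[6, 4]])
pred_sklearn = model.predict(X_new)

# The same prediction written out as intercept + coef_1 * x1 + coef_2 * x2
pred_manual = model.intercept_ + X_new @ model.coef_

print(pred_sklearn, pred_manual)
```

Both printed values should match, which confirms that each coefficient in model.coef_ lines up with the corresponding column of X.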
5. Visualizing Multiple Linear Regression
Visualizing multiple linear regression is more challenging due to the higher dimensionality. However, for two features, you can plot a 3D surface to represent the model.
Example: 3D Plot for Multiple Linear Regression
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D # Registers the 3D projection; only required on Matplotlib versions before 3.2
from sklearn.linear_model import LinearRegression
# Example data
X = np.array([[1, 1], [2, 1], [3, 2], [4, 3], [5, 3]])
Y = np.array([1, 2, 3, 3.5, 4.5])
# Create a Linear Regression model
model = LinearRegression()
model.fit(X, Y)
# Make predictions
Y_pred = model.predict(X)
# Create a 3D plot
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
# Plot the actual data points
ax.scatter(X[:, 0], X[:, 1], Y, color='blue')
# Plot the predicted values (fitted plane)
ax.plot_trisurf(X[:, 0], X[:, 1], Y_pred, color='red', alpha=0.5)
# Labels and title
ax.set_xlabel('Feature X1')
ax.set_ylabel('Feature X2')
ax.set_zlabel('Target Y')
ax.set_title('Multiple Linear Regression')
plt.show()
This code generates a 3D scatter plot of the actual data points and a 3D surface representing the predicted values.
6. Assumptions of Linear Regression
Before applying linear regression, it’s important to ensure that the following assumptions are met:
- Linearity: There should be a linear relationship between the independent and dependent variables.
- Homoscedasticity: The variance of residuals (errors) should be constant across all levels of the independent variables.
- Independence: Observations should be independent of each other.
- Normality of Errors: The residuals should follow a normal distribution. This matters mainly for confidence intervals and hypothesis tests on the coefficients.
- No Multicollinearity: Independent variables should not be highly correlated with each other.
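Some of these assumptions can be checked numerically. The sketch below looks at multicollinearity for the two-feature data used earlier in this article by computing the correlation between the features and the variance inflation factor (VIF), which for two features reduces to 1 / (1 - r^2). It also prints the residuals of a least-squares fit for inspection. Treating a VIF of roughly 5 to 10 or more as a warning sign is a common rule of thumb, not a strict threshold, and five samples are far too few for reliable diagnostics.

```python
import numpy as np

X = np.array([[1, 1], [2, 1], [3, 2], [4, 3], [5, 3]], dtype=float)
Y = np.array([1, 2, 3, 3.5, 4.5])

# Multicollinearity: Pearson correlation between the two features
corr = np.corrcoef(X[:, 0], X[:, 1])[0, 1]

# With exactly two features, the VIF of each one is 1 / (1 - r^2)
vif = 1.0 / (1.0 - corr ** 2)

print(f"Feature correlation: {corr:.3f}")  # Feature correlation: 0.949
print(f"VIF: {vif:.2f}")                   # VIF: 10.00

# Residuals of a least-squares fit (column of ones added for the intercept)
A = np.column_stack([np.ones(len(X)), X])
params, *_ = np.linalg.lstsq(A, Y, rcond=None)
residuals = Y - A @ params
print("Residuals:", np.round(residuals, 3))
```

Here the two features are strongly correlated, so the individual coefficients from the multiple regression example should be interpreted with caution even though the predictions themselves may still be accurate.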
7. Applications of Linear Regression
Linear regression is widely used in various fields, including:
- Predictive modeling: To predict outcomes like house prices, sales, etc.
- Risk assessment: Estimating the risk factors for insurance and finance.
- Econometrics: Understanding relationships between economic indicators.
- Marketing: Predicting the effect of advertising spend on sales.
Conclusion
Linear regression is a fundamental machine learning algorithm used for regression tasks. It models the linear relationship between features and the target variable, providing an interpretable equation that can be used to make predictions. Whether it’s simple or multiple linear regression, Python’s scikit-learn makes it easy to implement and evaluate linear regression models.