Python Machine Learning – Linear Regression

Linear regression is a simple yet powerful technique in machine learning and statistics that models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data. It is primarily used for predicting continuous outcomes.

In linear regression:

  • Dependent variable (target): The variable we want to predict.
  • Independent variable (feature): The variable(s) we use to make the prediction.

1. Linear Regression Equation

The equation of a straight line in linear regression is represented as:

$$y = mx + b$$

Where:

  • $y$ is the predicted value (dependent variable).
  • $m$ is the slope of the line (how much $y$ changes for each unit increase in $x$).
  • $x$ is the independent variable (feature).
  • $b$ is the y-intercept (the value of $y$ when $x = 0$).

When there are multiple independent variables (features), the linear regression equation becomes:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

Where:

  • $y$ is the dependent variable (target),
  • $x_1, x_2, \dots, x_n$ are the independent variables (features),
  • $\beta_0$ is the intercept,
  • $\beta_1, \beta_2, \dots, \beta_n$ are the coefficients (slopes) for the respective features.
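As a minimal sketch of how this equation is evaluated, the weighted sum can be written as a dot product with NumPy. The intercept, coefficients, and feature values below are hypothetical numbers chosen only for illustration; they do not come from a fitted model.

```python
import numpy as np

# Hypothetical values chosen only for illustration
beta_0 = 0.5                          # intercept
betas = np.array([2.0, -1.0, 0.25])   # coefficients beta_1 .. beta_3
x = np.array([1.0, 3.0, 4.0])         # one observation with three features

# y = beta_0 + beta_1*x_1 + beta_2*x_2 + beta_3*x_3
y = beta_0 + np.dot(betas, x)
print(y)  # 0.5
```

Here the three terms contribute 2.0, -3.0, and 1.0, which cancel out, so the prediction equals the intercept.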

2. Simple Linear Regression in Python

The scikit-learn library is commonly used in Python for implementing linear regression models.

Step-by-Step Example:

Let’s create a simple linear regression model in Python using scikit-learn to predict a continuous outcome.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Example data: Feature (X) and target (Y)
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)  # Reshape needed for sklearn
Y = np.array([1, 2, 3, 3.5, 4.5])

# Create a Linear Regression model
model = LinearRegression()

# Train the model
model.fit(X, Y)

# Make predictions
Y_pred = model.predict(X)

# Plot the data and the regression line
plt.scatter(X, Y, color='blue')  # Scatter plot of actual data
plt.plot(X, Y_pred, color='red')  # Line plot of predicted values
plt.title('Simple Linear Regression')
plt.xlabel('Feature X')
plt.ylabel('Target Y')
plt.show()

# Print the coefficients
print('Slope (m):', model.coef_[0])
print('Intercept (b):', model.intercept_)

In this example:

  • We use a simple dataset with a single feature X and a target Y.
  • model.fit(X, Y) trains the linear regression model on the data.
  • model.predict(X) makes predictions using the trained model.
  • The scatter plot shows the actual data points, and the red line represents the predicted values (regression line).

Output:

You will see a scatter plot of the data points with a straight line passing close to them, representing the linear regression model. The slope and intercept of the line will also be printed; for this data they are approximately 0.85 and 0.25.
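The slope and intercept from the example can be cross-checked by hand. The sketch below uses the standard closed-form least squares formulas for a single feature, written with NumPy only; the variable names and the unseen value x = 6 are illustrative choices, not part of the original example.

```python
import numpy as np

# Same data as the scikit-learn example
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1, 2, 3, 3.5, 4.5])

# Closed-form least squares estimates for a single feature
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

print(f"Slope (m): {slope:.2f}")          # 0.85
print(f"Intercept (b): {intercept:.2f}")  # 0.25

# Predict the target for an unseen feature value (x = 6 is an arbitrary choice)
x_new = 6.0
y_new = intercept + slope * x_new
print(f"Prediction at x = {x_new}: {y_new:.2f}")  # 5.35
```

With scikit-learn, the equivalent prediction would come from model.predict on a 2D array such as [[6]].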

3. Evaluating the Model

To assess how well the linear regression model fits the data, we use evaluation metrics like:

  • Mean Squared Error (MSE): Measures the average squared difference between the actual and predicted values.

    $$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

    Where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value.

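  • R-squared (Coefficient of Determination): Represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s). For a least squares fit with an intercept, evaluated on the data it was trained on, it ranges from 0 to 1, with 1 indicating a perfect fit. On other data it can be negative when the model predicts worse than simply using the mean.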

    $$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

    Where $\bar{y}$ is the mean of the actual values.

Example: Evaluating the Model

from sklearn.metrics import mean_squared_error, r2_score

# Calculate mean squared error
mse = mean_squared_error(Y, Y_pred)

# Calculate R-squared
r2 = r2_score(Y, Y_pred)

print('Mean Squared Error (MSE):', mse)
print('R-squared:', r2)
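To connect the formulas to the library calls, here is a sketch that computes both metrics directly from their definitions and compares them with scikit-learn's results. It refits the simple model from section 2 so that it runs on its own; the names mse_manual, r2_manual, ss_res, and ss_tot are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Same data and model as the simple linear regression example
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
Y = np.array([1, 2, 3, 3.5, 4.5])
Y_pred = LinearRegression().fit(X, Y).predict(X)

# MSE: mean of the squared residuals
mse_manual = np.mean((Y - Y_pred) ** 2)

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((Y - Y_pred) ** 2)
ss_tot = np.sum((Y - Y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MSE: {mse_manual:.4f}")       # 0.0150
print(f"R-squared: {r2_manual:.4f}")  # 0.9897

# Both values should agree with scikit-learn's implementations
print(np.isclose(mse_manual, mean_squared_error(Y, Y_pred)))
print(np.isclose(r2_manual, r2_score(Y, Y_pred)))
```

Note that these scores are computed on the same data the model was trained on, so they describe the fit rather than performance on new data.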

4. Multiple Linear Regression

Multiple linear regression is an extension of simple linear regression where there are two or more independent variables. The concept is the same, but we have multiple input features.

Example: Multiple Linear Regression

import numpy as np
from sklearn.linear_model import LinearRegression

# Example data: Two features and target
X = np.array([[1, 1], [2, 1], [3, 2], [4, 3], [5, 3]])
Y = np.array([1, 2, 3, 3.5, 4.5])

# Create a Linear Regression model
model = LinearRegression()

# Train the model
model.fit(X, Y)

# Make predictions
Y_pred = model.predict(X)

# Print the coefficients and intercept
print('Coefficients (m1, m2):', model.coef_)
print('Intercept (b):', model.intercept_)

In this example:

  • X now contains two features.
  • model.coef_ returns the slopes (coefficients) for each feature.
  • With two features the model fits a plane rather than a line; with more than two features it fits a hyperplane.
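As a sketch of how the fitted coefficients map back to the regression equation, the snippet below predicts the target for one new observation in two ways: with model.predict and by evaluating the intercept plus the weighted sum of the features. The new observation (x1 = 6, x2 = 4) is a hypothetical input chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same data as the multiple linear regression example
X = np.array([[1, 1], [2, 1], [3, 2], [4, 3], [5, 3]])
Y = np.array([1, 2, 3, 3.5, 4.5])
model = LinearRegression().fit(X, Y)

# Hypothetical new observation: x1 = 6, x2 = 4
x_new = np.array([[6, 4]])

# Prediction from scikit-learn
y_sklearn = model.predict(x_new)[0]

# Same prediction from the equation y = b0 + b1*x1 + b2*x2
y_manual = model.intercept_ + np.dot(model.coef_, x_new[0])

print("model.predict:", y_sklearn)
print("Equation:     ", y_manual)
```

The two printed values should match up to floating-point rounding, which shows that model.coef_ and model.intercept_ fully describe the fitted model.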

5. Visualizing Multiple Linear Regression

Visualizing multiple linear regression is more challenging due to the higher dimensionality. However, for two features, you can plot a 3D surface to represent the model.

Example: 3D Plot for Multiple Linear Regression

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # Only needed on Matplotlib < 3.2 to register the '3d' projection
from sklearn.linear_model import LinearRegression

# Example data
X = np.array([[1, 1], [2, 1], [3, 2], [4, 3], [5, 3]])
Y = np.array([1, 2, 3, 3.5, 4.5])

# Create a Linear Regression model
model = LinearRegression()
model.fit(X, Y)

# Make predictions
Y_pred = model.predict(X)

# Create a 3D plot
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# Plot the actual data points
ax.scatter(X[:, 0], X[:, 1], Y, color='blue')

# Plot the predicted values (fitted plane)
ax.plot_trisurf(X[:, 0], X[:, 1], Y_pred, color='red', alpha=0.5)

# Labels and title
ax.set_xlabel('Feature X1')
ax.set_ylabel('Feature X2')
ax.set_zlabel('Target Y')
ax.set_title('Multiple Linear Regression')

plt.show()

This code generates a 3D scatter plot of the actual data points and a 3D surface representing the predicted values.

6. Assumptions of Linear Regression

Before applying linear regression, it’s important to ensure that the following assumptions are met:

  • Linearity: There should be a linear relationship between the independent and dependent variables.
  • Homoscedasticity: The variance of residuals (errors) should be constant across all levels of the independent variables.
  • Independence: Observations should be independent of each other.
  • Normality of Errors: The residuals should follow a normal distribution.
  • No Multicollinearity: Independent variables should not be highly correlated with each other.
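Some of these assumptions can be checked numerically. The following is a rough sketch, not a complete diagnostic procedure: it computes the residuals of the two-feature model from section 4 and the correlation between its two features. The five-point dataset is far too small to draw real conclusions, and what counts as "highly correlated" is a judgment call.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same data as the multiple linear regression example
X = np.array([[1, 1], [2, 1], [3, 2], [4, 3], [5, 3]])
Y = np.array([1, 2, 3, 3.5, 4.5])
model = LinearRegression().fit(X, Y)

# Residuals: actual minus predicted values
Y_pred = model.predict(X)
residuals = Y - Y_pred

# With an intercept term, least squares residuals average to (numerically) zero
print("Mean residual:", residuals.mean())

# Residuals next to predictions: look for patterns or changing spread
for pred, res in zip(Y_pred, residuals):
    print(f"predicted = {pred:.3f}, residual = {res:+.3f}")

# Rough multicollinearity check: correlation between the two features
feature_corr = np.corrcoef(X[:, 0], X[:, 1])[0, 1]
print(f"Correlation between x1 and x2: {feature_corr:.3f}")  # 0.949
```

A correlation this close to 1 suggests the two features in the example carry largely overlapping information, which can make the individual coefficients unstable even when the predictions look reasonable.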

7. Applications of Linear Regression

Linear regression is widely used in various fields, including:

  • Predictive modeling: To predict outcomes like house prices, sales, etc.
  • Risk assessment: Estimating the risk factors for insurance and finance.
  • Econometrics: Understanding relationships between economic indicators.
  • Marketing: Predicting the effect of advertising spend on sales.

Conclusion

Linear regression is a fundamental machine learning algorithm used for regression tasks. It models the linear relationship between features and the target variable, providing an interpretable equation that can be used to make predictions. Whether it’s simple or multiple linear regression, Python’s scikit-learn makes it easy to implement and evaluate linear regression models.
