Python Machine Learning – Normal Data Distribution

1 year ago

In machine learning, a normal distribution (also known as a Gaussian distribution or a bell curve) is a probability distribution where most data points are concentrated around the mean, with fewer data points appearing as you move further from the mean in both directions. This distribution is symmetric and has the following characteristics:

The mean (average), median, and mode are all equal and located at the center of the distribution.
Approximately 68% of the data falls within one standard deviation of the mean.
About 95% of the data lies within two standard deviations of the mean.
About 99.7% of the data is within three standard deviations of the mean.
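These percentages can be checked empirically. The snippet below is a minimal sketch, assuming a simulated sample with mean 50 and standard deviation 10 (illustrative values only) and a fixed seed so the run is repeatable:

```python
import numpy as np

# Assumed parameters, chosen only for illustration
rng = np.random.default_rng(seed=0)
mean, std_dev = 50, 10
data = rng.normal(loc=mean, scale=std_dev, size=100_000)

# Fraction of samples that fall within k standard deviations of the mean
coverage = {}
for k in (1, 2, 3):
    within = np.abs(data - mean) <= k * std_dev
    coverage[k] = within.mean()
    print(f"Within {k} standard deviation(s): {coverage[k]:.3f}")
```

With a sample this large, the three fractions should land close to 0.68, 0.95, and 0.997, though the exact digits depend on the random draw.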

1. Generating a Normal Distribution in Python

You can use Python libraries like NumPy to generate normally distributed data.

Example: Generating Normal Distribution with NumPy

import numpy as np
import matplotlib.pyplot as plt

# Generate random data with a normal distribution
# mean = 50, standard deviation = 10, and 1000 data points
data = np.random.normal(50, 10, 1000)

# Plot a histogram of the data
plt.hist(data, bins=30, edgecolor='black')

# Add labels and title
plt.title('Normal Distribution')
plt.xlabel('Value')
plt.ylabel('Frequency')

# Show the plot
plt.show()

In this example:

np.random.normal(loc, scale, size) generates a dataset that follows a normal distribution, where loc is the mean, scale is the standard deviation, and size is the number of data points.

Output:

The histogram will show a bell curve representing the normal distribution.
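If you need the same random numbers on every run, one option is NumPy's Generator API with a fixed seed. This is a sketch rather than a required change; the seed value 42 is an arbitrary assumption:

```python
import numpy as np

# A seeded generator makes the sample reproducible (seed value is arbitrary)
rng = np.random.default_rng(seed=42)
data = rng.normal(loc=50, scale=10, size=1000)

# The sample statistics should be close to the requested parameters
sample_mean = data.mean()
sample_std = data.std(ddof=1)  # ddof=1 gives the sample standard deviation
print(f"Sample mean: {sample_mean:.2f}, sample std: {sample_std:.2f}")
```

The sample mean and standard deviation will not be exactly 50 and 10, because 1000 points are only a finite sample from the distribution.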

2. Probability Density Function (PDF)

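The probability density function (PDF) describes the relative likelihood of different values in a normal distribution. Because the variable is continuous, the PDF gives a density rather than a probability: the probability of landing in an interval is the area under the curve over that interval. For a normally distributed variable, the PDF is given by: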

$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)$

Where:

$\mu$ is the mean,
$\sigma$ is the standard deviation,
$x$ is the value.
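To see the formula in action, the sketch below evaluates it directly and compares the result with scipy.stats.norm.pdf. The helper name normal_pdf and the parameter values are assumptions made for this illustration:

```python
import numpy as np
from scipy.stats import norm


def normal_pdf(x, mu, sigma):
    """Evaluate the normal PDF formula term by term (illustrative helper)."""
    coeff = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
    return coeff * np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


# Assumed parameters for illustration
mu, sigma = 50.0, 10.0
x = np.linspace(20, 80, 7)

manual = normal_pdf(x, mu, sigma)
reference = norm.pdf(x, loc=mu, scale=sigma)

# Both approaches should agree up to floating-point error
print(np.allclose(manual, reference))
```

In practice you would normally call norm.pdf directly; writing the formula out by hand is mainly useful for checking your understanding of each term.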

3. Standard Normal Distribution

A standard normal distribution is a special case of the normal distribution where:

The mean is 0.
The standard deviation is 1.

To convert any normal distribution to a standard normal distribution, you can use z-scores. The z-score of a data point tells you how many standard deviations the point is from the mean. The formula is:

$Z = \frac{x - \mu}{\sigma}$

Where:

$x$ is the data point,
$\mu$ is the mean of the data,
$\sigma$ is the standard deviation of the data.
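As a quick worked example, a value of 70 in a distribution with mean 50 and standard deviation 10 has a z-score of (70 - 50) / 10 = 2. The sketch below applies the same formula to a whole array, assuming simulated data and using the sample's own mean and standard deviation:

```python
import numpy as np

# Single value: x = 70, mu = 50, sigma = 10
z_single = (70 - 50) / 10

# Whole dataset: standardize with the sample's own mean and standard deviation
rng = np.random.default_rng(seed=1)
data = rng.normal(loc=50, scale=10, size=1000)
z_scores = (data - data.mean()) / data.std()

print(f"z-score of 70: {z_single}")
print(f"Mean of z-scores: {z_scores.mean():.3f}")
print(f"Std of z-scores: {z_scores.std():.3f}")
```

After standardizing, the z-scores have mean 0 and standard deviation 1 up to floating-point error.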

4. Visualizing a Normal Distribution with a Density Plot

In addition to histograms, you can use density plots to visualize the probability density of a normal distribution. The seaborn library in Python makes it easy to create these plots.

Example: Density Plot for Normal Distribution

import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

# Generate random data from a normal distribution
data = np.random.normal(50, 10, 1000)

# Create a density plot
sns.kdeplot(data, fill=True)  # 'fill' replaces the deprecated 'shade' argument

# Add labels and title
plt.title('Density Plot of Normal Distribution')
plt.xlabel('Value')

# Show the plot
plt.show()

5. Normal Distribution in Machine Learning

Many machine learning algorithms make assumptions about the distribution of data, and normal distribution is often the preferred assumption due to its mathematical properties. For example:

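Linear regression assumes that the residuals (errors) are normally distributed when you compute confidence intervals and p-values; the least-squares fit itself does not require it.
Naive Bayes classifier with continuous variables assumes each feature is normally distributed within each class (Gaussian Naive Bayes).
Principal Component Analysis (PCA) does not require normality, but its components are easiest to interpret when the features are roughly normal, because for jointly Gaussian data uncorrelated components are also independent.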

If your data does not follow a normal distribution, you can apply data transformations to make it more normal (e.g., log transformation, Box-Cox transformation).

6. Checking for Normality

Before applying an algorithm that assumes normality, it’s worth checking whether your data is approximately normal. There are several ways to do this in Python:

A. Histogram

A simple visual inspection of the data using a histogram can give you a rough idea of whether the data is normally distributed.

import matplotlib.pyplot as plt

# Reuses the `data` array generated in the earlier examples
plt.hist(data, bins=30, edgecolor='black')
plt.title('Histogram of Data')
plt.show()

B. Q-Q Plot (Quantile-Quantile Plot)

A Q-Q plot compares the quantiles of the data with the quantiles of a normal distribution. If the data is normally distributed, the points will lie along a straight diagonal line.

import scipy.stats as stats
import matplotlib.pyplot as plt

# Generate normal Q-Q plot
stats.probplot(data, dist="norm", plot=plt)
plt.title('Q-Q Plot')
plt.show()

In a Q-Q plot, if the data points deviate significantly from the diagonal, it suggests that the data is not normally distributed.
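Alongside the visual check, probplot also returns the correlation coefficient r of the least-squares line fitted through the Q-Q points, which gives a rough numeric summary of how straight the plot is. The sketch below, assuming two simulated samples, compares r for normal and right-skewed data without drawing anything:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
normal_sample = rng.normal(loc=50, scale=10, size=1000)
skewed_sample = rng.exponential(scale=2, size=1000)

# Without plot=..., probplot returns the Q-Q points and the fitted line:
# (theoretical quantiles, ordered values), (slope, intercept, r)
_, (_, _, r_normal) = stats.probplot(normal_sample, dist="norm")
_, (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

print(f"r for normal sample: {r_normal:.4f}")
print(f"r for skewed sample: {r_skewed:.4f}")
```

A value of r very close to 1 means the points hug the line; the skewed sample should give a noticeably lower value. This is a heuristic, not a formal test.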

C. Shapiro-Wilk Test

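The Shapiro-Wilk test is a statistical test of the null hypothesis that the data was drawn from a normal distribution. It returns a test statistic and a p-value. A p-value below your chosen significance level (commonly 0.05) is evidence against normality. A larger p-value means the test found no evidence of non-normality, which is not the same as proving that the data is normal. With very large samples, the test can flag even small, practically irrelevant deviations, so combine it with a visual check.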

from scipy.stats import shapiro

# Perform the Shapiro-Wilk test for normality
stat, p = shapiro(data)

print(f'Statistic={stat:.3f}, p={p:.3f}')

# Interpret the result at the 5% significance level
if p > 0.05:
    print('No evidence against normality (fail to reject H0)')
else:
    print('Data does not look normally distributed (reject H0)')
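To get a feel for how the test behaves, it can help to run it on one sample that really is normal and one that clearly is not. This is a sketch on simulated data; the sample sizes and seed are arbitrary assumptions:

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(seed=3)
normal_sample = rng.normal(loc=50, scale=10, size=500)
skewed_sample = rng.exponential(scale=2, size=500)

# shapiro returns the W statistic (closer to 1 means more normal-looking)
# and the p-value for the null hypothesis of normality
stat_normal, p_normal = shapiro(normal_sample)
stat_skewed, p_skewed = shapiro(skewed_sample)

print(f"Normal sample: W={stat_normal:.3f}, p={p_normal:.3g}")
print(f"Skewed sample: W={stat_skewed:.3f}, p={p_skewed:.3g}")
```

The exponential sample should produce a p-value far below 0.05. The p-value for the normal sample varies from draw to draw, so treat any single result with some caution.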

7. Data Transformation for Normality

If your data is not normally distributed, you can apply transformations to make it more normal. Common transformations include:

Log Transformation: Useful when the data is skewed to the right.
Square Root Transformation: Can reduce the impact of large values.
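Box-Cox Transformation: A family of power transformations (the log transformation is a special case) whose parameter is chosen to make the data as close to normal as possible. It requires strictly positive values.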
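As a rough illustration of Box-Cox, the sketch below uses scipy.stats.boxcox, which fits the transformation parameter (lambda) by maximum likelihood when none is supplied. The log-normal test data is an assumption chosen because it is strictly positive and right-skewed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=4)

# Strictly positive, right-skewed data (Box-Cox requires values > 0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# With no lambda given, boxcox returns the transformed data and the fitted lambda
transformed, fitted_lambda = stats.boxcox(skewed)

skew_before = stats.skew(skewed)
skew_after = stats.skew(transformed)

print(f"Fitted lambda: {fitted_lambda:.3f}")
print(f"Skewness before: {skew_before:.3f}, after: {skew_after:.3f}")
```

For log-normal data the fitted lambda should come out near 0, which corresponds to a plain log transformation, and the skewness should drop to roughly zero.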

Example: Log Transformation

import numpy as np
import matplotlib.pyplot as plt

# Generate skewed data
data = np.random.exponential(scale=2, size=1000)

# Apply log transformation (log1p computes log(1 + x), so zeros are handled)
log_data = np.log1p(data)

# Plot the transformed data
plt.hist(log_data, bins=30, edgecolor='black')
plt.title('Log Transformed Data')
plt.show()
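A histogram shows the change visually; skewness gives a number to go with it. The sketch below, assuming the same kind of exponential data, compares the sample skewness before and after the log transformation using scipy.stats.skew:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(seed=5)
data = rng.exponential(scale=2, size=1000)

# log1p(x) is log(1 + x), the same transformation as in the example above
log_data = np.log1p(data)

skew_before = skew(data)
skew_after = skew(log_data)

print(f"Skewness before: {skew_before:.3f}")
print(f"Skewness after:  {skew_after:.3f}")
```

A skewness near 0 indicates a roughly symmetric distribution. The log transformation reduces the skew here but does not necessarily make the data exactly normal.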

8. Application of Normal Distribution in Machine Learning

Normal distributions play an important role in machine learning due to the following reasons:

Model Assumptions: Some models assume the data is normally distributed. For example, the Gaussian Naive Bayes classifier assumes that each feature follows a normal distribution within each class.
Error Distribution: In regression problems, residuals (errors) are often assumed to be normally distributed. Roughly normal residuals are consistent with a well-specified model, whereas strongly skewed or heavy-tailed residuals can point to a missing feature, outliers, or the need for a transformation.
Feature Scaling: Normality and scale are separate properties. Standardization (subtracting the mean and dividing by the standard deviation) puts features on a similar scale, which can improve model performance, but it does not change the shape of the distribution. Making skewed data more normal requires a transformation such as log or Box-Cox.

Conclusion

Understanding and working with normal data distribution is important in machine learning, as many models rely on assumptions of normality. In Python, libraries like NumPy, SciPy, Matplotlib, and Seaborn make it easy to generate, visualize, and assess the normality of your data. When data does not follow a normal distribution, transformations like log or Box-Cox can help make the data more suitable for machine learning models.
