Python Machine Learning – Mean Median Mode
In Machine Learning, understanding the concepts of Mean, Median, and Mode is essential for analyzing and summarizing data. These are basic statistical measures used to describe the central tendency of a dataset. Let’s go through each concept and how you can compute them in Python.
1. Mean (Average)
The mean is the sum of all the values in a dataset divided by the number of values. It represents the “average” of the data.
Formula for Mean:
$$\text{Mean} = \frac{\sum \text{(All Values)}}{\text{Number of Values}}$$
Example: Calculating the Mean in Python
You can compute the mean using Python’s built-in functions or libraries like NumPy.
import numpy as np
# Example dataset
data = [5, 10, 15, 20, 25]
# Calculate mean using NumPy
mean_value = np.mean(data)
print(f'Mean: {mean_value}')
Output:
Mean: 15.0
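The same result can be reached without NumPy. The following is a minimal sketch that applies the formula directly with the built-in sum() and len() and cross-checks it against statistics.mean from the standard library. The variable names manual_mean and stats_mean are illustrative only.

```python
import statistics

# Same example dataset as above
data = [5, 10, 15, 20, 25]

# Apply the formula directly: sum of all values / number of values
manual_mean = sum(data) / len(data)

# Cross-check with the standard library
stats_mean = statistics.mean(data)

print(f'Manual mean: {manual_mean}')
print(f'statistics.mean: {stats_mean}')
print(f'Results agree: {manual_mean == stats_mean}')
```

Both approaches give a mean of 15 for this dataset. NumPy tends to be the better choice for large arrays, while the built-ins are enough for short lists.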
2. Median
The median is the middle value in a sorted dataset. If the dataset has an odd number of observations, the median is the middle number. If the dataset has an even number of observations, the median is the average of the two middle numbers.
Example: Calculating the Median in Python
You can compute the median using Python’s statistics module or NumPy.
import numpy as np
# Example dataset
data = [5, 10, 15, 20, 25]
# Calculate median using NumPy
median_value = np.median(data)
print(f'Median: {median_value}')
Output:
Median: 15.0
If the dataset has an even number of values, the median is the average of the two middle values.
Example with Even Number of Values:
data = [5, 10, 15, 20]
# Calculate median using NumPy
median_value = np.median(data)
print(f'Median (even dataset): {median_value}')
Output:
Median (even dataset): 12.5
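The odd and even cases described above can also be written out by hand. This is a sketch, not how NumPy implements it: manual_median is a hypothetical helper name, and it assumes a non-empty list of numbers. It is compared against statistics.median from the standard library.

```python
import statistics

def manual_median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)   # the median is defined on sorted data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value
        return ordered[mid]
    # Even count: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_data = [25, 5, 15, 10, 20]   # unsorted on purpose
even_data = [5, 10, 15, 20]

print(manual_median(odd_data))    # 15
print(manual_median(even_data))   # 12.5

# The standard library agrees with the hand-written version
print(statistics.median(odd_data) == manual_median(odd_data))
print(statistics.median(even_data) == manual_median(even_data))
```

Sorting first matters: taking the middle element of unsorted data would return an arbitrary value, not the median.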
3. Mode
The mode is the value that appears most frequently in a dataset. A dataset can have multiple modes if several values share the highest frequency. If no value repeats, the dataset is usually described as having no mode, since every value is tied at one occurrence.
Example: Calculating the Mode in Python
You can compute the mode using Python’s statistics module.
from statistics import mode
# Example dataset
data = [5, 10, 10, 15, 20, 25]
# Calculate mode using statistics module
mode_value = mode(data)
print(f'Mode: {mode_value}')
Output:
Mode: 10
In this case, 10 is the mode because it appears twice, more frequently than the other values.
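To see the frequencies behind that result, one option is to count occurrences with collections.Counter from the standard library. This is an illustrative sketch of the counting idea, not a description of how statistics.mode works internally.

```python
from collections import Counter

# Same example dataset as above
data = [5, 10, 10, 15, 20, 25]

# Count how many times each value occurs
counts = Counter(data)

# most_common(1) returns a list with the single (value, count) pair
# that has the highest count
value, frequency = counts.most_common(1)[0]

print(f'Most frequent value: {value}')   # 10
print(f'Occurrences: {frequency}')       # 2
```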
Handling Multiple Modes:
If there are multiple modes, mode() returns only the first one encountered (Python 3.8 and later). Use multimode() to get all the modes.
from statistics import multimode
# Example dataset with multiple modes
data = [5, 10, 10, 15, 15, 20]
# Calculate multiple modes using statistics module
modes = multimode(data)
print(f'Modes: {modes}')
Output:
Modes: [10, 15]
4. Python Libraries for Calculating Mean, Median, and Mode
- NumPy: Useful for large numerical arrays and matrices. It includes functions for computing mean and median, but no built-in function for the mode.
- Statistics Module: Part of Python’s standard library, useful for computing mean, median, mode, and other statistical measures on basic lists.
Example: Using the Statistics Module
import statistics
# Example dataset
data = [5, 10, 15, 20, 25]
# Calculate mean and median using the statistics module
mean_value = statistics.mean(data)
median_value = statistics.median(data)
# No value repeats in this dataset. On Python 3.8+, statistics.mode(data) returns
# the first value (5); older versions raised StatisticsError instead.
print(f'Mean: {mean_value}')
print(f'Median: {median_value}')
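For a dataset like this one, where no value repeats, the mode functions still return something. The sketch below assumes Python 3.8 or later, where multimode() exists and mode() no longer raises on ties. The has_repeated_value check is a hypothetical convenience for detecting the "no mode" case, not part of the statistics module.

```python
import statistics

# Every value occurs exactly once
data = [5, 10, 15, 20, 25]

# multimode() returns every value tied for the highest count,
# which here is the whole dataset
all_modes = statistics.multimode(data)

# mode() returns the first of those tied values (Python 3.8+)
first_mode = statistics.mode(data)

# If every distinct value is a "mode", nothing actually repeats
has_repeated_value = len(all_modes) < len(set(data))

print(f'multimode: {all_modes}')                    # [5, 10, 15, 20, 25]
print(f'mode: {first_mode}')                        # 5
print(f'Has a repeated value: {has_repeated_value}')  # False
```

A check like this helps avoid reporting 5 as "the most common value" when it is only the first of five equally common ones.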
5. Handling Missing Values in Mean, Median, Mode
In real-world datasets, you may encounter missing values (NaN). Libraries like pandas handle these cases: Series.mean() and Series.median() skip missing values by default (skipna=True), and a None in numeric data is stored as NaN.
import pandas as pd
# Example dataset with NaN (missing value)
data = [5, 10, None, 20, 25]
# Create a pandas Series
data_series = pd.Series(data)
# Calculate mean and median, ignoring NaN values
mean_value = data_series.mean()
median_value = data_series.median()
print(f'Mean (ignoring NaN): {mean_value}')
print(f'Median (ignoring NaN): {median_value}')
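If pandas is not available, a similar effect can be approximated with the standard library by filtering out missing entries before computing the statistics. This is a sketch under the assumption that missing values appear as None or float NaN. The helper is_missing is a hypothetical name introduced for illustration.

```python
import math
import statistics

# Dataset with two kinds of missing value: None and float NaN
data = [5, 10, None, 20, float('nan'), 25]

def is_missing(x):
    """Treat None and float NaN as missing."""
    return x is None or (isinstance(x, float) and math.isnan(x))

# Keep only the values that are actually present
clean = [x for x in data if not is_missing(x)]

mean_value = statistics.mean(clean)
median_value = statistics.median(clean)

print(f'Clean data: {clean}')               # [5, 10, 20, 25]
print(f'Mean (missing removed): {mean_value}')
print(f'Median (missing removed): {median_value}')
```

Filtering first matters because the statistics functions do not skip missing values on their own: None raises a TypeError, and NaN can silently distort the result.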
6. When to Use Mean, Median, and Mode
- Mean: Best when the data is roughly symmetric (for example, approximately normally distributed) and has no extreme outliers. A single outlier can pull the mean far from the typical value.
- Median: Best when the data is skewed or contains outliers. The median is less sensitive to extreme values.
- Mode: Best for categorical data or when you need to know the most common value in a dataset.
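The guidelines above can be illustrated with a small experiment. The sketch below uses made-up data: it replaces one value with an outlier to compare how the mean and median react, then takes the mode of a categorical list.

```python
import statistics

typical = [5, 10, 15, 20, 25]
with_outlier = [5, 10, 15, 20, 1000]   # last value replaced by an outlier

# The mean shifts dramatically; the median does not move
mean_typical = statistics.mean(typical)            # 15
mean_outlier = statistics.mean(with_outlier)       # 210
median_typical = statistics.median(typical)        # 15
median_outlier = statistics.median(with_outlier)   # 15

print(f'Mean:   {mean_typical} -> {mean_outlier}')
print(f'Median: {median_typical} -> {median_outlier}')

# Mean and median are undefined for categories, but the mode still works
colors = ['red', 'blue', 'red', 'green']
most_common_color = statistics.mode(colors)
print(f'Most common color: {most_common_color}')   # red
```

One outlier moves the mean from 15 to 210 while the median stays at 15, which is why the median is often preferred for skewed data such as incomes or house prices.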
Conclusion
In machine learning and data analysis, understanding Mean, Median, and Mode is crucial for summarizing and understanding data distributions. Python provides easy-to-use libraries like NumPy and the statistics module to perform these calculations.