Python Machine Learning – K-nearest neighbors
The K-Nearest Neighbors (K-NN) algorithm is one of the simplest and most widely used machine learning algorithms. It is a non-parametric, lazy learning algorithm used for both classification and regression tasks. K-NN classifies data points based on the majority class among the K nearest data points in the feature space.
1. How K-NN Works
The core idea of K-NN is:
- Classification: To classify a new data point, the algorithm finds the K nearest neighbors (data points) from the training data, based on a distance metric (such as Euclidean distance), and assigns the class that is most frequent among these neighbors.
- Regression: For regression tasks, the K nearest neighbors are used to compute the average (or weighted average) of the target values to predict the target for a new data point.
The K in K-NN refers to the number of nearest neighbors to consider. The choice of K significantly affects the model’s performance:
- A small K may lead to overfitting (high variance).
- A large K may lead to underfitting (high bias).
2. Steps of the K-NN Algorithm
- Choose the number of K neighbors.
- Calculate the distance between the new data point and all other data points in the dataset.
- Sort the distances and select the K nearest neighbors.
- Assign the class for classification tasks (based on majority voting) or predict the value for regression tasks (using average or weighted average).
- Return the result.
3. Distance Metrics Used in K-NN
K-NN uses distance metrics to determine which data points are closer to the new data point. Common distance metrics include:
- Euclidean Distance (most commonly used):
- Manhattan Distance:
- Minkowski Distance (generalized form of Euclidean and Manhattan):
- Hamming Distance (for categorical data): Measures the number of positions at which corresponding elements differ.
4. Implementing K-NN in Python Using Scikit-learn
Scikit-learn provides an easy-to-use implementation of the K-NN algorithm for both classification and regression.
Example 1: K-NN Classification
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Load the iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create a K-NN classifier with K=5
knn_classifier = KNeighborsClassifier(n_neighbors=5)
# Train the classifier
knn_classifier.fit(X_train, y_train)
# Predict on the test set
y_pred = knn_classifier.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
Explanation:
- The Iris dataset is used for multi-class classification.
- The dataset is split into training and testing sets.
- A K-NN classifier is created with
n_neighbors=5. - The model is trained on the training data and then tested on the test set.
- The accuracy of the classifier is printed.
Example 2: K-NN Regression
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
# Generate a sample regression dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create a K-NN regressor with K=5
knn_regressor = KNeighborsRegressor(n_neighbors=5)
# Train the regressor
knn_regressor.fit(X_train, y_train)
# Predict on the test set
y_pred = knn_regressor.predict(X_test)
# Calculate mean squared error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.4f}")
Explanation:
- A synthetic dataset for regression is generated using
make_regression. - The data is split into training and testing sets.
- A K-NN regressor is created with
n_neighbors=5. - The model is trained and predictions are made on the test set.
- The mean squared error (MSE) is used to evaluate the performance.
5. Choosing the Value of K
The value of K is crucial for the performance of K-NN. To choose the best value for K:
- Try different values of K (for example, using cross-validation).
- A smaller K might lead to a model that fits the noise in the data (overfitting).
- A larger K might smooth out the predictions, potentially missing important patterns (underfitting).
Example: Finding the Best K
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
# Load the iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# Loop over different values of K and calculate cross-validated accuracy
k_values = list(range(1, 11))
accuracy_scores = []
for k in k_values:
knn = KNeighborsClassifier(n_neighbors=k)
scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')
accuracy_scores.append(scores.mean())
# Print accuracy for each value of K
for i, k in enumerate(k_values):
print(f"K={k}, Accuracy={accuracy_scores[i]:.4f}")
6. Strengths of K-NN
- Simple and easy to implement.
- No training phase: It stores the entire training data and performs classification or regression at the time of prediction.
- Flexible: Works well for both classification and regression tasks.
- No assumptions about the underlying data distribution.
7. Limitations of K-NN
- Computationally expensive: Since K-NN stores all training data, prediction involves calculating the distance to all other points in the dataset, which can be slow for large datasets.
- Sensitive to irrelevant features: All features are treated equally when calculating distance, so feature scaling (normalization) is essential.
- Sensitive to noisy data and outliers: K-NN can be heavily influenced by outliers if K is small.
- Memory-intensive: K-NN requires the entire dataset to be stored in memory.
8. Improving K-NN
- Feature scaling: Normalize or standardize the data to improve distance-based algorithms like K-NN.
- Weighted K-NN: Instead of treating all neighbors equally, you can give more weight to closer neighbors by using a distance-based weight.
from sklearn.neighbors import KNeighborsClassifier
# Use distance-based weights (closer neighbors are given more weight)
knn_weighted = KNeighborsClassifier(n_neighbors=5, weights='distance')
knn_weighted.fit(X_train, y_train)
9. Conclusion
K-NN is a simple and intuitive algorithm that can be used for both classification and regression tasks. Although it is not always the most efficient or accurate algorithm for large datasets, K-NN works well for small datasets and is often used as a baseline model. With proper tuning of the hyperparameter K and scaling of the data, K-NN can provide reasonable results for many real-world applications.