Python Machine Learning – K-nearest neighbors

The K-Nearest Neighbors (K-NN) algorithm is one of the simplest and most widely used machine learning algorithms. It is a non-parametric, lazy learning algorithm used for both classification and regression tasks. K-NN classifies data points based on the majority class among the K nearest data points in the feature space.

1. How K-NN Works

The core idea of K-NN is:

  • Classification: To classify a new data point, the algorithm finds the K nearest neighbors (data points) from the training data, based on a distance metric (such as Euclidean distance), and assigns the class that is most frequent among these neighbors.
  • Regression: For regression tasks, the K nearest neighbors are used to compute the average (or weighted average) of the target values to predict the target for a new data point.

The K in K-NN refers to the number of nearest neighbors to consider. The choice of K significantly affects the model’s performance:

  • A small K may lead to overfitting (high variance).
  • A large K may lead to underfitting (high bias).

2. Steps of the K-NN Algorithm

  1. Choose the number of K neighbors.
  2. Calculate the distance between the new data point and all other data points in the dataset.
  3. Sort the distances and select the K nearest neighbors.
  4. Assign the class for classification tasks (based on majority voting) or predict the value for regression tasks (using average or weighted average).
  5. Return the result.

3. Distance Metrics Used in K-NN

K-NN uses distance metrics to determine which data points are closer to the new data point. Common distance metrics include:

  • Euclidean Distance (most commonly used):

    d(x,y)=i=1n(xiyi)2d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i – y_i)^2}

  • Manhattan Distance:

    d(x,y)=i=1nxiyid(x, y) = \sum_{i=1}^{n} |x_i – y_i|

  • Minkowski Distance (generalized form of Euclidean and Manhattan):

    d(x,y)=(i=1nxiyip)1/pd(x, y) = \left( \sum_{i=1}^{n} |x_i – y_i|^p \right)^{1/p}

  • Hamming Distance (for categorical data): Measures the number of positions at which corresponding elements differ.

4. Implementing K-NN in Python Using Scikit-learn

Scikit-learn provides an easy-to-use implementation of the K-NN algorithm for both classification and regression.

Example 1: K-NN Classification

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a K-NN classifier with K=5
knn_classifier = KNeighborsClassifier(n_neighbors=5)

# Train the classifier
knn_classifier.fit(X_train, y_train)

# Predict on the test set
y_pred = knn_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

Explanation:

  • The Iris dataset is used for multi-class classification.
  • The dataset is split into training and testing sets.
  • A K-NN classifier is created with n_neighbors=5.
  • The model is trained on the training data and then tested on the test set.
  • The accuracy of the classifier is printed.

Example 2: K-NN Regression

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Generate a sample regression dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a K-NN regressor with K=5
knn_regressor = KNeighborsRegressor(n_neighbors=5)

# Train the regressor
knn_regressor.fit(X_train, y_train)

# Predict on the test set
y_pred = knn_regressor.predict(X_test)

# Calculate mean squared error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.4f}")

Explanation:

  • A synthetic dataset for regression is generated using make_regression.
  • The data is split into training and testing sets.
  • A K-NN regressor is created with n_neighbors=5.
  • The model is trained and predictions are made on the test set.
  • The mean squared error (MSE) is used to evaluate the performance.

5. Choosing the Value of K

The value of K is crucial for the performance of K-NN. To choose the best value for K:

  • Try different values of K (for example, using cross-validation).
  • A smaller K might lead to a model that fits the noise in the data (overfitting).
  • A larger K might smooth out the predictions, potentially missing important patterns (underfitting).

Example: Finding the Best K

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

# Load the iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Loop over different values of K and calculate cross-validated accuracy
k_values = list(range(1, 11))
accuracy_scores = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')
    accuracy_scores.append(scores.mean())

# Print accuracy for each value of K
for i, k in enumerate(k_values):
    print(f"K={k}, Accuracy={accuracy_scores[i]:.4f}")

6. Strengths of K-NN

  • Simple and easy to implement.
  • No training phase: It stores the entire training data and performs classification or regression at the time of prediction.
  • Flexible: Works well for both classification and regression tasks.
  • No assumptions about the underlying data distribution.

7. Limitations of K-NN

  • Computationally expensive: Since K-NN stores all training data, prediction involves calculating the distance to all other points in the dataset, which can be slow for large datasets.
  • Sensitive to irrelevant features: All features are treated equally when calculating distance, so feature scaling (normalization) is essential.
  • Sensitive to noisy data and outliers: K-NN can be heavily influenced by outliers if K is small.
  • Memory-intensive: K-NN requires the entire dataset to be stored in memory.

8. Improving K-NN

  • Feature scaling: Normalize or standardize the data to improve distance-based algorithms like K-NN.
  • Weighted K-NN: Instead of treating all neighbors equally, you can give more weight to closer neighbors by using a distance-based weight.
from sklearn.neighbors import KNeighborsClassifier

# Use distance-based weights (closer neighbors are given more weight)
knn_weighted = KNeighborsClassifier(n_neighbors=5, weights='distance')
knn_weighted.fit(X_train, y_train)

9. Conclusion

K-NN is a simple and intuitive algorithm that can be used for both classification and regression tasks. Although it is not always the most efficient or accurate algorithm for large datasets, K-NN works well for small datasets and is often used as a baseline model. With proper tuning of the hyperparameter K and scaling of the data, K-NN can provide reasonable results for many real-world applications.

Leave a Reply 0

Your email address will not be published. Required fields are marked *