Python Machine Learning – K-means

K-means is a popular unsupervised machine learning algorithm used for clustering data into distinct groups. It attempts to partition the data into K clusters based on similarity, where each data point belongs to the cluster with the closest mean (centroid).

1. How K-means Works

The K-means algorithm operates in the following steps:

  1. Initialization: Select K random points as initial centroids (mean points of the clusters).
  2. Assignment: Assign each data point to the nearest centroid. This step forms K clusters.
  3. Update: Compute the new centroid of each cluster by taking the mean of the points in the cluster.
  4. Repeat: Repeat steps 2 and 3 until the centroids no longer change significantly, indicating convergence.
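The four steps above can be sketched in plain NumPy. This is a minimal illustration rather than how Scikit-learn implements the algorithm; the helper name kmeans_sketch, its parameters (n_iters, tol, seed), and the toy data are assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, tol=1e-6, seed=0):
    """Plain NumPy sketch of the four steps (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids have effectively stopped moving.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels

# Two well-separated groups of toy points.
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centers_demo, labels_demo = kmeans_sketch(X_demo, k=2)
print(centers_demo)
print(labels_demo)
```

On this toy data the first three points should end up in one cluster and the last three in the other. Which cluster receives label 0 depends on the random initialization.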

2. Python Implementation of K-means Using Scikit-learn

Scikit-learn provides an easy way to implement the K-means algorithm using the KMeans class.

Example: K-means on a Simple Dataset

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate a sample dataset (y holds the true blob labels and is not used by K-means)
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Plot the dataset
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.show()

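# Apply K-means with 4 clusters
# n_init is set explicitly because its default differs between Scikit-learn versions;
# random_state makes the result reproducible
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
kmeans.fit(X)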

# Get the cluster centers (centroids) and labels
centroids = kmeans.cluster_centers_
labels = kmeans.labels_

# Plot the data with the centroids
plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='viridis')
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=200, alpha=0.75)
plt.show()

Explanation:

  • We use the make_blobs function to generate a random dataset with 4 clusters.
  • The KMeans object is initialized with n_clusters=4, which means the algorithm will find 4 clusters.
  • After fitting the model to the data using kmeans.fit(X), we extract the centroids and labels (cluster assignments).
  • The data points are visualized along with the centroids.

Output:

The plot shows the dataset with color-coded clusters and the red centroids representing the center of each cluster.
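Once the model is fitted, it can also assign previously unseen points to the nearest learned centroid with kmeans.predict. The sketch below refits the same model so that it runs on its own; the two new_points are made-up values chosen only for illustration, and the manual nearest-centroid lookup is there to show what predict is expected to compute.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Hypothetical new observations (values invented for this example)
new_points = np.array([[0.0, 4.0], [-1.5, 3.0]])
new_labels = kmeans.predict(new_points)

# Manual check: index of the closest centroid for each new point
dists = np.linalg.norm(
    new_points[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2
)
manual_labels = dists.argmin(axis=1)

print(new_labels, manual_labels)
```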

3. Choosing the Right Number of Clusters (K)

One of the challenges with K-means is selecting the right number of clusters (K), because the algorithm does not determine it for you. A common heuristic for choosing K is the Elbow Method.

Elbow Method:

The Elbow Method involves running K-means for different values of K and plotting the inertia, that is, the sum of squared distances between data points and their assigned cluster centroids. The goal is to find the point where the decrease in inertia slows down, forming an “elbow” in the plot. This point is taken as a reasonable choice for K. It is a visual heuristic, and on some datasets the elbow is not sharply defined.

Example of Elbow Method:

inertia = []
K_values = range(1, 10)

for k in K_values:
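    # Fixed n_init and random_state keep the curve reproducible across runs and versions
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)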
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)

# Plot the elbow graph
plt.plot(K_values, inertia, 'bx-')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal K')
plt.show()

Explanation:

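  • inertia is the sum of squared distances between each point and its assigned cluster’s centroid. Lower inertia means tighter clusters, but inertia generally keeps falling as K grows (it reaches zero when every point is its own cluster), so the lowest value alone does not identify the best K.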
  • By plotting the inertia values against the number of clusters, you can find the “elbow” point where the inertia decreases less dramatically.
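To make the definition of inertia concrete, the following sketch recomputes it by hand from the fitted centroids and labels and compares it with the inertia_ attribute. The variable names km, assigned, and manual_inertia are chosen for this example, and the dataset is regenerated so the block runs on its own.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Centroid of the cluster each point was assigned to, one row per point
assigned = km.cluster_centers_[km.labels_]

# Sum of squared distances between points and their assigned centroids
manual_inertia = np.sum((X - assigned) ** 2)

print(manual_inertia, km.inertia_)
```

The two numbers should agree up to floating-point rounding.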

4. Limitations of K-means

  • Sensitive to the initial centroids: The algorithm converges to a local optimum, so different initializations may lead to different results.
  • Not good for non-spherical clusters: Points are assigned by distance to the nearest centroid, so K-means works best when clusters are roughly spherical and similar in size.
  • Requires you to know K: You must specify the number of clusters in advance.
  • Sensitive to outliers: Centroids are means, so a few extreme points can pull them away from the bulk of a cluster.
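The sensitivity to initialization can be probed with a small experiment: run K-means several times with a single purely random initialization (init="random", n_init=1) and compare the final inertia across seeds. This is only a sketch on the same synthetic blobs used earlier. On such an easy dataset most runs may reach the same solution, so the spread between the best and worst run is not guaranteed to be large.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# One random initialization per run; only the seed changes between runs
inertias = []
for seed in range(10):
    km = KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit(X)
    inertias.append(km.inertia_)

best, worst = min(inertias), max(inertias)
print(f"best inertia: {best:.1f}, worst inertia: {worst:.1f}")
```

Setting n_init to a value greater than 1 asks Scikit-learn to repeat this process internally and keep the run with the lowest inertia.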

5. K-means++ Initialization

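To improve the robustness of K-means, Scikit-learn uses the K-means++ initialization by default (init='k-means++'). Instead of picking all centroids uniformly at random, K-means++ picks the first centroid at random and then chooses each subsequent centroid with probability proportional to its squared distance from the nearest centroid already chosen. This spreads the initial centroids out and makes poor clustering results caused by unlucky starting positions less likely.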
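The seeding idea behind K-means++ can be sketched in a few lines of NumPy. The helper kmeans_pp_init below is a hypothetical name written for illustration, and it assumes the data contains at least two distinct points. Scikit-learn's own implementation differs in its details, so treat this as a simplified picture of the technique rather than the library's code.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Simplified K-means++ seeding (hypothetical helper, assumes distinct points exist)."""
    rng = np.random.default_rng(seed)
    # First centroid: a data point chosen uniformly at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # Squared distance from each point to its nearest already-chosen centroid
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Next centroid: sampled with probability proportional to that squared distance
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
init_demo = kmeans_pp_init(X_demo, k=2)
print(init_demo)
```

Because an already-chosen point has squared distance zero, it cannot be selected again, and far-away points are the most likely next picks.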

6. Applications of K-means

  • Customer segmentation: Group customers based on purchasing behavior.
  • Image compression: Cluster pixel values into a smaller set of colors.
  • Market segmentation: Segment market data to target different groups of users.
  • Document clustering: Group similar documents together.
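As an illustration of the image compression use case, the sketch below reduces the colors of an image to 8 by clustering its pixels and replacing each pixel with its cluster's centroid color. The image is a randomly generated 32x32 RGB array that stands in for a real picture, and the choice of 8 colors is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical 32x32 RGB "image" with random channel values in [0, 255]
image = rng.integers(0, 256, size=(32, 32, 3)).astype(np.float64)

# Treat every pixel as one 3-dimensional data point (R, G, B)
pixels = image.reshape(-1, 3)
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(pixels)

# Replace each pixel with the centroid color of its cluster
quantized = km.cluster_centers_[km.labels_].reshape(image.shape)

n_colors = len(np.unique(quantized.reshape(-1, 3), axis=0))
print(f"colors after quantization: {n_colors}")
```

The compressed image then only needs the small palette of centroid colors plus one cluster index per pixel.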

7. Conclusion

K-means is a fast and simple algorithm for clustering, but it requires careful selection of the number of clusters and can struggle with complex data distributions. The Elbow Method helps with choosing K, and K-means++ initialization makes the results less dependent on the starting centroids.
