Friday, 3 January 2025

Hour 10 K-Means Clustering

#### Concept

k-Means is an unsupervised learning algorithm used for clustering tasks. The goal is to partition a dataset into \( k \) clusters, where each data point belongs to the cluster with the nearest mean (the centroid). It is an iterative algorithm that aims to minimize the within-cluster variance, that is, the sum of squared distances between each point and the centroid of its cluster.
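Stated as a formula, the quantity being minimized is usually written as follows. The symbols \( C_j \) (the set of points assigned to cluster \( j \)) and \( \mu_j \) (the centroid of that cluster) are notation introduced here for illustration:

\[
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
\]

Dividing an inner sum by \( |C_j| \) gives the variance of that cluster around its centroid, which is why the objective is often described as minimizing within-cluster variance.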

The steps involved in k-Means clustering are:

1. Initialization: Choose \( k \) initial cluster centroids randomly.

2. Assignment: Assign each data point to the nearest cluster centroid.

3. Update: Recalculate the centroids as the mean of all points in each cluster.

4. Repeat: Repeat steps 2 and 3 until the centroids do not change significantly or a maximum number of iterations is reached.
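The four steps above can be sketched in plain NumPy. This is a minimal illustration, not how scikit-learn implements it: the function name `kmeans_sketch`, the parameters `max_iter`, `tol` and `seed`, and the choice to leave an empty cluster's centroid where it was are all assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, tol=1e-6, seed=0):
    """Plain NumPy k-means following the four steps (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points;
        # an empty cluster keeps its previous centroid (an assumption of this sketch).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids have effectively stopped moving.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return labels, centroids

# Hypothetical demo data: three 2D blobs centred at (0, 0), (5, 5) and (-5, -5).
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(c, 1.0, size=(100, 2)) for c in (0, 5, -5)])
labels, centroids = kmeans_sketch(X_demo, k=3)
print(centroids)
```

Because the starting centroids are random, a different `seed` can lead to a different final clustering, which is one reason library implementations usually run several initializations and keep the best one.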

#### Implementation Example

Suppose we have a dataset with points in 2D space, and we want to cluster them into \( k = 3 \) clusters.

# Import necessary libraries

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns


# Example data
np.random.seed(0)
X = np.vstack((np.random.normal(0, 1, (100, 2)),
               np.random.normal(5, 1, (100, 2)),
               np.random.normal(-5, 1, (100, 2))))

# Applying k-Means clustering
k = 3
# n_init is set explicitly because its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(X)

# Plotting the clusters
plt.figure(figsize=(8,6))
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y_kmeans, palette='viridis',
                s=50, edgecolor='k')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=300, c='red', label='Centroids')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('k-Means Clustering')
plt.legend()
plt.show()


## Explanation of the Code

1. Libraries: We import necessary libraries like numpy, pandas, sklearn, matplotlib, and seaborn.

2. Data Preparation: We generate a synthetic dataset with three clusters using normal distributions.

3. k-Means Clustering: We create a KMeans object with \( k=3 \) clusters. The fit_predict method fits the model to the data and returns the index of the cluster assigned to each data point.

4. Plotting: We scatter plot the data points with colors indicating the assigned clusters and plot the centroids in red.
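Once the model is fitted, it can also be inspected and reused. The sketch below rebuilds the same synthetic data so that it runs on its own, then prints the learned centroids and cluster sizes and assigns a few new points to clusters with `predict`. The values in `new_points` are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Same synthetic data as in the example above
np.random.seed(0)
X = np.vstack((np.random.normal(0, 1, (100, 2)),
               np.random.normal(5, 1, (100, 2)),
               np.random.normal(-5, 1, (100, 2))))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Learned centroids, one row per cluster
print(kmeans.cluster_centers_)
# Number of points assigned to each cluster
print(np.bincount(labels))

# Assign previously unseen points (hypothetical values) to the nearest learned centroid
new_points = np.array([[0.2, -0.1], [4.8, 5.3], [-5.5, -4.6]])
new_labels = kmeans.predict(new_points)
print(new_labels)
```

Note that the cluster indices themselves are arbitrary: which blob ends up as cluster 0, 1 or 2 depends on the initialization.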

#### Choosing the Number of Clusters

Selecting the appropriate number of clusters (\( k \)) is crucial. Common methods to determine \( k \) include:

- Elbow Method: Plot the within-cluster sum of squares (WCSS) against the number of clusters and look for an "elbow" point where the rate of decrease sharply slows.

- Silhouette Score: Measures how similar an object is to its own cluster compared to other clusters. Higher silhouette scores indicate better-defined clusters.
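The silhouette approach can be sketched with `silhouette_score` from `sklearn.metrics`. The example regenerates the three-blob data so it is self-contained; the candidate range of 2 to 6 clusters is an arbitrary choice for this illustration (the score is not defined for a single cluster).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Same synthetic data as in the implementation example
np.random.seed(0)
X = np.vstack((np.random.normal(0, 1, (100, 2)),
               np.random.normal(5, 1, (100, 2)),
               np.random.normal(-5, 1, (100, 2))))

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Mean silhouette coefficient over all points for this k
    scores[k] = silhouette_score(X, labels)
    print(f"k={k}: silhouette={scores[k]:.3f}")

# Pick the k with the highest average silhouette score
best_k = max(scores, key=scores.get)
print("k with the highest silhouette score:", best_k)
```

On well-separated data like this, the highest score should coincide with the number of blobs; on real data the peak is often less clear and is best read alongside the elbow plot.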

## Elbow Method Example

# Elbow Method to find the optimal number of clusters

wcss = []

for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

plt.figure(figsize=(8,6))
plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.title('Elbow Method')
plt.show()


## Evaluation Metrics

- Within-Cluster Sum of Squares (WCSS): Measures the compactness of the clusters as the sum of squared distances from each point to its assigned centroid. Lower WCSS indicates more compact clusters, but WCSS generally keeps decreasing as \( k \) grows, so it is compared across values of \( k \) rather than minimized outright.

To use it, fit the clustering algorithm (e.g., k-Means) to your data for a range of \( k \) values and calculate the WCSS for each one. Plot WCSS against the number of clusters and look for the “elbow” point in the plot.

Reference: https://www.analyticsvidhya.com/blog/2021/01/in-depth-intuition-of-k-means-clustering-algorithm-in-machine-learning/
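To make the WCSS definition concrete, the sketch below computes it by hand and compares the result with the `inertia_` attribute that scikit-learn's KMeans exposes. The variable names `assigned_centroids` and `wcss_manual` are made up for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Same synthetic data as in the implementation example
np.random.seed(0)
X = np.vstack((np.random.normal(0, 1, (100, 2)),
               np.random.normal(5, 1, (100, 2)),
               np.random.normal(-5, 1, (100, 2))))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Look up the centroid of the cluster each point was assigned to
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
# Sum of squared distances from each point to its own centroid
wcss_manual = np.sum((X - assigned_centroids) ** 2)

print("manual WCSS:", wcss_manual)
print("inertia_   :", kmeans.inertia_)
```

The two numbers should agree up to floating-point error, which confirms that `inertia_` is the WCSS used in the elbow plot.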

- Silhouette Score: Measures the separation between clusters. Values range from -1 to 1, with higher values indicating better-defined clusters.

References:

1. https://dzone.com/articles/kmeans-silhouette-score-explained-with-python-exam
2. https://vitalflux.com/kmeans-silhouette-score-explained-with-python-example/

#### Applications

k-Means clustering is widely used in:

- Market Segmentation: Grouping customers based on purchasing behavior.

- Image Compression: Reducing the number of colors in an image.

- Anomaly Detection: Identifying outliers in a dataset.
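As one illustration of the applications above, anomaly detection can be sketched by measuring how far each point lies from its nearest centroid and flagging the largest distances. Everything specific here is an assumption for the example: the injected `outliers`, the 99th-percentile cutoff, and the idea that distance to the nearest centroid is a good enough anomaly score.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 300 ordinary points in three blobs, plus three hand-placed anomalies (hypothetical)
inliers = np.vstack([rng.normal(c, 1.0, size=(100, 2)) for c in (0, 5, -5)])
outliers = np.array([[10.0, -10.0], [-10.0, 10.0], [12.0, 0.0]])
X_all = np.vstack([inliers, outliers])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_all)

# transform() returns the distance from each point to every centroid;
# the row minimum is the distance to the nearest centroid.
nearest_dist = kmeans.transform(X_all).min(axis=1)

# Flag points whose distance exceeds the 99th percentile (an arbitrary cutoff)
threshold = np.percentile(nearest_dist, 99)
flagged = np.where(nearest_dist > threshold)[0]
print("flagged indices:", flagged)
```

The injected anomalies occupy indices 300 to 302 of `X_all`. A percentile cutoff always flags a fixed share of points, so on real data the threshold needs to be chosen with domain knowledge.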

k-Means is efficient and easy to implement but can be sensitive to the initial placement of centroids and the choice of \( k \). It works well for spherical clusters but may struggle with non-spherical or overlapping clusters.
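The difficulty with non-spherical clusters can be seen in a small experiment. The sketch below compares k-Means on three round blobs with k-Means on the two interleaved half-circles produced by scikit-learn's `make_moons`. Using the adjusted Rand index to measure agreement with the known grouping is a choice made for this example, not something taken from the discussion above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Round, well-separated blobs with known group membership
rng = np.random.default_rng(0)
X_blobs = np.vstack([rng.normal(c, 1.0, size=(100, 2)) for c in (0, 5, -5)])
y_blobs = np.repeat([0, 1, 2], 100)
labels_blobs = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_blobs)
ari_blobs = adjusted_rand_score(y_blobs, labels_blobs)

# Two interleaved crescents: clearly two groups, but not spherical
X_moons, y_moons = make_moons(n_samples=300, noise=0.05, random_state=0)
labels_moons = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moons)
ari_moons = adjusted_rand_score(y_moons, labels_moons)

# An adjusted Rand index of 1.0 means perfect agreement with the true grouping
print(f"blobs: {ari_blobs:.2f}")
print(f"moons: {ari_moons:.2f}")
```

On the moons data k-Means tends to cut straight across both crescents, so its agreement with the true grouping is much lower than on the blobs. Density-based or spectral methods are common alternatives for shapes like this.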


Credits: t.me/datasciencefun
