Hierarchical clustering is an unsupervised learning algorithm that builds a hierarchy of clusters. It produces a tree of clusters called a dendrogram, which can then be cut at a chosen level to form flat clusters. There are two main types of hierarchical clustering:
1. Agglomerative Hierarchical Clustering (Bottom-Up):
- Starts with each data point as a single cluster.
- Iteratively merges the closest pairs of clusters until all points are in a single cluster or the desired number of clusters is reached.
2. Divisive Hierarchical Clustering (Top-Down):
- Starts with all data points in a single cluster.
- Iteratively splits the most heterogeneous cluster until each data point is in its own cluster or the desired number of clusters is reached.
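In practice, most common libraries implement only the agglomerative variant (divisive clustering is rarely available off the shelf). As a minimal sketch, assuming scikit-learn is installed and using toy data chosen for illustration, the bottom-up merging can be run directly:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy data: two well-separated groups of three points each
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Bottom-up: start from six singleton clusters and merge until two remain
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)
print(labels)  # e.g. [1 1 1 0 0 0] (label numbering is arbitrary)
```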
## Linkage Criteria
The choice of how to measure the distance between clusters affects the structure of the dendrogram:
- Single Linkage: Minimum distance between points in two clusters.
- Complete Linkage: Maximum distance between points in two clusters.
- Average Linkage: Average distance between points in two clusters.
- Ward's Method: Minimizes the variance within clusters.
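To see how the criterion changes the resulting merges, here is a small sketch (toy data chosen arbitrarily) using scipy's `linkage` function, whose `method` argument selects the criterion:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Toy data: a tight triangle and a separate pair in 2D
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [5.0, 5.0], [5.0, 6.0]])

# Each call returns an (n-1) x 4 linkage matrix; column 2 holds the
# distance at which each merge happened
for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)
    print(f"{method:>8}: merge distances = {np.round(Z[:, 2], 2)}")
```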
## Implementation Example
Suppose we have a dataset with points in 2D space, and we want to cluster them using hierarchical clustering.
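A minimal sketch of such an example follows (the cluster centers, sample sizes, and cut threshold are illustrative choices, not prescribed values); the numbered comments match the step-by-step explanation in the next section:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# 1. Generate a synthetic 2D dataset: three Gaussian blobs of 30 points each
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(30, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(30, 2)),
    rng.normal(loc=[0, 5], scale=0.5, size=(30, 2)),
])

# 2. Hierarchical clustering with Ward's method
Z = linkage(X, method="ward")

# 3. Plot the dendrogram to visualize the merge hierarchy
plt.figure(figsize=(10, 4))
dendrogram(Z)
plt.title("Dendrogram (Ward's method)")
plt.xlabel("Data point index")
plt.ylabel("Merge distance")
plt.show()

# 4. Cut the dendrogram at a distance threshold to form flat clusters
#    (a threshold of 10 is an illustrative choice for this data)
labels = fcluster(Z, t=10, criterion="distance")

# 5. Scatter plot of the points, colored by assigned cluster
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis")
plt.title("Clusters from cutting the dendrogram")
plt.show()
```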
## Explanation of the Code
1. Importing Libraries: We import numpy for data generation, matplotlib for plotting, and scipy.cluster.hierarchy for the clustering functions.
2. Data Preparation: We generate a synthetic dataset with three clusters using normal distributions.
3. Linkage: We use the linkage function from scipy.cluster.hierarchy to perform hierarchical clustering with Ward's method.
4. Dendrogram: We plot the dendrogram using the dendrogram function to visualize the hierarchical structure.
5. Cutting the Dendrogram: We cut the dendrogram at a specific threshold to form clusters using the fcluster function.
6. Plotting Clusters: We create a scatter plot of the data points, with colors indicating the assigned clusters.
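One optional sanity check (not part of the walkthrough above, but a common companion step) is the cophenetic correlation coefficient, which measures how faithfully the dendrogram preserves the original pairwise distances; values close to 1 indicate a good fit:

```python
from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist

# Continuing from the example above: Z is the linkage matrix, X the data
c, _ = cophenet(Z, pdist(X))
print(f"Cophenetic correlation: {c:.3f}")
```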
### Choosing the Number of Clusters
The dendrogram helps visualize the hierarchy of clusters. The choice of where to cut the dendrogram (i.e., selecting a threshold distance) determines the number of clusters. This choice can be subjective, but some guidelines include:
- Elbow Method: Similar to k-means, look for an "elbow" where the distance between successive merges jumps sharply; cutting just below that jump gives a natural cluster count (see the sketch after this list).
- Maximum Distance: Choose a distance threshold that balances the number of clusters and the compactness of clusters.
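A rough way to automate the elbow heuristic (a sketch assuming `Z` and `X` come from the implementation example above) is to find the largest jump between consecutive merge distances in the linkage matrix:

```python
import numpy as np

# Column 2 of the linkage matrix holds the merge distances, in order
merge_heights = Z[:, 2]
gaps = np.diff(merge_heights)

# After merge i (0-indexed), n - (i + 1) clusters remain; cutting just
# below the largest jump suggests a natural number of clusters
i = int(np.argmax(gaps))
k = len(X) - (i + 1)
print(f"Suggested number of clusters: {k}")
```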
## Applications
Hierarchical clustering is widely used in:
- Gene Expression Data: Grouping similar genes or samples in bioinformatics.
- Document Clustering: Organizing documents into a hierarchical structure.
- Image Segmentation: Dividing an image into regions based on pixel similarity.
Credits: t.me/datasciencefun
Cracking the Data Science Interview (https://topmate.io/analyst/1024129)
ENJOY LEARNING 👍👍