Unsupervised Learning: Clustering and Dimensionality Reduction in Python

Unsupervised learning is a type of machine learning where the model is not provided with labeled data. The model learns the underlying structure and patterns in the data without any specific guidance on what to look for. Clustering and Dimensionality Reduction are two important techniques in unsupervised learning.

Clustering

Clustering is a technique where the model tries to identify groups in the data based on their similarities. The objective is to group similar data points together and separate dissimilar data points. Clustering algorithms can be used for a variety of applications such as customer segmentation, anomaly detection, and image segmentation.

Dimensionality Reduction

Dimensionality reduction is a technique where the model tries to reduce the number of features in the data while retaining as much information as possible. This is useful when dealing with high-dimensional data where it’s difficult to visualize and analyze the data. Dimensionality reduction algorithms can be used for a variety of applications such as data compression, feature extraction, and visualization.

Clustering Algorithms

There are several clustering algorithms in machine learning, each with its own strengths and weaknesses. In this tutorial, we will cover two popular clustering algorithms: K-Means Clustering and Hierarchical Clustering.

K-Means Clustering

K-Means Clustering is a simple and efficient clustering algorithm that partitions the data into K clusters, where K is specified by the user. The algorithm starts by choosing K initial centroids, for example by selecting K data points at random (Scikit-Learn’s KMeans defaults to the k-means++ initialization, which spreads the initial centroids apart). Each data point is then assigned to its nearest centroid, and each centroid is updated to the mean of the data points assigned to it. These two steps are repeated until the centroids stop moving, at which point the algorithm has converged.
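To make the assign-and-update loop concrete, here is a minimal NumPy sketch of the steps described above. The helper name kmeans_sketch and its parameters are hypothetical and chosen for illustration; this is not how Scikit-Learn implements K-Means internally.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Illustrative K-Means loop (hypothetical helper, not Scikit-Learn's code)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

X = np.random.default_rng(42).random((100, 2))
centroids, labels = kmeans_sketch(X, k=2)
print(centroids.shape)  # (2, 2)
```

The result depends on the initial centroids, which is why library implementations typically run the algorithm several times from different starting points and keep the best run.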

Let’s see how to implement K-Means Clustering in Python using Scikit-Learn.

from sklearn.cluster import KMeans
import numpy as np

# Generate random data
X = np.random.rand(100, 2)
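# Initialize KMeans model with 2 clusters
# (n_init and random_state are set so that repeated runs give the same result)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)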
# Fit the model to the data
kmeans.fit(X)
# Predict the clusters for the data
y_pred = kmeans.predict(X)
# Print the centroids of the clusters
print(kmeans.cluster_centers_)

In this example, we generate 100 random data points with 2 features, initialize the KMeans model with 2 clusters, and fit the model to the data. We then predict the cluster for each data point and print the centroids of the clusters.

Hierarchical Clustering

Hierarchical Clustering is a clustering algorithm that builds a hierarchy of nested clusters instead of a single flat partition. The hierarchy can be built bottom-up, by repeatedly merging the closest clusters, or top-down, by repeatedly splitting clusters.

There are two types of hierarchical clustering algorithms: Agglomerative and Divisive. Agglomerative clustering starts with each data point as a separate cluster and iteratively merges the closest clusters. Divisive clustering starts with all data points in a single cluster and iteratively splits the cluster into smaller clusters.

Let’s see how to implement Agglomerative Hierarchical Clustering in Python using Scikit-Learn.

from sklearn.cluster import AgglomerativeClustering
import numpy as np

# Generate random data
X = np.random.rand(100, 2)
# Initialize AgglomerativeClustering model with 2 clusters
agg_clustering = AgglomerativeClustering(n_clusters=2)
# Fit the model to the data
agg_clustering.fit(X)
# Read the cluster label assigned to each data point during fitting
y_pred = agg_clustering.labels_
# Print the cluster label of each data point
print(y_pred)

In this example, we generate 100 random data points with 2 features, initialize the AgglomerativeClustering model with 2 clusters, and fit the model to the data. AgglomerativeClustering has no predict method, so we read the cluster label of each data point from the labels_ attribute and print it.
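AgglomerativeClustering returns only the final flat labels. To look at the hierarchy of merges itself, one option is SciPy’s scipy.cluster.hierarchy module. The sketch below assumes SciPy is installed and uses Ward linkage, which is also the default linkage of AgglomerativeClustering; the choice of 2 clusters simply mirrors the example above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.random((100, 2))

# Build the full merge tree; each row of Z records one merge:
# (cluster index a, cluster index b, merge distance, size of the new cluster)
Z = linkage(X, method="ward")
print(Z.shape)  # (99, 4): n - 1 merges for n points

# Cut the tree so that at most 2 flat clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print(np.unique(labels))
```

The last row of Z is the final merge, where all 100 points end up in a single cluster. Cutting the same tree with a different value of t gives a different number of clusters without recomputing the merges.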

Divisive Hierarchical Clustering

Divisive Hierarchical Clustering works top-down. It starts with all data points in a single cluster and iteratively splits clusters into smaller ones based on their dissimilarity, until each data point is in its own cluster or a stopping criterion, such as a target number of clusters, is met.

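Divisive Hierarchical Clustering is less popular than Agglomerative Hierarchical Clustering, largely because it is computationally expensive: a cluster of n points can be split into two non-empty groups in 2^(n-1) - 1 ways, so practical implementations rely on heuristics to choose each split. Depending on the heuristic, the resulting clusters can also be imbalanced.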
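As a rough illustration of the top-down idea, the sketch below repeatedly splits the largest cluster in two using 2-means. This is only one possible splitting heuristic, and the helper name bisecting_sketch is hypothetical. Recent Scikit-Learn releases also include a BisectingKMeans estimator built on a similar idea.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_sketch(X, n_clusters, seed=0):
    """Top-down clustering sketch: repeatedly split the largest cluster in two."""
    labels = np.zeros(len(X), dtype=int)  # every point starts in cluster 0
    for new_label in range(1, n_clusters):
        # Choose the cluster with the most points as the one to split
        largest = np.bincount(labels).argmax()
        members = np.flatnonzero(labels == largest)
        # Split it into two parts with 2-means (an assumed heuristic)
        halves = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X[members])
        # Points in the second half receive a new cluster label
        labels[members[halves == 1]] = new_label
    return labels

X = np.random.default_rng(0).random((100, 2))
labels = bisecting_sketch(X, n_clusters=4)
print(np.bincount(labels))  # number of points in each of the 4 clusters
```

Stopping at a target number of clusters, as done here, avoids continuing the splits all the way down to single points.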

Dimensionality Reduction Algorithms

There are several dimensionality reduction algorithms in machine learning, each with its own strengths and weaknesses. In this tutorial, we will cover two popular dimensionality reduction algorithms: Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE).

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a linear dimensionality reduction technique that finds the orthogonal directions of maximum variance in the data, called principal components. Projecting the data onto the first few principal components gives a lower-dimensional representation that retains as much of the variance as possible. PCA is useful when dealing with high-dimensional data where it’s difficult to visualize and analyze the data.
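To show what "orthogonal directions of maximum variance" means in practice, here is a minimal NumPy sketch that computes the directions with a singular value decomposition of the mean-centered data. It is an illustration under simple assumptions (random data, 2 components kept), not a description of Scikit-Learn’s internals, and the signs of the directions may differ from those a library returns.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 10))

# Center each feature at zero; PCA is defined on mean-centered data
X_centered = X - X.mean(axis=0)

# The rows of Vt are orthogonal directions ordered by decreasing variance
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
components = Vt[:2]

# Project the data onto the top 2 directions
X_projected = X_centered @ components.T
print(X_projected.shape)  # (100, 2)

# Fraction of the total variance captured by each kept direction
explained_ratio = (S[:2] ** 2) / np.sum(S ** 2)
print(explained_ratio)
```

The explained variance ratio gives a rough way to judge how much is lost by keeping only 2 components. For uniformly random data like this, the ratio is low because no direction dominates.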

Let’s see how to implement PCA in Python using Scikit-Learn.

from sklearn.decomposition import PCA
import numpy as np

# Generate random data
X = np.random.rand(100, 10)
# Initialize PCA model with 2 components
pca = PCA(n_components=2)
# Fit the model to the data
pca.fit(X)
# Transform the data to 2 dimensions
X_transformed = pca.transform(X)
# Print the shape of the transformed data
print(X_transformed.shape)

In this example, we generate 100 random data points with 10 features, initialize the PCA model with 2 components, and fit the model to the data. We then transform the data to 2 dimensions and print the shape of the transformed data, which is (100, 2).

t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear dimensionality reduction technique that tries to preserve local neighborhoods: data points that are close together in the high-dimensional space should remain close together in the lower-dimensional representation. It does not preserve pairwise distances in general, so the distances between far-apart points or groups in a t-SNE plot should not be over-interpreted. t-SNE is mainly used to visualize high-dimensional data in 2 or 3 dimensions.

Let’s see how to implement t-SNE in Python using Scikit-Learn.

from sklearn.manifold import TSNE
import numpy as np

# Generate random data
X = np.random.rand(100, 10)
# Initialize t-SNE model with 2 components
tsne = TSNE(n_components=2)
# Fit the model and embed the data in 2 dimensions in a single step
X_transformed = tsne.fit_transform(X)
# Print the shape of the transformed data
print(X_transformed.shape)

In this example, we generate 100 random data points with 10 features and initialize the t-SNE model with 2 components. Unlike PCA, Scikit-Learn’s TSNE has no separate transform method, so fit_transform fits the model and returns the 2-dimensional embedding in a single step. We then print the shape of the transformed data.
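The random data in the example above has no group structure, so the embedding has little to show. The sketch below uses synthetic data with two well-separated groups instead, and sets two parameters worth knowing: perplexity, which roughly controls how many neighbors each point considers, and random_state, which makes the stochastic result repeatable. The group centers and the perplexity value are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two well-separated groups of 50 points each in 10 dimensions (synthetic data)
group_a = rng.normal(loc=0.0, scale=1.0, size=(50, 10))
group_b = rng.normal(loc=10.0, scale=1.0, size=(50, 10))
X = np.vstack([group_a, group_b])

# perplexity must be smaller than the number of samples;
# random_state makes the stochastic embedding repeatable
tsne = TSNE(n_components=2, perplexity=20, init="pca", random_state=0)
X_embedded = tsne.fit_transform(X)
print(X_embedded.shape)  # (100, 2)
```

Plotting X_embedded with the first 50 points in one color and the last 50 in another is a quick way to check whether the two groups stay apart in the embedding.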

In this tutorial, we covered two important techniques in unsupervised learning: Clustering and Dimensionality Reduction. For each technique, we looked at two popular algorithms, K-Means Clustering and Hierarchical Clustering for Clustering, and PCA and t-SNE for Dimensionality Reduction, with code examples in Python using Scikit-Learn.

I hope you found this tutorial useful in understanding Unsupervised Learning. To learn more about Machine Learning, please consider checking out my book: Unsupervised Learning: Clustering and Dimensionality Reduction (https://a.co/d/3AQdFnG)

LyronFoster

Lyron Foster is a Hawai’i based African American Author, Musician, Actor, Blogger, Philanthropist and Multinational Serial Tech Entrepreneur.