Hierarchical Agglomerative Clustering for Product Grouping

Hierarchical Agglomerative Clustering for Product Grouping is an application that brings order to your product register, so that products can be browsed and selected by their variants.

Imagine you have a bunch of items, and you want to group them based on how similar they are to each other. That’s where Hierarchical Agglomerative Clustering (HAC) comes into play. It’s like organizing a messy room by putting similar things together!

So, how does HAC work? Well, it starts by considering each item as its own cluster. Then, it repeatedly merges the two most similar clusters until there’s only one big cluster left. It’s like tidying a room by repeatedly grouping the two most similar things together until everything’s sorted.
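To make the loop above concrete, here is a toy sketch in plain Python (not the library code used later in this post): every point starts as its own cluster, and the two closest clusters are merged until one remains. The 1-D points and the single-linkage distance are made up purely for illustration.

```python
# Toy sketch of the agglomerative loop on a few 1-D points.
points = [1.0, 1.2, 5.0, 5.1, 9.0]
clusters = [[p] for p in points]  # start: every point is its own cluster
merges = []

while len(clusters) > 1:
    # find the two clusters whose closest members are nearest (single linkage)
    d, i, j = min(
        (min(abs(a - b) for a in ci for b in cj), i, j)
        for i, ci in enumerate(clusters)
        for j, cj in enumerate(clusters)
        if i < j
    )
    merges.append((clusters[i], clusters[j], d))
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]

for left, right, dist in merges:
    print(f"merged {left} + {right} at distance {dist:.2f}")
```

The recorded merge order (closest pairs first) is exactly the information a dendrogram displays.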

One cool thing about HAC is that it doesn’t force you to fix the number of clusters up front; you can decide later by choosing where to cut the resulting tree. What you do have to define is similarity. For example, if you’re sorting animals, you might say cats and dogs are more similar than cats and fish. In HAC, this similarity is often measured using metrics like Euclidean distance or cosine similarity.
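The two metrics mentioned above behave differently, which matters for text data. A tiny NumPy comparison on made-up vectors (in the product code later, the vectors are TF-IDF rows):

```python
import numpy as np

# Two ways to quantify "similarity" between feature vectors (toy values).
a = np.array([1.0, 0.0, 1.0])
b = np.array([2.0, 0.0, 2.0])

# Euclidean distance: 0 means identical points
euclidean = np.linalg.norm(a - b)

# Cosine similarity: 1 means same direction, regardless of vector length
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean)  # ~1.414: the points are apart in space
print(cosine)     # 1.0: same direction despite different lengths
```

Because cosine similarity ignores vector length, it is a common choice for TF-IDF text vectors, where document length shouldn’t dominate the comparison.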

There are different ways HAC decides which clusters to merge. One popular method is called “average linkage.” It calculates the average similarity between all pairs of items in the two clusters being considered for merging. If that average is high, it means the clusters are pretty similar, so they get merged.
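Average linkage can be computed by hand on two tiny clusters; the 2-D points below are invented just to show the calculation:

```python
import numpy as np

# Two toy clusters of 2-D points
A = np.array([[0.0, 0.0], [0.0, 1.0]])
B = np.array([[4.0, 0.0], [4.0, 1.0]])

# Euclidean distance from every point in A to every point in B
pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

# Average linkage: the mean of all cross-cluster distances
avg_linkage = pairwise.mean()
print(avg_linkage)  # ~4.06
```

At each step, HAC with average linkage merges the pair of clusters with the smallest such average.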

Once HAC is done merging clusters, it gives you a tree-like structure called a dendrogram. It’s like a family tree showing how all the items are related. You can use this dendrogram to decide how many clusters you want by looking at where it makes the most sense to cut the tree. And just like that, you’ve got yourself a neatly organized room—or dataset!
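"Cutting the tree" can be done programmatically. As a sketch (using SciPy’s hierarchy functions rather than the scikit-learn model used below, and toy 1-D data), `fcluster` turns a linkage matrix into flat cluster labels at a chosen cut height:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two obvious groups
X = np.array([[0.0], [0.1], [5.0], [5.1]])

Z = linkage(X, method="average")  # build the dendrogram bottom-up
labels = fcluster(Z, t=1.0, criterion="distance")  # cut where merges exceed 1.0

print(labels)  # two groups: the close pair and the far pair
```

Raising `t` cuts the tree higher and yields fewer, larger clusters; lowering it yields more, tighter ones.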

Clustering is the method of partitioning a given dataset into groups in such a way that items within a cluster have greater similarity and items in different clusters are more dissimilar [2]. Cluster analysis has important functions in data mining and related fields such as pattern recognition, pattern classification, knowledge discovery, vector quantization, and data compression. The role of clustering is also indispensable in marketing, physics, biology, geography, and geology [19]. Clustering algorithms are generally categorized into two groups: hierarchical clustering and non-hierarchical clustering [3] [18].

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity

prod_names = pd.read_excel("product_dataset.xlsx", sheet_name="product_names")
print(prod_names.info())
# keep the base product name before the " - " variant suffix
prod_names['name1'] = prod_names['name'].apply(lambda x: x.split(' - ')[0])

documents = prod_names['name1']

# TF-IDF vectorization
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

# Pairwise cosine similarity (handy for inspecting how similar the
# names are; not consumed by the clustering call below)
cos_sim = cosine_similarity(X)

# Hierarchical Agglomerative Clustering: instead of fixing the number
# of clusters, cut the tree where the merge distance exceeds 0.25
clustering = AgglomerativeClustering(
    n_clusters=None,
    metric='cosine',  # named 'affinity' in scikit-learn < 1.2
    linkage='average',
    distance_threshold=0.25)
clustering.fit(X.toarray())  # AgglomerativeClustering needs a dense array

# Print cluster labels
print("Cluster labels:")
print(clustering.labels_)

prod_names['cluster'] = clustering.labels_
prod_names.sort_values('cluster').to_excel('xx.xlsx')

Here is the code to plot a dendrogram from the fitted model (adapted from the scikit-learn documentation example).

from scipy.cluster.hierarchy import dendrogram
import matplotlib.pyplot as plt

def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram

    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)



plt.title("Hierarchical Clustering Dendrogram")
# plot the top p levels of the dendrogram
plot_dendrogram(clustering, truncate_mode="level", p=6)
plt.xlabel("Number of points in node (or index of point if no parenthesis).")
plt.show()

