An overview of BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies

Jafeel V
4 min read · Feb 16, 2022

Introduction:

Clustering is the process of arranging data into classes or clusters so that objects within a cluster have a high resemblance to one another but are substantially different from objects in other clusters. Hierarchical and partitioning methods are the two basic types of clustering methods. In hierarchical clustering, data objects are grouped into clusters, which are then grouped into larger clusters, and so on, producing a hierarchy of clusters.

Traditional hierarchical clustering cannot adjust earlier decisions: once a merge or split has been made, it is never undone, and objects are not swapped between clusters. As a result, an incorrect merge or split decision at any point can lead to low-quality clusters. Moreover, most existing hierarchical clustering algorithms do not consider the case where the dataset is too large to fit in main memory, and they do not try to minimize the number of scans of the dataset.

Integrating hierarchical clustering with other techniques in a multiple-phase clustering process is one way to improve the clustering quality of hierarchical algorithms. BIRCH is one such algorithm, developed by Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH stands for Balanced Iterative Reducing and Clustering using Hierarchies.

Explanation:

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is an unsupervised data mining algorithm that clusters large volumes of numerical data using an agglomerative approach. Agglomerative hierarchical clustering is a bottom-up method in which each object starts in its own cluster and clusters are successively merged into larger ones. Although BIRCH overcomes the scalability problems of earlier hierarchical methods, it can only process metric attributes (an attribute whose values can be represented in Euclidean space), which is one disadvantage of the algorithm.

The BIRCH clustering algorithm introduces two concepts: the clustering feature (CF) and the clustering feature tree (CF tree). The CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering.

A clustering feature (CF) is a three-dimensional vector summarizing information about a cluster of objects. It is an ordered triple (N, LS, SS), where N is the number of data points in the cluster, LS is the linear sum of the data points, and SS is the squared sum of the data points in the cluster.

For example, suppose that there are three points, (2,5), (3,2), and (4,3), in a cluster C. The clustering feature of C is CF = (3, (2+3+4, 5+2+3), (2²+3²+4², 5²+2²+3²)) = (3, (9,10), (29,38)).
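To make the bookkeeping concrete, here is a minimal NumPy sketch (the points are just the example above; the variable names are my own) that computes the same CF triple:

```python
import numpy as np

# The three points of cluster C from the example above.
points = np.array([[2, 5], [3, 2], [4, 3]])

N = len(points)                  # number of data points
LS = points.sum(axis=0)          # linear sum per dimension
SS = (points ** 2).sum(axis=0)   # squared sum per dimension

print(N, LS, SS)                 # 3 [ 9 10] [29 38]
```

Because each component is a plain sum, CFs are additive: the CF of two merged clusters is simply the element-wise sum of their CFs, which is what lets BIRCH maintain them incrementally as points stream in.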

CF Tree:

A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering, as shown in the figure below, and is used to summarize cluster representations. Internal (non-leaf) nodes store the sums of the CFs of their children. A CF tree has two parameters: the branching factor B and the threshold T.

Image credits: from notes by T. Zhang, R. Ramakrishnan, and M. Livny

The branching factor specifies the maximum number of children per non-leaf node, whereas the threshold specifies the maximum diameter of the subclusters stored at the leaf nodes.

BIRCH Clustering Algorithm

Phase 1: Load data into memory - Scan the database and build an initial CF tree in memory. If memory is exhausted, rebuild a smaller tree from the existing leaf entries.

Phase 2: Condense data (optional) - Shrink the data set by building a smaller CF tree and removing more outliers.

Phase 3: Global clustering - Apply an existing clustering algorithm (e.g., k-means) to the leaf CF entries.

Phase 4: Cluster refining (optional) - Perform additional passes over the data to fix the problem that, in a CF tree, identical data points may be assigned to different leaf entries.
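If the goal is simply to apply BIRCH rather than implement it, scikit-learn's Birch estimator exposes the same knobs. The snippet below is a minimal sketch (the blob data and parameter values are arbitrary choices of mine): the CF tree is built during fit, and n_clusters drives the global clustering phase.

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Synthetic numeric data; BIRCH handles only metric attributes.
X, _ = make_blobs(n_samples=10_000, centers=5, random_state=0)

model = Birch(
    threshold=0.5,        # upper bound on the radius of a leaf subcluster (the T parameter)
    branching_factor=50,  # maximum entries per node before it is split (the B parameter)
    n_clusters=5,         # global clustering applied to the leaf CF entries
)
labels = model.fit_predict(X)

print(model.subcluster_centers_.shape)  # centroids of the leaf subclusters
```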

Building the CF Tree

  • Start at the root.
  • Find the CF entry in the root closest to the data point, then descend to that child and repeat the process until the closest leaf entry is found.
  • At the leaf:

→ If the point can be absorbed by the chosen entry, update that entry's CF.

→ If absorbing the point would push the radius (R) of the selected entry past the threshold T, create a new leaf entry; if the leaf then holds more than L entries, split the leaf. If the parent node is also full, split it as well, and so on up the tree.

  • Update the CF entries on the path from the root to the leaf to reflect the inserted point. (A simplified sketch of this insertion logic follows below.)
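The following is a heavily simplified sketch of the insertion step. It is not the full BIRCH tree: it assumes a single flat list of leaf entries (no internal nodes, no splitting) and an arbitrary threshold T, just to show how the CF alone is enough to decide whether a point can be absorbed.

```python
import numpy as np

T = 1.5  # threshold on the subcluster radius (arbitrary value for this sketch)


class CFEntry:
    """Clustering feature (N, LS, SS) for a single leaf entry."""

    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.n, self.ls, self.ss = 1, p.copy(), p ** 2

    def centroid(self):
        return self.ls / self.n

    def radius_if_added(self, p):
        # Radius after tentatively absorbing p, computed from the CF alone:
        # R^2 = SS/N - ||LS/N||^2
        n, ls, ss = self.n + 1, self.ls + p, self.ss + p ** 2
        r2 = ss.sum() / n - np.dot(ls / n, ls / n)
        return np.sqrt(max(r2, 0.0))

    def add(self, p):
        self.n, self.ls, self.ss = self.n + 1, self.ls + p, self.ss + p ** 2


def insert(leaf_entries, point):
    """Absorb the point into the closest entry if its radius stays under T;
    otherwise open a new entry (a real CF tree might split the leaf here)."""
    p = np.asarray(point, dtype=float)
    if leaf_entries:
        closest = min(leaf_entries,
                      key=lambda e: np.linalg.norm(e.centroid() - p))
        if closest.radius_if_added(p) <= T:
            closest.add(p)
            return
    leaf_entries.append(CFEntry(p))


entries = []
for pt in [(2, 5), (3, 2), (4, 3), (20, 20)]:
    insert(entries, pt)

print(len(entries), [e.centroid().tolist() for e in entries])
```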

Here X0 is the centroid of a cluster and the Xi are the data points in the cluster. R (the radius) is the average distance from member points to the centroid, whereas D (the diameter) is the average pairwise distance within the cluster.

The distance between two subclusters (for example, between leaf entries) is measured with D0 or D1, where D0 is the Euclidean distance between the centroids and D1 is the Manhattan distance between the centroids.
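Written out explicitly (these are the standard definitions from the BIRCH paper, restated here in place of the original figures):

```latex
X_0 = \frac{1}{N}\sum_{i=1}^{N} X_i
\qquad
R = \left( \frac{\sum_{i=1}^{N} \lVert X_i - X_0 \rVert^2}{N} \right)^{1/2}
\qquad
D = \left( \frac{\sum_{i=1}^{N}\sum_{j=1}^{N} \lVert X_i - X_j \rVert^2}{N(N-1)} \right)^{1/2}

% Distances between two clusters with centroids X_{0_1} and X_{0_2}:
D0 = \lVert X_{0_1} - X_{0_2} \rVert
\qquad
D1 = \sum_{d} \bigl| X_{0_1}^{(d)} - X_{0_2}^{(d)} \bigr|
```

All of these quantities can be computed directly from the stored CF triples, so the raw data points never need to be revisited during tree maintenance.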

Advantages:

  • With just one scan, it finds a good clustering and enhances the quality with a few more scans.
  • Works with very large datasets.
  • Superior to other algorithms in terms of stability and scalability.

Disadvantages:

  • Handles only numeric data.

Applications:

  • Pixel classification in images.
  • Image compression.
