Evaluation of Clustering Algorithms

Software clustering researchers have developed several evaluation methods for software clustering algorithms. This research is important for the following reasons:

  1. Most software clustering work is evaluated through case studies, which are inherently subjective. Objective evaluation techniques are needed.
  2. Evaluation helps discover the strengths and weaknesses of the various software clustering algorithms, enabling the development of better algorithms that address the discovered weaknesses.
  3. Evaluation can help indicate the types of system that are suitable for a particular algorithm. For instance, Mitchell et al., in "CRAFT: A Framework for Evaluating Software Clustering Results in the Absence of Benchmark Decompositions", argue that Bunch may not be suitable for event-driven systems.

The importance of evaluating software clustering algorithms was first stated in 1995 by Lakhotia and Gravely in "Toward experimental evaluation of subsystem classification recovery techniques". Since then, many approaches to this problem have been published in the literature. These can be divided into two categories:

  1. Based on an authoritative decomposition
  2. Not based on an authoritative decomposition

Evaluation based on authoritative decomposition

The main principle of evaluation based on an authoritative decomposition is that a clustering produced by an algorithm should resemble the clustering produced by some authority. Therefore, such evaluation methods provide the means to calculate the quality of an automatic decomposition by comparing it to the authoritative one.

The evaluation methods can be divided into two categories. The first category evaluates a software clustering approach by comparing the automatic decomposition with the authoritative one. A typical example of such a method is the MoJo distance, introduced in "MoJo: A Distance Metric for Software Clusterings". Most of the evaluation methods that have been developed belong to this category.
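The MoJo distance counts the minimum number of Move and Join operations needed to transform one decomposition into the other. The sketch below is a deliberate simplification, not the actual MoJo algorithm: it only estimates an upper bound on the Move operations by keeping, for each automatic cluster, the plurality of entities that already share an authoritative cluster.

```python
from collections import Counter

def move_upper_bound(auto, authoritative):
    """Rough upper bound on the Move operations needed to turn the
    automatic decomposition into the authoritative one.  For each
    automatic cluster, the plurality of entities that already share an
    authoritative cluster stay put; every other entity must move."""
    # Map each entity to the index of its authoritative cluster.
    where = {e: i for i, cluster in enumerate(authoritative) for e in cluster}
    moves = 0
    for cluster in auto:
        tags = Counter(where[e] for e in cluster)
        moves += len(cluster) - max(tags.values())
    return moves

# Example: only "c" needs to move for the two decompositions to line up.
auto = [{"a", "b", "c"}, {"d", "e"}]
authoritative = [{"a", "b"}, {"c", "d", "e"}]
print(move_upper_bound(auto, authoritative))  # 1
```

The real MoJo metric also accounts for Join operations and computes the true minimum; this sketch merely illustrates the flavor of the comparison.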

Evaluation methods of the second category focus on the evaluation of a specific stage of the software clustering process. Such a method calculates the quality of a specific stage based on an analysis of its inputs and outputs. For instance, an evaluation method that focuses on the analysis phase will take into account the input meta-model, compare the authoritative decomposition with the produced software clustering decomposition, and then calculate a number that reflects the quality of the produced decomposition. A typical example of such a method is EdgeSim, presented in "Comparing the decompositions produced by software clustering algorithms using similarity measurements".
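EdgeSim scores two decompositions by how often they agree on whether a dependency edge is intra-cluster or inter-cluster. The following is a minimal, unweighted sketch of that idea, under the assumption that decompositions are given as lists of sets of entity names; the published measure operates on full module dependency graphs.

```python
def edge_sim(edges, decomp_a, decomp_b):
    """Fraction of dependency edges on which two decompositions agree:
    an edge counts as agreement when it is intra-cluster in both
    decompositions or inter-cluster in both."""
    def entity_to_cluster(decomp):
        return {e: i for i, cluster in enumerate(decomp) for e in cluster}
    a = entity_to_cluster(decomp_a)
    b = entity_to_cluster(decomp_b)
    agree = sum((a[u] == a[v]) == (b[u] == b[v]) for u, v in edges)
    return agree / len(edges)

edges = [("a", "b"), ("b", "c"), ("c", "d")]
decomp_a = [{"a", "b"}, {"c", "d"}]
decomp_b = [{"a", "b", "c"}, {"d"}]
print(edge_sim(edges, decomp_a, decomp_b))  # the two agree on 1 of 3 edges
```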

The main reason for the development of both types of evaluation methods is that the quality of a produced decomposition depends on:

  • The selection of an appropriate clustering algorithm – different clustering algorithms produce different outputs from the same factbase.
  • The selection of input parameters – a clustering algorithm might produce different outputs depending on the selected input parameters, such as the similarity function or the input meta-model.

An orthogonal categorization of software clustering evaluation methods divides them into two classes:

  • Evaluation of software clustering algorithms that produce flat decompositions
  • Evaluation of software clustering algorithms that produce nested decompositions

Evaluation methods not based on an authoritative decomposition

The main drawback of methods that require an authoritative decomposition is that they assume such a decomposition exists. Constructing one for even a medium-size software system is a challenging task. Various research studies deal with the construction of an authoritative decomposition. This work addresses two questions:

  • What is the right process for the construction of an authoritative decomposition?
  • Why is the constructed decomposition authoritative?

Several evaluation methods that evaluate a software clustering approach based solely on its output have been developed. Wu et al., in "Comparison of Clustering Algorithms in the Context of Software Evolution", suggest comparing software clustering algorithms based on two criteria:

  • Stability
  • Extremity of Cluster Distribution

The main criticism of both criteria is that they can only identify "bad" software decompositions. It is easy to develop an algorithm that produces decompositions that are stable and non-extreme, yet entirely useless for program understanding purposes. Several methods attempt to evaluate software clustering algorithms based on automatically constructed "authoritative" decompositions.

Stability

Stability reflects how sensitive a clustering approach is to perturbations of the input data. The intuition behind stability in software clustering is that similar clustering decompositions should be produced for similar versions of a software system. A stability function measures the percentage of changes between the decompositions produced for successive versions of an evolving software system. When the changes between consecutive versions are small, an algorithm should produce similar clusterings.
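One simple way to quantify this, sketched below under the assumption that decompositions are lists of sets of entity names, is the fraction of entity pairs (restricted to entities present in both versions) whose co-clustering status is unchanged. This is an illustrative measure, not a specific stability function from the literature.

```python
from itertools import combinations

def pairwise_stability(decomp_v1, decomp_v2):
    """Fraction of entity pairs, over entities common to both versions,
    that are either co-clustered in both decompositions or separated in
    both.  1.0 means perfectly stable; 0.0 means every pair changed."""
    def entity_to_cluster(decomp):
        return {e: i for i, cluster in enumerate(decomp) for e in cluster}
    w1 = entity_to_cluster(decomp_v1)
    w2 = entity_to_cluster(decomp_v2)
    common = sorted(set(w1) & set(w2))
    pairs = list(combinations(common, 2))
    unchanged = sum((w1[u] == w1[v]) == (w2[u] == w2[v]) for u, v in pairs)
    return unchanged / len(pairs) if pairs else 1.0

# Merging two clusters between versions changes 2 of the 3 entity pairs.
v1 = [{"a", "b"}, {"c"}]
v2 = [{"a", "b", "c"}]
print(pairwise_stability(v1, v2))  # ≈ 0.33
```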

Extremity of Cluster Distribution

An interesting property of a clustering algorithm is the size of the clusters it produces. The cluster size distribution of a clustering algorithm should not exhibit extremity. In other words, a clustering algorithm should avoid the following situations:

  • The majority of files are grouped into one or few huge clusters
  • The majority of clusters are singletons
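A decomposition can be screened for both situations with a simple check on its cluster size distribution. The thresholds below (half the entities in one cluster, half the clusters being singletons) are illustrative choices, not values from the literature.

```python
def is_extreme(decomposition, huge_fraction=0.5, singleton_fraction=0.5):
    """Flag a decomposition whose cluster size distribution is extreme:
    either one cluster absorbs most of the entities, or most clusters
    are singletons.  Thresholds are illustrative, not standard values."""
    sizes = [len(cluster) for cluster in decomposition]
    total = sum(sizes)
    if max(sizes) / total > huge_fraction:
        return True  # one huge cluster dominates
    if sizes.count(1) / len(sizes) > singleton_fraction:
        return True  # mostly singleton clusters
    return False

print(is_extreme([{1, 2, 3, 4, 5, 6}, {7}]))   # True: one huge cluster
print(is_extreme([{1}, {2}, {3, 4}]))          # True: mostly singletons
print(is_extreme([{1, 2}, {3, 4}, {5, 6}]))    # False: balanced sizes
```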


Last modified:
2010/05/06 10:15