Evaluation



Software clustering researchers have developed several evaluation methods for software clustering algorithms. This research is important because

  1. Most software clustering work is evaluated through case studies, so it is important to have evaluation techniques that are not subjective.
  2. Evaluation helps discover the strengths and weaknesses of the various software clustering algorithms. This allows the development of better algorithms through addressing the discovered weaknesses.
  3. Evaluation can help indicate the types of systems that are suitable for a particular algorithm. For instance, Mitchell et al., in their paper “CRAFT: A Framework for Evaluating Software Clustering Results in the Absence of Benchmark Decompositions”, suggest that Bunch may not be suitable for event-driven systems.

The importance of evaluating software clustering algorithms was first stated by Lakhotia and Gravely in their 1995 paper “Toward experimental evaluation of subsystem classification recovery techniques”. Since then, many approaches to this problem have been published in the literature. These can be divided into two categories:

  1. Based on an authoritative decomposition
  2. Not based on an authoritative decomposition

Evaluation based on authoritative decomposition

The main principle of evaluation based on an authoritative decomposition is that a clustering produced by an algorithm should resemble the clustering produced by some authority. Therefore, such evaluation methods provide the means to calculate the quality of an automatic decomposition by comparing it to the authoritative one.

The evaluation methods can be divided into two categories. The first category evaluates a software clustering approach based on the comparison between the authoritative decomposition and the automatic decomposition. A typical example of such a method is the MoJo distance, introduced in “MoJo: A Distance Metric for Software Clusterings”. Most of the evaluation methods that have been developed belong to this category.
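To make the idea of comparing an automatic decomposition to an authoritative one concrete, the following is a minimal sketch. It is not the MoJo metric itself; it scores the two decompositions by pairwise agreement, i.e. the fraction of entity pairs that both decompositions either place together or keep apart (a Rand-index-style measure). The decompositions and file names are hypothetical.

```python
from itertools import combinations

def cluster_of(decomposition):
    """Map each entity to the name of the cluster that contains it."""
    return {e: name for name, members in decomposition.items() for e in members}

def pairwise_agreement(auto, authoritative):
    """Fraction of entity pairs classified the same way by both decompositions."""
    a, b = cluster_of(auto), cluster_of(authoritative)
    entities = sorted(a)  # assumes both decompositions cover the same entities
    agree = total = 0
    for x, y in combinations(entities, 2):
        total += 1
        if (a[x] == a[y]) == (b[x] == b[y]):
            agree += 1
    return agree / total

# Hypothetical decompositions of six source files:
auto = {"c1": ["f1", "f2", "f3"], "c2": ["f4", "f5", "f6"]}
authoritative = {"ui": ["f1", "f2"], "core": ["f3", "f4"], "io": ["f5", "f6"]}
print(pairwise_agreement(auto, authoritative))  # → 0.6666666666666666
```

A real metric such as MoJo instead counts the move and join operations needed to transform one decomposition into the other, but the input and output shape of the comparison is the same.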

Evaluation methods of the second category focus on the evaluation of a specific stage of the software clustering process. Such a method calculates the quality of a specific stage based on analysis of its inputs and outputs. For instance, an evaluation method that focuses on the analysis phase will take into account the input meta-model, compare the authoritative decomposition and the produced software clustering decomposition, and then calculate a number that reflects the quality of the produced decomposition. A typical example of such a method is EdgeSim, presented in “Comparing the decompositions produced by software clustering algorithms using similarity measurements”.
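The following is a simplified sketch in the spirit of EdgeSim (not the published definition verbatim): it takes the dependency edges of the factbase into account and scores two decompositions by the fraction of edges that both classify the same way, i.e. as intra-cluster or as inter-cluster edges. The edges and module names are hypothetical.

```python
def cluster_of(decomposition):
    """Map each entity to the name of the cluster that contains it."""
    return {e: name for name, members in decomposition.items() for e in members}

def edge_agreement(edges, decomp_a, decomp_b):
    """Fraction of dependency edges that both decompositions classify
    the same way (intra-cluster vs. inter-cluster)."""
    a, b = cluster_of(decomp_a), cluster_of(decomp_b)
    same = sum(1 for u, v in edges if (a[u] == a[v]) == (b[u] == b[v]))
    return same / len(edges)

# Hypothetical factbase: dependency edges between five modules.
edges = [("m1", "m2"), ("m2", "m3"), ("m3", "m4"), ("m4", "m5")]
auto = {"A": ["m1", "m2", "m3"], "B": ["m4", "m5"]}
authoritative = {"X": ["m1", "m2"], "Y": ["m3", "m4", "m5"]}
print(edge_agreement(edges, auto, authoritative))  # → 0.5
```

The key difference from the first category is that the comparison is grounded in the input of the clustering stage (the dependency graph), not just in the two decompositions.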

The main reason for the development of both types of evaluation methods is that the quality of a produced decomposition depends on:

  • Selection of an appropriate clustering algorithm – Different clustering algorithms produce different outputs from the same factbase.
  • Selection of input parameters – A clustering algorithm might produce different outputs depending on the selected input parameters, such as the similarity function or the input meta-model.

An orthogonal categorization of software clustering evaluation methods divides them into two classes:

  • Evaluation of software clustering algorithms that produce flat decompositions
  • Evaluation of software clustering algorithms that produce nested decompositions
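One practical consequence of the flat/nested distinction, sketched below under the assumption that a nested decomposition is represented as a tree of nested lists (a hypothetical encoding, not one prescribed by any particular tool): a nested decomposition can be cut at a chosen depth to obtain a flat decomposition, after which a flat comparison metric can be applied.

```python
def flatten(tree, depth):
    """Cut a nested decomposition (nested lists of leaf names) at `depth`,
    returning a flat list of clusters."""
    if depth == 0 or all(isinstance(c, str) for c in tree):
        # Collect every leaf under this subtree into a single cluster.
        leaves, stack = [], [tree]
        while stack:
            node = stack.pop()
            for c in node:
                leaves.append(c) if isinstance(c, str) else stack.append(c)
        return [sorted(leaves)]
    clusters = []
    for child in tree:
        clusters.extend(flatten(child if isinstance(child, list) else [child],
                                depth - 1))
    return clusters

# Hypothetical nested decomposition: two subsystems, one of which
# contains a nested sub-subsystem.
nested = [["f1", "f2"], [["f3", "f4"], ["f5"]]]
print(flatten(nested, 1))  # → [['f1', 'f2'], ['f3', 'f4', 'f5']]
print(flatten(nested, 2))  # → [['f1', 'f2'], ['f3', 'f4'], ['f5']]
```

Evaluation methods designed for nested decompositions avoid this lossy flattening step by comparing the containment hierarchies directly.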
evaluation.1273111663.txt.gz · Last modified: 2010/05/06 02:07 by mark
