Design Pattern Detection

Background \section{Background} \label{background}

\subsection{Language Independence} Fabry and Mens \cite{Fabry2004} proposed a language independent approach for object oriented design patterns through the use of logic meta programming. They separated the language-independent commonalities from language specific variabilities, which enabled them to write logic rules to reason about the structure of object oriented languages, independent of the actual object oriented programming language. They used Smalltalk and Java as an example of two object oriented programming languages that have significant differences. They presented the SOUL logic metaprogramming system, which was extended to support Java reasoning (SOULJava is the extended version of SOUL). They worked on a meta-level interface to provide the representational mapping between the logic meta language and the object oriented base language. Moreover, they worked on a logic repository of all the predicates available for the reasoning engine. The main idea is to provide a logical representation of object oriented source code, and try to identify best practice patterns and design patterns using language independent logic rules. This approach was tested against two Smalltalk and two Java applications. A very limited number of design patterns and best practice patterns were tested. No false positives or false negatives were reported, which is a good indicator of the approach's reliability. Their work shows the feasibility of reasoning about object oriented systems in a language independent way. Moreover, further work needs to be done to apply the approach on other design patterns, which as they mentioned will require a small number of extensions.
\subsection{Results Visualization} Another example is in \cite{Amiot2001}. Albin-Amiot et al. presented a set of tools and techniques to solve the problem of automating the instantiation and the detection of design patterns. They presented the \textit{Pattern Description Language}(PDL) that describes design patterns as first-class entities, providing their associated code and detecting their occurrences in existing code. PDL defines a design pattern's abstract model describing its structural and behavioral aspects. This provides a formalization of design pattern descriptions in order to handle the instantiation and detection of design patterns. They presented a tool called PTIDEJ \textit{Pattern Trace Identification, Detection, and Enhancement for Java} which can be used to automatically manipulate design pattern models using PDL, identify and visualize design pattern complete or distorted versions in the source code, and refactor the distorted ones to best comply with the corresponding design pattern.
\subsection{Systems Parsing} Different approaches have been proposed to statically analyze software systems. Systems could be parsed to another form such as graphs. The new form usually consists of entities, attributes, and relations that need to be examined by tools to detect design patterns.
For instance, Czibula and Serban \cite{Czibula2008} proposed an original clustering-based approach for identifying instances of design patterns. Their approach consisted of four steps: first is \textit{data collection}, where they analyze the system to extract all the relevant entities, such as classes, methods, attributes, etc. They then compute the distance between pairs of classes involved in a relation. Second is \textit{preprocessing}, where they reduce the search space by eliminating classes that can never be part of the design pattern to be detected. Third is \textit{grouping}, where they group entities based on dissimilarity measures between classes. Finally is \textit{design pattern recovery}, where they extract design pattern instances from the generated clusters that meet the design pattern constraints. They only reported one experiment of detecting the Proxy design pattern using this approach, two of the three detected instances were true with no further analysis.
Moreover, the main idea in the study conducted by Kaczor et al. \cite{Kaczor2006} was to convert the program to a string representation. They presented a bit vector algorithm for design pattern detection. Such an algorithm could find the solution to a problem in a bounded number of vector operations, which is independent from the program length. They first convert models of the design motifs and the program in digraphs, considering only binary relationships. Then they convert the models to Eulerian digraphs by adding dummy edges between vertices with unequal in-degree and out-degree edges. The minimum Eulerian circuit is computed in the next step to obtain a unique string representation of a design motif and of the program models. Finally, an iterative Bit-vector algorithm is used to identify design patterns by matching the string representation of design motifs and the string representation of the program models.
\subsection{Tools Scalability} Many studies aim to solve the scalability problem faced by design pattern detection tools. Antoniol et al. \cite{Antoniol1998} presented a conservative approach based on a multi-stage reduction strategy using OO software metrics and structural properties to extract structural design patterns from OO design or code. Their approach exploits software metrics to reduce search space complexity and makes use of method delegation information to further reduce the cardinality of the set of the retrieved pattern candidates. An Abstract Object Language (AOL) is used to represent the code or the design, to keep the approach independent from the programming language. This representation contains information about classes, methods, attributes, as well as the relations between classes. The recovery approach consisted of 5 steps; first the AOL representation is extracted from the code and design (AOL is based on UML), then parsed to AST to be used in the subsequent steps. Next the AST is traversed and decorated with class level metrics computed on the AOL AST. A multi-stage constraint evaluator is then used to evaluate design pattern instances. The constraints are essential to reduce the search space dimensions. This is done through metrics constraint evaluation (for instance, number of inheritance, association, aggregation relationships), followed by structural constraints evaluation, and delegation constraints evaluation. Doing the metrics constraints evaluation first helps reduce the search space complexity and the cardinality of the candidate set.
Another study that focuses on enhancing systems scalability is a study conducted by Gueheneuc et al. \cite{Gueheneuc2004}. They worked on finding a way to reduce the search space for identifying micro-architectures similar to design motifs. They used a set of attributes to identify all classes playing the same role in a design motif. These attributes (size, filiation, cohesion, coupling) are later used to eliminate from the search space. First, they manually built a repository of classes playing roles in design motifs. Then, for each of the mentioned attributes, they used metrics proposed in the literature to associate a set of values with each class. After that, a machine learning algorithm is used to infer a set of rules. The rules are later validated and interpreted to create a set of fingerprints identifying roles in design motifs. These rules are not used to detect design patterns. Instead, they can be used to reduce the search space of participating classes by eliminating classes that do not have the sufficient fingerprint for the candidate role. This approach has been implemented in the PTIDEJ tool, where in some cases it helped reduce the search space by approximately 89\%. The process of calculating the metrics for classes is not time consuming. On the other hand, the process of building the repository and training the data needs most of the work.
\subsection{Tools Evaluation} They pursued their work by creating a benchmark for comparing and evaluating design pattern miner tools in \cite{Fulop2008}. Their ongoing work is language, tool, pattern, and software independent. The benchmark contains three design pattern mining tools and over a thousand design pattern instances. A new tool can be easily added to the benchmark database. The benchmark tool can be used to browse the database and query certain information based on language, software, tool and/or pattern. Pattern instances are shown in detail along with standard statistics based on user evaluation, completeness, and correctness of the results. Precision and recall rates are also available for design pattern instances, which are based on the human evaluation of the instances and a preset threshold value, which means that a detected pattern instance is considered true if the average score of the evaluation is more than the threshold value.
An interesting study was conducted by Kniesel and Binun \cite{Kniesel2009}. The study combined five design pattern detection tools and proposed an approach called \textit{Data Fusion} for combining their results in order to identify patterns not detected by individual tools. They evaluated the similarity scoring tool, DP-Miner, PINOT, PTIDEJ, and FUJABA and reached the following finding. Due to property relaxation, a weaker set of constraints of a design pattern holds. Therefore, tools identify general patterns more that the actual patterns. They proposed notations of subpattern and superpattern relationships that are purely technical and not related to the intent. All subpattern instances are also instances of the superpattern. Making the superpattern a more general version. Moreover, they say a pattern is smaller if it is defined by a weaker set of constraints. Therefore, smaller patterns are identified more reliably with higher confidence by more than one tool. On the other hand, big patterns are rarely classified identically by the tools, and if they are identified identically then they have a very high chance of being true positives. They also noticed that if tools do not agree on big patterns, sometimes they agree on their superpatterns. A superpattern is a witness of a subpattern. EDPs are also considered as witnesses. They presented a set of rules for combining the results of the five studied tools. Their approach succeeded in increasing the recall and the precision rates for some design patterns, while no improvement was achieved in other patterns due to the high number of false positives provided by the tools.

Example \section{Stage 2: Field Type Probing}

\label{field_type_probing}

For this stage, we developed a tool called VariableExplorer, available at \url{http://www.cse.yorku.ca/~haneend/FINDER/scripts/VariableExplorer.jar} \footnote{ VariableExplorer source code is available at \url{http://www.cse.yorku.ca/~haneend/VariableExplorer}}. VariableExplorer aims to find additional information about the type of objects contained in collections, as Javex fails to provide such information. It collects information about methods invoked on fields, and the parameters sent to those methods. It also collects information about field types. This information is later used to estimate the type of objects in collections.
Java source files for the system being examined are required for VariableExplorer. This tool consists of two parts: The first parses the program to an AST (discussed in Section \ref{ast}). The second traverses the AST to extract facts about method invocations. The traversal is performed using a visitor class called VariableExplorerVisitor in VariableExplorer. VariableExplorerVisitor is an extension to the org.eclipse.jdt.core.dom.ASTVisitor class which provides a visitor for abstract syntax trees. VariableExplorerVisitor visits the AST nodes in the provided AST. For this stage, VariableExplorerVisitor visits MethodInvocation nodes. A MethodInvocation node represents a method invocation expression. Below is an example of a MethodInvocation node:

\begin{Verbatim}[baselinestretch=1,frame=single] FieldA.MethodA(FieldB) \end{Verbatim}

FieldA represents the node expression, MethodA represents the method invoked on FieldA, and FieldB represents the node argument.

FieldDeclaration, SingleVariableDeclaration, and MethodInvocation nodes. Below is a description of each: \begin{itemize}

\item FieldDeclaration: represents a field declaration statement. Below is an example of a FieldDeclaration node:

\begin{Verbatim}[baselinestretch=1,frame=single] ClassA FieldA \end{Verbatim}

ClassA represents the node type. FieldA  represents the node name.
\item VariableDeclarationStatement: represents a local variable declaration statement. Below is an example of a VariableDeclarationStatement node:

\begin{Verbatim}[baselinestretch=1,frame=single] Method(){

ClassA VariableA

} \end{Verbatim}

ClassA represents the node type. VariableA represents the node name.
\item SingleVariableDeclaration: Single variable declaration nodes are used in a limited number of places, including formal parameter lists and catch clauses. They are not used for field declarations and regular variable declaration statements. Below is an example of a SingleVariableDeclaration node:

\begin{Verbatim}[baselinestretch=1,frame=single] MethodA(ClassA VariableA) \end{Verbatim}

ClassA represents the node type. VariableA represents the node name.
\item MethodInvocation: represents a method invocation expression. Below is an example of a MethodInvocation node:

\begin{Verbatim}[baselinestretch=1,frame=single] FieldA.MethodA(FieldB) \end{Verbatim}

FieldA represents the node expression, FieldB represents the node argument.

\end{itemize} VariableExplorer produces a factbase specifying information about fields and their invoked methods. The output is a ternary relation called MethodInvocation. Below is an example of a MethodInvocation relation:

\begin{Verbatim}[baselinestretch=1,frame=single] MethodInvocation ClassC FieldA FieldB \end{Verbatim}

This means that ClassC invokes a method on FieldA passing FieldB as a parameter

\section{Stage 3: Static Facts Refinement and Integration}

\label{static_facts_integration}

For this stage, we use QL to integrate the static facts produced in the first stage using QL with the facts produced in the second stage by VariableExplorer. The corresponding script is available at \url{http://www.cse.yorku.ca/~haneend/FINDER/scripts/fact_integrate.ql}. \\Assume that we have the following Java code: \begin{Verbatim}[baselinestretch=1,frame=single] public class ClassA{

  ClassB variable1;
  List<ClassB> variable2;

} \end{Verbatim} This example represents a class ClassA that contains two variables: variable1 and variable2. variable1 is of type ClassB. variable2 is a strongly-typed list of type ClassB. The first stage will produce the following facts about ClassA:

\begin{Verbatim}[baselinestretch=1,frame=single] class_variable ClassA variable1 class_variable ClassA variable2 variable_descriptors variable1 ClassB variable_descriptors variable2 Collection \end{Verbatim} Note that the first stage is not able to capture the type of the objects in variable2. Instead, variable2 is expressed as a generic collection.
Next, we investigate the method invocations on generic collections to estimate the type of the objects in the collection. This is done by using the MethodInvocation relation provided by VariableExplorer. Assume that we have the following relation produced by VariableExplorer: ClassVariableType ClassA variable1 ClassB ClassVariableType ClassA variable3 Collection \begin{Verbatim}[baselinestretch=1,frame=single] MethodInvocation ClassA variable2 variable1 \end{Verbatim}

This example represents a method invocation on the collection variable2 in ClassA, passing variable1 as a parameter. Method invocation on collections are the add, remove, and other methods supported by the Collection class in Java, and the passed parameters usually represent an element in the collection. Based on the type of the passed element, we can determine the type of the objects in the generic collection. Since variable1 is of type ClassB, the generic-typed collection variable2 is changed to a strongly-typed collection of type ClassB.

One limitation of VariableExplorer is that it is not able to find the full variable type of method parameters. VariableExplorer gathers information from AST nodes, which contain information about field and variable types represented as strings. Assume that the full name for ClassB is ClassC.ClassB, and we have a the following FieldDeclaration node in ClassA:

\begin{Verbatim}[baselinestretch=1,frame=single] ClassB variable1 \end{Verbatim}

VariableExplorer will produce the following relation:

\begin{Verbatim}[baselinestretch=1,frame=single] ClassVariableType ClassA variable1 ClassB \end{Verbatim}

VariableExplorer will not get the full name of ClassB as ClassC.ClassB.
In this stage we solve this limitation by using the uses relationship described in the first stage. Assume that we have the following factbase:

\begin{Verbatim}[baselinestretch=1,frame=single] uses ClassA ClassC.ClassB ClassVariableType ClassA variable1 ClassB \end{Verbatim}

The first fact was created in the first stage, and the second fact was created in the second stage. The facts represent a class ClassA that uses a class ClassC.ClassB, where ClassA contains a a field of type ClassB. Since ClassA uses ClassC.ClassB, we can guess the full type of variable1 is ClassC.ClassB. And we modify strongly-typed collections with the variable full type.
The final factbase consists of all the relations produced by the second stage, in addition to the relations shown in Table \ref{integrate_relations}:

\begin{table} \caption{Relations produced by Stage 3 \label{integrate_relations}} \begin{center} {\scriptsize

\begin{tabular}{|l|l|p{5cm}|}

\hline
	\textbf{Relation Name}		& \textbf{Example}  & \textbf{Explanation}\\
\hline
class\_typed\_collections	&	class\_typed\_collections ClassA CollectionC ClassB &  ClassA contains a strongly-typed collection CollectionC of type ClassB. Moreover, if a class contains multiple separate references of another class, it is also considered a strongly-typed collection of the latter class type. \\
\hline
generic\_typed\_collections	&	generic\_typed\_collections ClassA CollectionC &  ClassA contains a generic-typed collection CollectionC.\\
\hline

\end{tabular} } \end{center} \end{table}

\section{Stage 4: Design pattern Instance Recovery}

\label{instance_recovery}

In the last stage, we use QL to filter the factbase for instance candidates that satisfy the static definition of design patterns that we want to detect. For each design pattern, we define its participating roles. And for each role, we define the relations that it needs to satisfy. These static definitions have been created by studying each design pattern's structure, intent, and implementation. These definitions are denoted using the FINDER notation discussed in section \ref{model}, and translated to rules written in QL script. These QL scripts are available at \url{http://www.cse.yorku.ca/~haneend/FINDER/scripts/ql/}. Following is an example of a QL script of a design pattern that consists of three roles:

\begin{Verbatim}[baselinestretch=1,frame=single] DP[ClassA,ClassB,ClassC] = {

//ClassA uses ClassB
uses[ClassA,ClassB]; 
//ClassB uses ClassC
uses[ClassB,ClassC]; 
//ClassA contains a method that returns ClassB
class_methods[ClassA, method_id];
method_return[method_id, ClassB];
//ClassB contains a static method that returns ClassC
class_methods[ClassB, method2_id];
method_static[method2_id];
method_return[method2_id, ClassC];

} \end{Verbatim}

The output of this script is a list of instance candidates that satisfy these rules. A single design pattern instance candidate consists of role candidate class names corresponding to the design pattern roles. The following shows an example of two candidates for the previous script:

\begin{Verbatim}[baselinestretch=1,frame=single] DP RoleA, RoleB, RoleC DP ClassD ClassE ClassF DP ClassG ClassH ClassI \end{Verbatim} The first line defines the design pattern roles: RoleA, RoleB, and RoleC. The other two lines represent two instance candidates . The number of classes in each line has to match the number of roles. The first instance candidate consists of ClassD (a candidate for RoleA), ClassE (a candidate for RoleB), and ClassF (a candidate for RoleC).
In addition to design pattern roles, design pattern instance candidates may contain a method or a field that has an important function in the design pattern, such as the factory method in the FactoryMethod design pattern, and the Collection object in the Composite design pattern. \section{Example} \begin{figure}[ht] \caption{State \label{example:model:state}} \begin{center} \includegraphics[height=3.3in]{state} \end{center} \end{figure} In this section, we show how the FINDER model of the State design pattern discussed in Section \ref{FINDER_state} is translated to a set of rules written in QL. The State design pattern detection script is available at \url{http://www.cse.yorku.ca/~haneend/FINDER/scripts/ql/state.ql}.
Below are the rules represented by this model followed by the QL script that they are translated to: \begin{enumerate} \item ConcreteState must inherit from State: \begin{Verbatim}[baselinestretch=1,frame=single] inherits[concreteState,state]; \end{Verbatim} \item ConcreteState must be concrete: \begin{Verbatim}[baselinestretch=1,frame=single] class_concrete[concreteState]; \end{Verbatim} \item Context must contain a field of type State: \begin{Verbatim}[baselinestretch=1,frame=single] class_variables[context, variable_id]; variable_descriptors[variable_id,state]; \end{Verbatim} \item Context must contain a method that calls a method in State: \begin{Verbatim}[baselinestretch=1,frame=single] class_methods[concreteState, method_id]; class_methods[context, method2_id]; method_invoke[method2_id, method_id]; \end{Verbatim} \item Variation1: \begin{itemize} \item CreationInContext:
Context must contain a method that creates a new ConcreteState: \begin{Verbatim}[baselinestretch=1,frame=single] class_methods[context, method_id]; method_new[method_id, concreteState]; \end{Verbatim} \item CreationInConcreteState:
ConcreteState must contain a method that creates a new ConcreteState: \begin{Verbatim}[baselinestretch=1,frame=single] class_methods[concreteState, method_id]; method_new[method_id, concreteState]; \end{Verbatim} Also, the Context gets State passed to it through method parameters: \begin{Verbatim}[baselinestretch=1,frame=single] class_methods[context, method_id]; method_parameters[method_id, state]; \end{Verbatim} Or by calling a method that returns a Context: \begin{Verbatim}[baselinestretch=1,frame=single] class_methods[context, method_id]; method_invoke_direct[method_id,method2_id]; method_return[method2_id, state]; \end{Verbatim} \end{itemize} \item Option1 (\textit{ContextUsage}):
ConcreteState must contain a reference to Context. Either by having a field of type Context: \begin{Verbatim}[baselinestretch=1,frame=single] class_variables[concreteState, variable_id]; variable_descriptors[variable_id,context]; \end{Verbatim} Or having Context passed to it through method parameters: \begin{Verbatim}[baselinestretch=1,frame=single] class_methods[concreteState, method_id]; method_parameters[method_id, context]; \end{Verbatim} Or by calling a method that returns Context: \begin{Verbatim}[baselinestretch=1,frame=single] class_methods[concreteState, method_id]; method_invoke_direct[method_id,method2_id]; method_return[method2_id, context]; \end{Verbatim} \end{enumerate} The result of the previous scripts is a relation called DP that consists of three roles: Context, State, and ConcreteState. This relation contains instance candidates for the State design pattern that satisfy the FINDER rules specified in the State FINDER definition. Each line represents a single instance candidate. Below is a sample output for the DP relation: \begin{Verbatim}[baselinestretch=1,frame=single] DP Context State ConcreteState

DP Class1 Class2 Class3
DP Class1 Class2 Class4
DP Class1 Class2 Class5
DP Class6 Class7 Class8
DP Class6 Class8 Class9

\end{Verbatim}

\begin{enumerate} \setcounter{enumi}{6}

	\item Post processing rule: There must be least two ConcreteState role candidates in the same State anchor instance:

\begin{Verbatim}[baselinestretch=1,frame=single] contextCandidates = dom DP for context in contextCandidates {

state_concreteState = {context} . DP
stateCandidates = dom state_concreteState
for state in stateCandidates
{
  concreteStates = {state} . state_concreteState
  if(#concreteStates[&0] < 2)
  {
    DP = DP - DP[&0 =~ context && &1 =~ state]
  }
}

} \end{Verbatim}

Running this post processing rule eliminates instance candidates that do not satisfy this rule. The last two instance candidates in DP will be deleted and the final output will be as follows: \pagebreak[1] \begin{Verbatim}[baselinestretch=1,frame=single] DP Context State ConcreteState DP Class1 Class2 Class3 DP Class1 Class2 Class4 DP Class1 Class2 Class5 \end{Verbatim} The results show that there are three instance candidates of the State design pattern. Class1 plays the role of Context, Class2 plays the role of State, and Class3, Class4, and Class5 play the role of Concrete States. \end{enumerate}