dupFinder Command-Line Tool

dupFinder is a free command-line tool that finds duplicates in C# and Visual Basic .NET code, no more, no less. But being a JetBrains tool, dupFinder does it in a smart way. By default, it considers code fragments duplicates not only if they are identical, but also if they are structurally similar, even if they contain different variables, fields, methods, types, or literals. You can configure the allowed similarity as well as the minimum relative size of duplicated fragments.

Running Duplicate Analysis

To get dupFinder running with your custom CI, version control, quality control or any other server, download and unzip the command line tools package, and then run

dupFinder.exe <source> -o=<PathToOutputFile>

Here, <source> defines the files to include in the duplicate search. You can specify Visual Studio solution or project files, Ant-like wildcards, or specific source file and folder names. Paths must be either absolute or relative to the working directory. For example:

dupfinder.exe c:\MyProjects\MySolution.sln
dupfinder.exe "c:\MyProjects1\**\*.cs"
dupfinder.exe **\*
dupfinder.exe FooFolder\src BarFolder\src

Configuring dupFinder Options

With optional parameters, you can configure the way dupFinder analyzes the sources:

dupFinder.exe [OPTIONS] <source>

To see the full list of options, run dupFinder.exe --help. Here are some options that you may be interested in:

--output (-o): sets the output file.
--exclude (-e): excludes files from the duplicate search. The value is a semicolon-delimited list of wildcards (for example, *Generated.cs). Note that the paths should be either absolute or relative to the working directory.
--exclude-by-comment: excludes files that have a matching substring in their opening comments. Multiple values are separated with semicolons.
--exclude-code-regions: excludes code regions that have a matching substring in their names (for example, 'generated code' will exclude regions containing 'Windows Form Designer generated code'). Multiple values are separated with semicolons.
--discard-fields, --discard-literals, --discard-local-vars, --discard-types: these options let you specify whether to consider similar fragments as duplicates if they have different variables, fields, methods, types or literals. The default value for all of them is false. To illustrate the way it works, consider two code fragments that are otherwise identical: one contains myStatusBar.SetText("Logging In...");, the other contains myStatusBar.SetText("Not Logged In");. If --discard-literals is false (which it is by default), these fragments are considered duplicates. To exclude such items from the list of detected duplicates, you can use --discard-literals=true.
--discard-cost: sets a threshold for the code complexity of duplicated fragments. Fragments with lower complexity are discarded as non-duplicates. The value for this option is given in relative units. It is calculated using an internal algorithm, which basically builds a syntax tree of the analyzed code (it could be built upon a file, project, or solution), so the value is proportional to the size of this tree or its corresponding branch(es). It is somewhat similar to cyclomatic complexity.
Using the --discard-cost option, you can filter out equal code fragments that present no semantic duplication. For example, tests often contain statements such as Assert.AreEqual(gold, result);. If the --discard-cost value is less than 10, statements like that will appear as duplicates, which is unhelpful. You will need to experiment with this value to find a balance between avoiding false positives and missing real duplicates. The proper values will differ for different codebases.
--properties: overrides MSBuild properties. You can set each property separately (--properties:prop1=val1 --properties:prop2=val2), or use a semicolon to separate multiple properties (--properties:prop1=val1;prop2=val2). The specified properties are applied to all analyzed projects. Currently, there is no direct way to set a property for a specific project only. The workaround is to create a custom property in this project and assign it to the desired property, then use the custom property in dupFinder parameters.
--normalize-types: normalizes type names to the last subtype in the output (default: False).
--show-text: if you use this parameter, the detected duplicate fragments will be embedded into the report.
--config-create and --config: these options let you pass the parameters described above in a configuration file. The first option creates a configuration file according to the current parameters; the second option loads the parameters from this file.
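
To make the idea of fragments that are "structurally similar" but differ in literals more concrete, here is a toy Python sketch. It is not how dupFinder works internally (the tool compares syntax trees, as described for --discard-cost); it only replaces simple double-quoted string literals with a placeholder, so that the two SetText fragments from the example above compare equal. The regular expression and the normalize_literals name are illustrative assumptions.

```python
import re

# Matches a simple double-quoted string literal, including escaped characters.
# Verbatim and interpolated C# strings are deliberately not handled in this toy.
STRING_LITERAL = re.compile(r'"(?:[^"\\]|\\.)*"')


def normalize_literals(code: str) -> str:
    """Replace every string literal with a placeholder so only the structure remains."""
    return STRING_LITERAL.sub('"<literal>"', code)


first = 'myStatusBar.SetText("Logging In...");'
second = 'myStatusBar.SetText("Not Logged In");'

# Compared verbatim, the fragments differ; compared by structure, they match.
print(first == second)
print(normalize_literals(first) == normalize_literals(second))
```

The first comparison is on the raw text, the second on the normalized text, which is the kind of difference the --discard-* options are concerned with.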

Here is an example of running dupFinder with optional parameters:

dupfinder.exe --discard-literals=true --caches-home="C:\Temp\DFCache" -o="report.xml" --discard-cost=50 "C:\src\MySolution.sln"
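
When dupFinder is started from a script rather than typed by hand, it can be convenient to assemble the argument list programmatically. The following Python sketch shows one way to do that using only options mentioned in this article. The build_dupfinder_command and run_dupfinder helpers and the executable path are assumptions made for illustration, not part of dupFinder.

```python
import subprocess


def build_dupfinder_command(exe, source, output, options=None):
    """Return the argument list for a dupFinder run.

    `options` maps long option names (without the leading dashes) to values.
    A value of True emits a bare flag such as --show-text.
    """
    args = [exe]
    for name, value in (options or {}).items():
        if value is True:
            args.append(f"--{name}")
        else:
            args.append(f"--{name}={value}")
    args.append(f"--output={output}")
    args.append(source)
    return args


def run_dupfinder(command):
    """Run dupFinder and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(command, check=True)


command = build_dupfinder_command(
    exe=r"C:\programs\CLT\dupfinder.exe",  # assumed install location
    source=r"C:\src\MySolution.sln",
    output="report.xml",
    options={"discard-literals": "true", "discard-cost": 50, "show-text": True},
)
print(command)
# run_dupfinder(command)  # uncomment on a machine where dupFinder is installed
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.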

Understanding Output

The output of dupFinder analysis is a single XML file that presents the following information:

The Statistics node is an overview of the analyzed code, where CodebaseCost is the relative size of the target source code, TotalFragmentsCost is the relative size of the code for analysis after applying filters (--discard-cost, --discard-literals, etc.), and TotalDuplicatesCost is the relative size of the detected duplicates.
The ...Cost values are calculated using an internal algorithm, which basically builds a syntax tree of the analyzed code (it could be built upon a file, project, or solution). So the value is proportional to the size of this tree or its corresponding branch(es). It is somewhat similar to the 'Cyclomatic complexity'.
The Duplicates node contains Duplicate nodes, which in turn contain two or more Fragment elements representing actual code duplicates.
Each Duplicate node has a Cost attribute. Duplicates with a greater cost are the most important ones, as they potentially present greater problems.
Each Fragment element contains the file name as well as the location of the duplicated piece, presented in two alternative ways: as a file offset range in characters and as a line range. If the --show-text option was enabled for the analysis, a Text node with the duplicated code is added to each fragment.
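
As a rough illustration of how this structure can be consumed, the Python sketch below reads a report, computes the share of duplicated code, and lists duplicates by descending cost. The element names are taken from the XSL stylesheet in the Practical Usage Example; the sample report, its DuplicatesReport root element name, and the summarize helper are hand-written assumptions rather than real dupFinder output.

```python
import xml.etree.ElementTree as ET

# Hand-written sample that follows the structure described above.
SAMPLE_REPORT = """<DuplicatesReport>
  <Statistics>
    <CodebaseCost>1000</CodebaseCost>
    <TotalDuplicatesCost>150</TotalDuplicatesCost>
    <TotalFragmentsCost>400</TotalFragmentsCost>
  </Statistics>
  <Duplicates>
    <Duplicate Cost="60">
      <Fragment><FileName>A.cs</FileName></Fragment>
      <Fragment><FileName>B.cs</FileName></Fragment>
    </Duplicate>
    <Duplicate Cost="90">
      <Fragment><FileName>C.cs</FileName></Fragment>
      <Fragment><FileName>D.cs</FileName></Fragment>
    </Duplicate>
  </Duplicates>
</DuplicatesReport>"""


def summarize(report_xml):
    """Return (duplicated share of the codebase, duplicates sorted by cost)."""
    root = ET.fromstring(report_xml)
    stats = root.find(".//Statistics")
    codebase = int(stats.findtext("CodebaseCost"))
    duplicated = int(stats.findtext("TotalDuplicatesCost"))
    duplicates = []
    for dup in root.iter("Duplicate"):
        files = [frag.findtext("FileName") for frag in dup.findall("Fragment")]
        duplicates.append((int(dup.get("Cost")), files))
    # The most expensive duplicates come first, as they matter most.
    duplicates.sort(key=lambda item: item[0], reverse=True)
    return duplicated / codebase, duplicates


ratio, duplicates = summarize(SAMPLE_REPORT)
print(f"Duplicated share: {ratio:.0%}")
for cost, files in duplicates:
    print(cost, files)
```

A check like this could, for instance, be used to fail a build when the duplicated share grows beyond a limit chosen for the codebase.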

Practical Usage Example

In the steps below, we’ll take a solution, for example SolutionWithDuplicates.sln, and see how to start duplicate analysis from an MSBuild target and build a simple HTML report from the dupFinder output.

Step 1

First, we unzip the command line tools package somewhere, e.g. in C:\programs\CLT.

Step 2

Now let’s think ahead to processing the dupFinder output. If we leverage the --show-text option, we’ll be able to build an HTML report by applying an XSL transformation to the dupFinder XML output; something like this will do:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output method="html" indent="yes" />
    <xsl:template match="/">
        <html>
            <body>
                <h1>Statistics</h1>
                <p>Total codebase size: <xsl:value-of select="//CodebaseCost"/></p>
                <p>Code to analyze: <xsl:value-of select="//TotalFragmentsCost"/></p>
                <p>Total size of duplicated fragments: <xsl:value-of select="//TotalDuplicatesCost"/></p>
                <h1>Detected Duplicates</h1>
                <xsl:for-each select="//Duplicates/Duplicate">
                    <h2>Duplicated Code. Size: <xsl:value-of  select="@Cost"/></h2>
                    <h3>Duplicated Fragments:</h3>
                    <xsl:for-each select="Fragment">
                        <xsl:variable name="i" select="position()"/>
                        <p>Fragment <xsl:value-of select="$i"/>  in file <xsl:value-of select="FileName"/></p>
                        <pre><xsl:value-of select="Text"/></pre>
                    </xsl:for-each>
                </xsl:for-each>
            </body>
        </html>
    </xsl:template>
</xsl:stylesheet>

We save this XSL stylesheet as dupFinder.xsl alongside the rest of the tools in C:\programs\CLT.

Step 3

The simplest way to run duplicate analysis and the subsequent transformation is to specify a new MSBuild target. Starting from the solution directory, we go into one of its project subdirectories, open the project file (*.csproj) with a text editor, and add the following element to the root <Project> node:

<Target Name="AfterBuild">
  <Exec
    WorkingDirectory=".."
    Command="C:\programs\CLT\dupfinder.exe /output=&quot;dupReport.xml&quot; /show-text &quot;SolutionWithDuplicates.sln&quot;"/>
  <XslTransformation XslInputPath="C:\programs\CLT\dupFinder.xsl" XmlInputPaths="..\dupReport.xml" OutputPaths="..\dupReport.html"/>
</Target>

In this build target, which executes after the project build is finished, we move the working directory one folder up, from the project directory to the solution directory, run dupFinder, and then apply the XSL transformation to the dupFinder output using our stylesheet.

Step 4

Finally, all we have to do is to build our solution. If everything goes right, we’ll get two new files in the solution directory: dupReport.xml (generated by dupFinder) and dupReport.html. If we open dupReport.html, we can study all detected duplicates right in a web browser:

[Screenshot: dupFinder report transformed into an HTML page]

This simple example can be extended and customized in many ways, but we hope it shows you that there is nothing difficult in integrating duplicate analysis into your workflow.
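
As one possible customization, the HTML page could be produced by a script instead of an XSL transformation, for example on a build agent without an XSLT processor. The Python sketch below mirrors the duplicates part of the stylesheet from Step 2. The render_report helper and the sample report (including its DuplicatesReport root element name) are hand-written assumptions, not real dupFinder output.

```python
import html
import xml.etree.ElementTree as ET

# Hand-written sample report with embedded fragment text (as with --show-text).
SAMPLE = """<DuplicatesReport><Duplicates>
<Duplicate Cost="42">
<Fragment><FileName>A.cs</FileName><Text>if (a &lt; b) return;</Text></Fragment>
<Fragment><FileName>B.cs</FileName><Text>if (x &lt; y) return;</Text></Fragment>
</Duplicate></Duplicates></DuplicatesReport>"""


def render_report(report_xml: str) -> str:
    """Build a minimal HTML page listing every duplicate and its fragments."""
    root = ET.fromstring(report_xml)
    parts = ["<html><body>", "<h1>Detected Duplicates</h1>"]
    for dup in root.iter("Duplicate"):
        cost = html.escape(dup.get("Cost", ""))
        parts.append(f"<h2>Duplicated Code. Size: {cost}</h2>")
        for index, frag in enumerate(dup.findall("Fragment"), start=1):
            # Escape file names and code so that characters like < display correctly.
            name = html.escape(frag.findtext("FileName", default=""))
            text = html.escape(frag.findtext("Text", default=""))
            parts.append(f"<p>Fragment {index} in file {name}</p>")
            parts.append(f"<pre>{text}</pre>")
    parts.append("</body></html>")
    return "\n".join(parts)


page = render_report(SAMPLE)
print(page)
```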

Supported languages

dupFinder supports the following languages: C# and Visual Basic .NET.

Last modified: 21 December 2018

ReSharper 2018.2 Help