Sometimes, when we open an Explorer window on our computer, we see red bars on some drives, telling us that the disk is almost full and needs some cleanup. We run the system cleanup, which frees some space, but often that isn't enough to make things better.
So, we try to remove duplicate files to free some extra space, but we have a problem: where are the duplicate files? The first idea is to look for files with the same name and size, but that isn't enough - files can be renamed and still be duplicates.
The best thing to do, then, is to find a way to locate and list all the duplicates on the disk. But how can we do this?
The naive approach is to take all files with the same size and compare each one with all the others. But this is really cumbersome: if there are 100 files in a group, there will be 100!/(2!·98!) = (100 × 99)/2 = 4,950 comparisons, and the algorithm has O(n²) complexity.
Another approach is to compute a checksum for each file and compare the checksums. That way you still have O(n²) comparisons, but there is much less data to compare (although you also have to account for the time spent computing the checksums). A third approach is to use a dictionary to group the files with the same hash. A dictionary lookup is O(1), so grouping all files becomes an O(n) operation, as sketched below.
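To make the dictionary idea concrete, here is a minimal sketch of the grouping step (ComputeHash is a hypothetical placeholder for whatever checksum we end up choosing; it is not part of any library):

private static Dictionary<uint, List<string>> GroupByHash(IEnumerable<string> filePaths)
{
    // One pass over the files: an O(1) dictionary lookup per file,
    // instead of comparing every file with every other one
    var groups = new Dictionary<uint, List<string>>();
    foreach (var path in filePaths)
    {
        var hash = ComputeHash(path);   // hypothetical checksum function
        if (!groups.TryGetValue(hash, out var list))
            groups[hash] = list = new List<string>();
        list.Add(path);
    }
    // Only the hashes shared by more than one file are duplicate candidates
    return groups.Where(g => g.Value.Count > 1)
                 .ToDictionary(g => g.Key, g => g.Value);
}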
Now, we only have to choose the checksum. Every checksum has a number of bits and, roughly, the more bits, the longer it takes to compute - but the more bits, the smaller the chance of a false match. If you use a CRC16 checksum (16 bits), there are only 65,536 possible values, so the probability of two different files having the same checksum is relatively high. CRC32 has 4,294,967,296 possible values and is therefore much less likely to give a wrong result. You can use other algorithms, like MD5 (128 bits), SHA1 (160 bits) or SHA256 (256 bits), but computing these takes considerably longer than computing a CRC32. As we are seeking speed rather than perfect accuracy, we'll use the CRC32 algorithm to compute the hashes. A fast implementation (the CRC32C variant) is available by installing the Crc32C.NET NuGet package.
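Just to get a feel for the API before we use it on real files, here is a minimal sketch of computing a CRC32C value with the library (assuming using Crc32C; and using System.Text; directives; the sample buffers are only illustration data):

// Compute the CRC32C of a first buffer, then extend it with a second one;
// this Compute/Append pattern is the same one we'll use to hash files in chunks
byte[] first = Encoding.UTF8.GetBytes("Hello, ");
byte[] second = Encoding.UTF8.GetBytes("world");
uint crc = Crc32CAlgorithm.Compute(first, 0, first.Length);
crc = Crc32CAlgorithm.Append(crc, second, 0, second.Length);
Console.WriteLine(crc.ToString("X8"));   // the checksum, printed in hex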
From there, we can create our program to find and list the duplicates on the disk. In Visual Studio, create a new WPF application. In the Solution Explorer, right-click the References node, select Manage NuGet Packages and add the WpfFolderBrowser and Crc32C.NET packages. Then add this code in MainWindow.xaml:
<Grid>
    <Grid.RowDefinitions>
        <RowDefinition Height="40"/>
        <RowDefinition Height="*"/>
    </Grid.RowDefinitions>
    <Button Width="85" Height="30" Content="Start" Click="StartClick"
            HorizontalAlignment="Right" Margin="5" Grid.Row="0"/>
    <Grid Grid.Row="1">
        <Grid.RowDefinitions>
            <RowDefinition Height="*"/>
            <RowDefinition Height="30"/>
        </Grid.RowDefinitions>
        <ScrollViewer HorizontalScrollBarVisibility="Disabled">
            <ItemsControl x:Name="FilesList" HorizontalContentAlignment="Stretch">
                <ItemsControl.ItemTemplate>
                    <DataTemplate>
                        <Grid HorizontalAlignment="Stretch">
                            <Grid.RowDefinitions>
                                <RowDefinition Height="30" />
                                <RowDefinition Height="Auto" />
                            </Grid.RowDefinitions>
                            <TextBlock Text="{Binding Value[0].Length, StringFormat=N0}"
                                       Margin="5" FontWeight="Bold"/>
                            <TextBlock Text="{Binding Key, StringFormat=X}"
                                       Margin="5" FontWeight="Bold" HorizontalAlignment="Right"/>
                            <ItemsControl ItemsSource="{Binding Value}" Grid.Row="1"
                                          HorizontalAlignment="Stretch"
                                          ScrollViewer.HorizontalScrollBarVisibility="Disabled"
                                          Background="Aquamarine">
                                <ItemsControl.ItemTemplate>
                                    <DataTemplate>
                                        <TextBlock Text="{Binding FullName}" Margin="15,0" />
                                    </DataTemplate>
                                </ItemsControl.ItemTemplate>
                            </ItemsControl>
                        </Grid>
                    </DataTemplate>
                </ItemsControl.ItemTemplate>
            </ItemsControl>
        </ScrollViewer>
        <StackPanel Grid.Row="1" Orientation="Horizontal">
            <TextBlock x:Name="TotalFilesText" Margin="5,0" VerticalAlignment="Center"/>
            <TextBlock x:Name="LengthFilesText" Margin="5,0" VerticalAlignment="Center"/>
        </StackPanel>
    </Grid>
</Grid>
In the button's click event handler, we will open a Folder browser dialog and, if the user selects a folder, we will process it, enumerating the files and finding the ones that have the same size. Then, we calculate the Crc32 for these files and add them to a dictionary, grouped by hash:
private async void StartClick(object sender, RoutedEventArgs e)
{
    var fbd = new WPFFolderBrowserDialog();
    if (fbd.ShowDialog() != true)
        return;
    FilesList.ItemsSource = null;
    var selectedPath = fbd.FileName;
    var files = await GetPossibleDuplicatesAsync(selectedPath);
    FilesList.ItemsSource = await GetRealDuplicatesAsync(files);
}
The GetPossibleDuplicatesAsync will enumerate the files and group them by size, returning only the groups that have more than one file:
private async Task<List<IGrouping<long, FileInfo>>> GetPossibleDuplicatesAsync(string selectedPath)
{
    List<IGrouping<long, FileInfo>> files = null;
    await Task.Factory.StartNew(() =>
    {
        files = GetFilesInDirectory(selectedPath)
            .OrderByDescending(f => f.Length)
            .GroupBy(f => f.Length)
            .Where(g => g.Count() > 1)
            .ToList();
    });
    return files;
}
GetFilesInDirectory enumerates the files in the selected directory:
private List<FileInfo> GetFilesInDirectory(string directory)
{
    var files = new List<FileInfo>();
    try
    {
        // Directory.GetDirectories already returns the full paths of the subdirectories
        var directories = Directory.GetDirectories(directory);
        try
        {
            var di = new DirectoryInfo(directory);
            files.AddRange(di.GetFiles("*"));
        }
        catch
        {
            // ignore files we don't have access to
        }
        foreach (var dir in directories)
        {
            files.AddRange(GetFilesInDirectory(dir));
        }
    }
    catch
    {
        // ignore directories we don't have access to
    }
    return files;
}
After we have the potential duplicates grouped by size, we can find the real duplicates with GetRealDuplicatesAsync:
private static async Task<Dictionary<uint, List<FileInfo>>> GetRealDuplicatesAsync(
    List<IGrouping<long, FileInfo>> files)
{
    var dictFiles = new Dictionary<uint, List<FileInfo>>();
    await Task.Factory.StartNew(() =>
    {
        foreach (var file in files.SelectMany(g => g))
        {
            var hash = GetCrc32FromFile(file.FullName);
            if (hash == 0)      // 0 means the file could not be read
                continue;
            if (dictFiles.ContainsKey(hash))
                dictFiles[hash].Add(file);
            else
                dictFiles.Add(hash, new List<FileInfo>(new[] { file }));
        }
    });
    // keep only the hashes that are shared by more than one file
    return dictFiles.Where(p => p.Value.Count > 1).ToDictionary(p => p.Key, p => p.Value);
}
The GetCrc32FromFile method uses the Crc32C library to compute the CRC32 hash of a file. Note that we can't compute the hash in a single pass by reading the whole file into memory, as this would fail for files larger than 2 GB. So we read chunks of 10,000 bytes and process them one at a time:
public static uint GetCrc32FromFile(string fileName)
{
    try
    {
        using (FileStream file = new FileStream(fileName, FileMode.Open))
        {
            const int NumBytes = 10000;
            var bytes = new byte[NumBytes];
            var numRead = file.Read(bytes, 0, NumBytes);
            if (numRead == 0)
                return 0;
            var crc = Crc32CAlgorithm.Compute(bytes, 0, numRead);
            while (numRead > 0)
            {
                numRead = file.Read(bytes, 0, NumBytes);
                if (numRead > 0)
                    crc = Crc32CAlgorithm.Append(crc, bytes, 0, numRead);   // accumulate the CRC chunk by chunk
            }
            return crc;
        }
    }
    catch (Exception ex) when (ex is UnauthorizedAccessException || ex is IOException)
    {
        // files that can't be opened are skipped (hash 0)
        return 0;
    }
}
Now, when you run the app, you will get something like this:
You can then check which files you want to remove, go to Explorer and delete them. But there is one issue here: computing the hashes takes a long time, especially if you have a lot of data to process (large files, a large number of files, or both). Could it be improved?
This issue is not trivial to solve, but fortunately .NET provides us with an excellent tool to improve performance in this case: parallel programming. With a small change in the code, we can calculate the CRCs of the files in parallel, improving the performance. But there is a catch: the classes we are using are not thread safe. If you use the regular Dictionary and List to store the data, you will end up with wrong results. Once again, .NET comes to the rescue: it provides ConcurrentDictionary and ConcurrentBag as thread-safe replacements for the common classes, so we can store the data safely. We can then change the code to this:
private static async Task<Dictionary<uint, List<FileInfo>>> GetRealDuplicatesAsync(
    List<IGrouping<long, FileInfo>> files)
{
    var dictFiles = new ConcurrentDictionary<uint, ConcurrentBag<FileInfo>>();
    await Task.Factory.StartNew(() =>
    {
        Parallel.ForEach(files.SelectMany(g => g), file =>
        {
            var hash = GetCrc32FromFile(file.FullName);
            if (hash != 0)
            {
                // GetOrAdd is atomic, so two threads can't lose a file by racing
                // between a ContainsKey check and the insertion
                dictFiles.GetOrAdd(hash, _ => new ConcurrentBag<FileInfo>()).Add(file);
            }
        });
    });
    return dictFiles.Where(p => p.Value.Count > 1)
        .OrderByDescending(p => p.Value.First().Length)
        .ToDictionary(p => p.Key, p => p.Value.ToList());
}
When we do that and run our program again, we see that more CPU is used for the processing and the time to get the list drops from 78 to 46 seconds (for 18 GB of duplicate files).
Conclusions
With this program, we can list the largest duplicates in a folder and see what can be safely deleted from our disk, thus reclaiming some space (in our case, potentially an extra 9 GB). We've also optimized the code by parallelizing the CRC calculations with the parallel extensions in .NET.
The source code for this article is at https://github.com/bsonnino/FindDuplicates