Simple Photo De-duplication with PowerShell

After several failed attempts and false starts over the last decade or so, this summer I vowed to finally merge my wife's photo library with my own. For various reasons, we each had a large library of photos from the same events and vacations, with a lot of overlap between them, but not complete overlap.

This meant I needed to examine two large collections of photos and copy all of the non-duplicate photos from one to the other. And I couldn't rely on file names or paths at all, because my wife is good about renaming and organizing her photos, while I am not. So I whipped up a PowerShell script to gather the SHA-1 hash of every file; by comparing them all, I could find and ignore the duplicates. It's not perfect - it'll only find pictures which are exact duplicates. If my collection has the original and my wife's collection just has the "red-eye reduction" version, we'll end up with both in the final collection. But it considerably reduced the amount of de-duplication work we had to do by hand.
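The whole approach rests on one property of hashes: two byte-for-byte identical files produce the same hash regardless of what they're named or where they live. Here's a minimal sketch of that idea using the built-in Get-FileHash cmdlet - the file paths are made up for illustration:

# Two byte-for-byte identical files produce the same SHA-1 hash,
# no matter what they're named or where they live (paths here are made up)
$a = Get-FileHash 'C:\Photos\Mine\IMG_1234.jpg' -Algorithm SHA1
$b = Get-FileHash 'C:\Photos\Hers\Beach-Day-042.jpg' -Algorithm SHA1

$a.Hash -eq $b.Hash   # True only if the two files are exact duplicates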

Here's the function which actually gathers all the photo data for a folder (and all its child folders):

function Get-PhotoData {
    param([string]$path)

    $results = @()

    # Find every .jpg under the given path, including subfolders
    $files = Get-ChildItem $path -Recurse -Filter *.jpg

    $total = ($files | measure).Count

    # The first script block runs once to initialize the counter;
    # the second runs for each file
    $files | % {$i=1} {
        Write-Host "Processing $_ ($i of $total)"

        $props = @{
            Name = $_.Name
            Path = $_.FullName
            Size = $_.Length
            Hash = (Get-FileHash $_.FullName -Algorithm SHA1).Hash
        }

        $results += (New-Object PSObject -Property $props)
        $i++
    }

    $results
}

I ran that function against my wife's photo collection (which was effectively our master collection) and my own:

$master = Get-PhotoData -path $masterPath
$toMerge = Get-PhotoData -path $toMergePath

In theory, I could then have used Compare-Object to figure out which items in my collection were safe to delete (i.e., they already existed in my wife's collection):

$safeToDelete = Compare-Object -IncludeEqual -ExcludeDifferent -ReferenceObject $toMerge -DifferenceObject $master -Property Hash -PassThru | Select-Object -ExpandProperty Path

This would give me a list of paths to photos in my collection whose SHA-1 hash matched a photo already in my wife's collection.
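With that list in hand, the cleanup itself would be a short loop. The -WhatIf switch below is my own addition for illustration - it previews the deletions without actually performing them, which is worth doing on a first pass:

# Preview first; drop -WhatIf once you trust the list
$safeToDelete | % { Remove-Item $_ -WhatIf }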

Or, I could find the list of items in my collection which were missing from her collection:

$toMove = Compare-Object -ReferenceObject $master -DifferenceObject $toMerge -Property Hash -PassThru | ? { $_.SideIndicator -eq '=>' } | Select-Object -ExpandProperty Path

Moving each item in that collection would then be easy:

$total = ($toMove | measure).Count

$toMove | % {$i=1} {
    Write-Host "Moving $_ ($i of $total)"
    Move-Item $_ "[destination path]"
    $i++
}

For small collections, this works great. But with a large enough photo collection, you might start running into performance problems with Compare-Object. If that's the case, with a little extra effort and a little bit of Python you can figure out your $safeToDelete list much faster. First, we dump our photo data to a couple of files:

$master | Select Name, Path, Size, Hash | Export-Csv -NoTypeInformation "master.csv"
$toMerge | Select Name, Path, Size, Hash | Export-Csv -NoTypeInformation "toMerge.csv"

Now we throw together a quick Python program using pandas to read in the two data sets, merge them into a single data set by matching on the file size and hash, and dump the output to another file:

import pandas as pd

# Read in the two photo data files we exported from PowerShell
master = pd.read_csv('../master.csv', header=0)
toMerge = pd.read_csv('../toMerge.csv', header=0)

# Inner join: keep only the rows whose Size and Hash appear in both collections
both = pd.merge(toMerge, master, on=['Size', 'Hash'])
both.to_csv('../safeToDelete.csv')

The new .csv file will have columns 'Path_x' and 'Path_y'; since we passed toMerge as the first argument to merge, Path_x lists all the files in that collection which can safely be deleted. More Python-savvy folks than me can probably handle the deletion straight from the Python script, but I just did it with PowerShell:

$toDelete = (Import-Csv .\safeToDelete.csv).Path_x

$total = ($toDelete | measure).Count

$toDelete | % {$i=1} {
    Write-Host "Deleting $_ ($i of $total)"
    Remove-Item $_
    $i++
}
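As an aside, if you'd rather skip the Python detour entirely, the same "match on size and hash" idea can be done in plain PowerShell with a hashtable, which gives you cheap lookups instead of Compare-Object's pairwise comparisons. This is a rough sketch of that alternative, not what I actually ran:

# Sketch of a pure-PowerShell alternative: index the master collection
# by "Size|Hash", then check each candidate file against that index
$index = @{}
$master | % { $index["$($_.Size)|$($_.Hash)"] = $true }

$safeToDelete = $toMerge |
    ? { $index.ContainsKey("$($_.Size)|$($_.Hash)") } |
    Select-Object -ExpandProperty Path

Building the index is a single pass over the master collection, and each lookup after that is effectively constant time, so it should stay fast even for very large libraries.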

Of course, don't go running any of this code or deleting any files until you've backed your folders up somewhere safe; if you make any mistakes (or any of my code is totally broken), you'll want a safety net in place.