Simple Photo De-duplication with PowerShell
After several failed attempts and false starts over the last decade or so, this summer I vowed to finally get my wife's photo library and my own properly merged. We both had large libraries of photos from various events and vacations, with a lot of overlap but not 100% overlap.
This meant I needed to examine two large collections of photos and copy all of the non-duplicate photos from one to the other. And I couldn't rely on file names or paths at all, because my wife is good about renaming and organizing her photos, while I am not. So I whipped up a PowerShell script to gather the SHA-1 hashes of every file; by comparing them all, I could find and ignore the duplicates. It's not perfect - it'll only find pictures which are exact duplicates. If my collection has the original and my wife's collection just has the "red-eye reduction" version, we'll end up with both in the final collection. But it considerably reduced the amount of de-duplication work we have to do by hand.
Here's the function which actually gathers all the photo data for a folder (and all its child folders):
function Get-PhotoData {
    param([string]$path)

    $results = @()
    # Find every .jpg under $path, including child folders
    $files = Get-ChildItem $path -Recurse -Filter *.jpg
    $total = ($files | measure).Count

    # The first script block sets up the progress counter; the second runs once per file
    $files | % {$i=1} {
        Write-Host "Processing $_ ($i of $total)"
        $props = @{
            Name = $_.Name
            Path = $_.FullName
            Size = $_.Length
            Hash = (Get-FileHash $_.FullName -Algorithm SHA1).Hash
        }
        $results += (New-Object PSObject -Property $props)
        $i++
    }
    $results
}
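For readers who would rather gather the same data in Python, here is a rough equivalent of the function above. It is a minimal sketch, not a port of the original: the names `sha1_of_file` and `get_photo_data` are my own, and it assumes the `*.jpg` pattern is all you need (pattern matching in `pathlib` may be case-sensitive depending on your platform). It also illustrates why only exact duplicates match: the hash is computed over the raw bytes of each file, so any edit produces a different value.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the uppercase hex SHA-1 of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large photos never sit fully in memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    # Get-FileHash reports uppercase hex, so match that for easy comparison
    return digest.hexdigest().upper()


def get_photo_data(root, pattern="*.jpg"):
    """Collect name, path, size and SHA-1 for every matching file under root."""
    records = []
    for path in sorted(Path(root).rglob(pattern)):
        if path.is_file():
            records.append({
                "Name": path.name,
                "Path": str(path),
                "Size": path.stat().st_size,
                "Hash": sha1_of_file(path),
            })
    return records
```

Calling `get_photo_data(r"C:\some\folder")` (a placeholder path) returns a list of dictionaries with the same four fields as the PowerShell objects.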
I ran that function against my wife's photo collection (which was effectively our master collection) and my own:
$master = Get-PhotoData -path $masterPath
$toMerge = Get-PhotoData -path $toMergePath
In theory, I could then have used Compare-Object to figure out which items in my collection were safe to delete (i.e., they already existed in my wife's collection):
$safeToDelete = Compare-Object -IncludeEqual -ExcludeDifferent -ReferenceObject $toMerge -DifferenceObject $master -Property Hash -PassThru | Select-Object -ExpandProperty Path
This would give me a list of paths to photos in my collection which had a matching SHA1 hash to a photo already in my wife's collection.
Or, I could find the list of items in my collection which were missing from her collection:
$toMove = Compare-Object -ReferenceObject $master -DifferenceObject $toMerge -Property Hash -PassThru | ? { $_.SideIndicator -eq '=>' } | Select-Object -ExpandProperty Path
Moving each item in that collection would then be easy:
$total = ($toMove | measure).Count
$toMove | % {$i=1} {
    Write-Host "Moving $_ ($i of $total)"
    Move-Item $_ "[destination path]"
    $i++
}
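Both Compare-Object calls above boil down to one question per photo: does its hash already appear in the master collection? As a rough illustration of that idea (not what Compare-Object does internally), the sketch below answers it with a Python set. The function name `split_by_hash` is hypothetical, and the records are assumed to be dictionaries with `Path` and `Hash` keys, mirroring the properties gathered by Get-PhotoData.

```python
def split_by_hash(master, to_merge):
    """Split to_merge records into (safe_to_delete, to_move) lists of paths."""
    # Set membership checks take roughly constant time per lookup
    master_hashes = {record["Hash"] for record in master}

    safe_to_delete = []
    to_move = []
    for record in to_merge:
        if record["Hash"] in master_hashes:
            # An identical file already exists in the master collection
            safe_to_delete.append(record["Path"])
        else:
            # No file in master has this hash, so this photo is new
            to_move.append(record["Path"])
    return safe_to_delete, to_move
```

Note that this simple version does not look for duplicates within the to-merge collection itself; two identical photos that are both missing from master would both land in `to_move`.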
For small enough collections, Compare-Object works great. But if you've got a large enough photo collection, you might start running into performance problems with it. If that's the case, with a little extra effort and a little bit of Python you can figure out your $safeToDelete list much faster. First, we dump our photo data to a couple of files:
$master | Select Name, Path, Size, Hash | Export-Csv -NoTypeInformation "master.csv"
$toMerge | Select Name, Path, Size, Hash | Export-Csv -NoTypeInformation "toMerge.csv"
Now we throw together a quick Python program using pandas to read in the two data sets, merge them into a single data set by matching on the file size and hash, and dump the output to another file:
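import pandas as pd

# Read in our two photo data files
master = pd.read_csv('../master.csv', header=0)
toMerge = pd.read_csv('../toMerge.csv', header=0)

# Inner join: keep only the toMerge rows whose size and hash also appear in master
both = pd.merge(toMerge, master, on=['Size', 'Hash'])
# index=False leaves pandas' row index out of the output file
both.to_csv('../safeToDelete.csv', index=False)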
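The pandas merge above is an inner join on size and hash. To show roughly what that join computes, here is a standard-library sketch of the same matching. The helper names `read_rows` and `inner_join_on_size_and_hash` are made up for illustration, and the `utf-8-sig` encoding is an assumption about how your CSV files were written. One thing it makes visible: if the master collection holds two copies of the same photo, the matching file from the other collection shows up once per copy, so the same path can appear on more than one output row.

```python
import csv
from collections import defaultdict


def read_rows(csv_path):
    """Load a CSV with Name, Path, Size and Hash columns into a list of dicts."""
    # utf-8-sig tolerates a byte-order mark if the file starts with one
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        return list(csv.DictReader(handle))


def inner_join_on_size_and_hash(to_merge_rows, master_rows):
    """Return (to_merge_path, master_path) pairs for rows sharing Size and Hash."""
    # Index master paths by their (Size, Hash) key for fast lookup
    master_index = defaultdict(list)
    for row in master_rows:
        master_index[(row["Size"], row["Hash"])].append(row["Path"])

    pairs = []
    for row in to_merge_rows:
        # .get avoids adding empty entries to the defaultdict on a miss
        for master_path in master_index.get((row["Size"], row["Hash"]), []):
            pairs.append((row["Path"], master_path))
    return pairs
```

The first element of each pair plays the role of `Path_x` and the second the role of `Path_y`.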
The new .csv file will have columns 'Path_x' and 'Path_y'; since we had toMerge as the first parameter to merge, Path_x is a list of all the files in that collection which can be deleted. More Python-savvy folks than me can probably handle the deletion straight from the Python script, but I just did it with PowerShell:
$toDelete = (Import-Csv .\safeToDelete.csv).Path_x
$total = ($toDelete | measure).Count
$toDelete | % {$i=1} {
    Write-Host "Deleting $_ ($i of $total)"
    Remove-Item $_
    $i++
}
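If you did want to handle the deletion from Python instead, something like the following could work. Treat it as a sketch under a few assumptions: the function name `delete_listed_files` and its `dry_run` flag are my own invention, and it expects a CSV with a `Path_x` column like the one produced above. It drops repeated paths before deleting, skips files that are already gone, and by default only reports what it would delete.

```python
import csv
import os


def delete_listed_files(csv_path, column="Path_x", dry_run=True):
    """Delete each distinct path in `column` of csv_path; return the paths handled."""
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        # dict.fromkeys keeps first-seen order while dropping repeated paths
        paths = list(dict.fromkeys(row[column] for row in csv.DictReader(handle)))

    verb = "Would delete" if dry_run else "Deleting"
    handled = []
    for index, path in enumerate(paths, start=1):
        if not os.path.exists(path):
            print(f"Skipping missing file {path} ({index} of {len(paths)})")
            continue
        print(f"{verb} {path} ({index} of {len(paths)})")
        if not dry_run:
            os.remove(path)
        handled.append(path)
    return handled
```

Running it once with the default `dry_run=True` and reading the output before calling it again with `dry_run=False` gives you a chance to catch surprises.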
Of course, don't go running any of this code or deleting any files until you've backed your folders up somewhere safe; if you make any mistakes (or any of my code is totally broken), you'll want a safety net in place.