Find and Delete Duplicate Photos with Power Query - Excelerator BI

Find and Delete Duplicate Photos with Power Query

I was trying to clean up my hard disk drive on my PC the other day.  I use a program called WinDirStat to show me graphically what is taking space (as an aside, this is great software and you can download it here).  After running the software I noticed that I had stored photos from my phone in a few different places.  Each of the 3 sections below are images from my phone stored in different locations

image

Unfortunately the names of the files were different in the different locations as can be seen side by side below.  I think this is as a result of me using different software to transfer the photos across from time to time.  Apple really doesn’t go out of its way to make it easy, of course.

image

So the problem is, how could I delete the doubled up files without deleting any photos that were truly different.  There are some great image management tools out there – I personally love EXIF Tool  https://www.sno.phy.queensu.ca/~phil/exiftool/    But I wanted to use Power Query to solve this problem.

Power Query Online Training

The Approach I Took

  1. Create a query that links to each folder using Power Query for Excel, and set the list to “do not load”.
  2. Focusing on 1 folder at a time, I joined the list in the first query to the list from another query to flush out the duplicates.
  3. I then generated a batch file from the resulting list of matching (duplicate) files.
  4. I ran the batch file to delete the duplicates

Load each Folder of File Names

This is pretty standard Power Query stuff.  Just select, Data\Get Data\From File\From Folder.  Once the query is created,  set the query to “do not load”.  Repeat for each other main folder location.  The only thing I did that was a bit “non-standard” was to extract the file size from the attributes column.  I just clicked the expand button (1) and then deselected all the options just keeping size (2 below).

image

Merge Two Queries

The next thing was to merge 2 queries together.  I did this by selecting 3 columns to match in the merge.  I figure if the extension, the date modified, and the size are all the same, then it must be the same image.  I did an “inner join”, hence just keeping those that match.

image

Note that I didn’t use date created, as this is the date the file was copied to my PC rather than the date the photo was created on the phone.

In this merge query, I also created a full path to each file I wanted to delete by merging the path and file name column as shown below.

image

And I then prepended the DOS command del (including a space) to the front of the file names.

image

This left me with the exact batch commands I needed to delete the duplicates

image

Before moving on, I decided to do some quick checks to make sure what I had done was what I expected.  I duplicated the merge query. In this new copy, I edited the query to give me a side by side list of the full file path and names for both folder locations.  I could then use this list to to a spot check on the images.

image

I picked a few rows at random and checked both images.  Everything was fine.

Create the Batch File

This is pretty simple.  I simply cut and pasted the “to be deleted” list into notepad and then saved the file.  I called the file c:\users\matt\delme.bat

image

Finally I ran the batch file (found it in Windows Explorer and double clicked to execute). Actually the first attempt didn’t work because I had to add double quotes around the file names. I fixed that in my query and repeated the process. Then I went back to my Power Query workbook and hit refresh – no more duplicates

Before 416.6 GB space used

After deletions it was 373.3 GB, a space savings of more than 40GB – well worth the effort.

Want to Learn Power Query in a Structured Way?

My Power Query Online training course consists of 7.5 hours of online video training material that will teach you everything you need to know about using this fabulous tool for both Power BI and Excel. You can watch some free sample videos and find out more about registration following the link above.

7 thoughts on “Find and Delete Duplicate Photos with Power Query”

  1. Actually, the first attempt failed because I needed to include double quotes around the file names. I corrected it in my query and repeated the procedure. Then I returned to my Power Query worksheet and refreshed it – no more duplicates!

  2. Hi Matt,
    That’s nifty! I never thought to include DOS batch commands right in the query. I do something similar for downloaded wallpaper files. I always delete them manually.

    For my next project, MP3s, I am going to try your method!

    Cheers,

    Mitch

  3. Claude Van horn

    This sounds great! but a couple of questions–if you have the same picture stored in two or three places under different file names. when you want to go back and use that image later, how do you know where it is and what you called it? Wouldn’t you have to then go back and copy all the image files into one standard folder and see what each image is? What if you linked an html URL to one of those images? Wouldn’t this procedure break some of them?

    Just saying 🙂

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top