I was trying to clean up my hard disk drive on my PC the other day. I use a program called WinDirStat to show me graphically what is taking space (as an aside, this is great software and you can download it here). After running the software I noticed that I had stored photos from my phone in a few different places. Each of the 3 sections below are images from my phone stored in different locations
Unfortunately the names of the files were different in the different locations as can be seen side by side below. I think this is as a result of me using different software to transfer the photos across from time to time. Apple really doesn’t go out of its way to make it easy, of course.
So the problem is, how could I delete the doubled up files without deleting any photos that were truly different. There are some great image management tools out there – I personally love EXIF Tool https://www.sno.phy.queensu.ca/~phil/exiftool/ But I wanted to use Power Query to solve this problem.
The Approach I Took
- Create a query that links to each folder using Power Query for Excel, and set the list to “do not load”.
- Focusing on 1 folder at a time, I joined the list in the first query to the list from another query to flush out the duplicates.
- I then generated a batch file from the resulting list of matching (duplicate) files.
- I ran the batch file to delete the duplicates
Load each Folder of File Names
This is pretty standard Power Query stuff. Just select, Data\Get Data\From File\From Folder. Once the query is created, set the query to “do not load”. Repeat for each other main folder location. The only thing I did that was a bit “non-standard” was to extract the file size from the attributes column. I just clicked the expand button (1) and then deselected all the options just keeping size (2 below).
Merge Two Queries
The next thing was to merge 2 queries together. I did this by selecting 3 columns to match in the merge. I figure if the extension, the date modified, and the size are all the same, then it must be the same image. I did an “inner join”, hence just keeping those that match.
Note that I didn’t use date created, as this is the date the file was copied to my PC rather than the date the photo was created on the phone.
In this merge query, I also created a full path to each file I wanted to delete by merging the path and file name column as shown below.
And I then prepended the DOS command del (including a space) to the front of the file names.
This left me with the exact batch commands I needed to delete the duplicates
Before moving on, I decided to do some quick checks to make sure what I had done was what I expected. I duplicated the merge query. In this new copy, I edited the query to give me a side by side list of the full file path and names for both folder locations. I could then use this list to to a spot check on the images.
I picked a few rows at random and checked both images. Everything was fine.
Create the Batch File
This is pretty simple. I simply cut and pasted the “to be deleted” list into notepad and then saved the file. I called the file c:\users\matt\delme.bat
Finally I ran the batch file (found it in Windows Explorer and double clicked to execute). Actually the first attempt didn’t work because I had to add double quotes around the file names. I fixed that in my query and repeated the process. Then I went back to my Power Query workbook and hit refresh – no more duplicates
Before 416.6 GB space used
After deletions it was 373.3 GB, a space savings of more than 40GB – well worth the effort.