Any tips on speeding up finding duplicate images?

Soeroah
Posts: 11
Joined: Sat Apr 16, 2022 5:08 pm

Any tips on speeding up finding duplicate images?

Post by Soeroah »

I've been using this software for about two years now, and over time my documents folder has been ballooning. I tend to add metadata to files to make things more searchable, so when I use Duplicate Cleaner I make sure I'm scanning that too: if a 'new' image turns out to be a copy of one I already have stored, I want whichever one I keep to be a) the highest quality version and b) the one with the most up-to-date, useful metadata/comments for added searchability.

I tend to end up running the software multiple times a week, but at this stage it's taking close to or in excess of an hour each time. There's a virtual folder function, but it doesn't work for Image Mode searching, so it's essentially useless for my purposes - I've tried doing a search using the general (non-image) mode and it's not as reliable, unfortunately.

I've taken to keeping a second copy of my folders with a number of items removed that I don't think need to be scanned very often, and scanning against that to try to reduce my scan times, but I was wondering if there are any other tips for speeding up the scan without sacrificing reliability.

Thanks.


Edit: Also, my Windows index randomly reset itself shortly before I started running this scan - does that also affect the image metrics caching? This scan taking so much longer than I'm used to would make sense if the DC caching had somehow reset - I swear this took about 15 minutes the other day, but it's going to be over an hour today.
DigitalVolcano
Site Admin
Posts: 1863
Joined: Thu Jun 09, 2011 10:04 am

Re: Any tips on speeding up finding duplicate images?

Post by DigitalVolcano »

Which bit of the scan is slow? The image metrics are cached, but the metadata is not currently. People have said the metadata gathering part of the scan can be slow, which is why we're aiming to add metadata caching in a future update.
Moving/copying your files to a new location and scanning that will be slower, as the caches are only valid when a file is in the same location as before.

The Windows indexing shouldn't affect the caching.
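To illustrate why a moved or copied file misses the cache - a purely hypothetical Python sketch, not DCP's actual internals - imagine the cache keyed on the file's full path plus its size and modified time:

import os

def cache_key(path):
    # Hypothetical key: absolute path plus size and modified time.
    # Move or copy the file elsewhere and the path part changes,
    # so the previously computed image metrics can't be reused.
    st = os.stat(path)
    return (os.path.abspath(path), st.st_size, int(st.st_mtime))

metrics_cache = {}  # cache_key -> image metrics computed on an earlier run

def image_metrics(path, compute):
    key = cache_key(path)
    if key not in metrics_cache:
        metrics_cache[key] = compute(path)  # the slow part, only run on a cache miss
    return metrics_cache[key]

Under that kind of scheme, scanning a second copy of your folders pays the full cost the first time through, even though the images themselves are identical.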
canman
Posts: 6
Joined: Fri Jul 07, 2023 9:38 am

Re: Any tips on speeding up finding duplicate images?

Post by canman »

Hi There,

Any progress on your efforts to build metadata caching into the program? For those of us who use DCP primarily for image cleanup, this is arguably the #1 feature we would like to see added (it can take upwards of 20 minutes to scan and rescan the metadata on a large collection of photos). For me at least, I almost always use the metadata (in particular the "Date Taken" field) to find duplicates, as matching on the image itself, the date modified/created, or the filename usually produces a lot of false positives.

Cheers,
Canman
MegMac
Posts: 20
Joined: Tue Sep 14, 2021 6:00 pm

Re: Any tips on speeding up finding duplicate images?

Post by MegMac »

If you are checking for duplicates in your own collection on a regular basis, what are the DIFFERENCES you are looking for between images?
I can probably help you when I know the answer.
canman
Posts: 6
Joined: Fri Jul 07, 2023 9:38 am

Re: Any tips on speeding up finding duplicate images?

Post by canman »

Hi There

I generally use the exif date/time field. I find this is the most effective way to identify duplicates. Most of my duplicates are caused by my uploads to OneDrive and/or Google Drive going screwy on occasion.

Cheers
Canman
MegMac
Posts: 20
Joined: Tue Sep 14, 2021 6:00 pm

Re: Any tips on speeding up finding duplicate images?

Post by MegMac »

Canman,
Sorry it took me so long to reply.
The Exif DateTimeOriginal field will not change or be stripped of the date just by uploading/downloading to OneDrive or Google Drive.
So, I don't understand how that would help - or maybe I just don't understand the problem.

If photos do have an Exif DateTimeOriginal, the Date taken (Date/Time Taken) will be accurate. If you take burst photos or click the shutter more than once in the same second, use the subseconds field for more accurate results. However, some or all subseconds digits can get stripped from the metadata (and someday I'll figure out which applications or cloud storage services do that).
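If you want to check those fields yourself outside DCP, a minimal Python sketch (assuming Pillow is installed; 36867 and 37521 are the standard tag IDs for DateTimeOriginal and SubSecTimeOriginal) looks like this:

from PIL import Image

DATETIME_ORIGINAL = 36867   # EXIF DateTimeOriginal
SUBSEC_ORIGINAL = 37521     # EXIF SubSecTimeOriginal (sometimes stripped)

def capture_time_key(path):
    # Returns e.g. ('2023:07:01 14:02:33', '407') for grouping shots,
    # or None if the photo carries no DateTimeOriginal at all.
    exif = Image.open(path).getexif().get_ifd(0x8769)  # the Exif sub-IFD
    taken = exif.get(DATETIME_ORIGINAL)
    if not taken:
        return None
    return (taken, exif.get(SUBSEC_ORIGINAL, ""))

Group photos by that key and anything sharing the same second (and subsecond, where present) is a likely duplicate candidate.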
canman
Posts: 6
Joined: Fri Jul 07, 2023 9:38 am

Re: Any tips on speeding up finding duplicate images?

Post by canman »

No problem on the delay responding. I appreciate any response no matter how much time has passed.

So just to clarify... my issue isn't with the EXIF data being stripped or anything like that. OneDrive and Google Drive are fine at keeping the EXIF data... What they are not so good at is avoiding uploading multiple copies of the same image (despite having supposed deduplication functionality themselves). My issue/request is solely about the amount of time it takes to scan (and then rescan) the metadata. With over 50K photos in my collection, it can sometimes take upwards of 30 minutes to run through the EXIF data scanning process. Most of the time, I need to do multiple runs with different parameters (often, but not always, using the EXIF date/time as the main parameter) to find the dupes. So having the program cache the metadata so that it doesn't have to repeat the scan every time would be a very welcome and important feature for me (and a lot of other people, I think).
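Roughly what I'm imagining - a hypothetical sketch only, not how DCP actually works - is a small persistent cache keyed on path plus size/modified time, so a rescan only re-reads EXIF for files that are new or have changed:

import json, os

CACHE_FILE = "exif_cache.json"   # example name only

def load_cache():
    try:
        with open(CACHE_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

def scan(paths, read_exif):
    cache = load_cache()
    for path in paths:
        st = os.stat(path)
        stamp = [st.st_size, int(st.st_mtime)]
        entry = cache.get(path)
        if entry and entry["stamp"] == stamp:
            continue                      # unchanged since the last run: skip the slow EXIF read
        cache[path] = {"stamp": stamp, "exif": read_exif(path)}
    with open(CACHE_FILE, "w") as f:
        json.dump(cache, f)
    return cache

where read_exif could be something like the capture_time_key sketch earlier in the thread. With 50K photos, only the handful that actually changed between runs would need to be touched.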

I hope that makes it clearer and I look forward to hearing back from you whenever you have a chance.

Cheers
Canman
canman
Posts: 6
Joined: Fri Jul 07, 2023 9:38 am

Re: Any tips on speeding up finding duplicate images?

Post by canman »

Sorry... and just to add, part of the problem is that OneDrive and/or Google Drive sometimes add hundreds if not thousands of photos back into my collection from the cloud after I have used DCP to remove them from the drive on my local machine. So I occasionally need to re-run your program 3-4 times with various parameters to get rid of the dupes. Hence the request for caching the EXIF data to speed this whole process up...

Hope that helps to clarify...