Best practice when dealing with millions of photos

bsacco
Posts: 63
Joined: Sun Jan 02, 2022 9:47 pm

Best practice when dealing with millions of photos

Post by bsacco »

What are the best steps to take when approaching a huge amount of file duplication in family photos and videos?

The easiest way to dedupe photos and videos is to put them into ONE folder and run DCP5. Yes, BUT my folders are now too large and it crashes the program.

Does anyone know the steps to follow and settings required in DCP5 for matching and paring down my huge number of duplicates?
bsacco
Posts: 63
Joined: Sun Jan 02, 2022 9:47 pm

Re: Best practice when dealing with millions of photos

Post by bsacco »

Bump... Still hoping to get a much-needed response to my question.

Anyone?
DigitalVolcano
Site Admin
Posts: 1863
Joined: Thu Jun 09, 2011 10:04 am

Re: Best practice when dealing with millions of photos

Post by DigitalVolcano »

Try just running it in Regular Mode with Same Content for the initial paring down of files. This should cope with large numbers of files. You can even reduce the load at this stage by limiting it to one file type at a time (e.g. jpg).
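
To illustrate why a same-content pass copes with large numbers of files, here is a minimal Python sketch of the general idea. It is not Duplicate Cleaner's actual implementation; the function names and the choice of SHA-256 are assumptions made for illustration. Files are bucketed by size first, and only files that share a size are hashed, one file type at a time.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_same_content(root, extensions=(".jpg",)):
    """Group files under root that have identical content.

    Only files whose extension is in `extensions` are considered,
    which keeps each pass small (one file type at a time).
    """
    extensions = tuple(e.lower() for e in extensions)

    # Step 1: bucket by size; files of different size cannot be identical.
    by_size = defaultdict(list)
    for folder, _dirs, names in os.walk(root):
        for name in names:
            if name.lower().endswith(extensions):
                path = os.path.join(folder, name)
                by_size[os.path.getsize(path)].append(path)

    # Step 2: hash only the files that share a size with another file.
    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) > 1:
            for path in paths:
                by_hash[file_digest(path)].append(path)

    # Keep only groups with at least two members.
    return [sorted(group) for group in by_hash.values() if len(group) > 1]
```

Calling `find_same_content(r"\User\Pictures", extensions=(".jpg",))` returns a list of duplicate groups and deletes nothing.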
bsacco
Posts: 63
Joined: Sun Jan 02, 2022 9:47 pm

Re: Best practice when dealing with millions of photos

Post by bsacco »

I'm not exactly sure why this program has zero support. I mean, why have a forum if you don't respond to customer questions?

I am very disappointed and feel betrayed as a paying customer.
punar
Posts: 13
Joined: Sun May 22, 2016 5:02 pm

Re: Best practice when dealing with millions of photos

Post by punar »

bsacco wrote: "I'm not exactly sure why this program has zero support. I mean, why have a forum if you don't respond to customer questions?"
But you did get a reply from DV just two hours before...

bsacco wrote: "The easiest way to dedupe photos and videos is to put them into ONE folder and run DCP5. Yes, BUT my folders are now too large and it crashes the program."
No, don't put them in one folder.

bsacco wrote: "Best practice when dealing with millions of photos"
Not easy to answer because it can be done in different ways and your scenario could be very different to others.

So let me answer how I would do it if it was my computer and then you can decide for yourself what is relevant.

Location:
Store all photos and videos on an internal disk and use external disks only for backup.
Preferably use the Pictures folder that is for that purpose as the main location.

From now on I will assume your main location is \User\Pictures

Clean the main location
If you have some large folders within \User\Pictures that you know are probably unsorted dupes, move them to a new folder not within \User\Pictures, for example \User\ProbablyDupes

In Duplicate Cleaner
- go to Scan Location and select \User\Pictures and \User\ProbablyDupes
(or since you have millions of files, perhaps do each one separately before selecting both of them)
- go to Scan Criteria and select Regular Mode, Same Content
- Go to Scan and click Start Scan
- Wait
- In the Scan Result, click a file that is within \User\ProbablyDupes
- In the Selection Assistant on the left side, under Mark by location, click the little button "Get selected folder name from duplicate list", and then under "Mark the files in this folder that has duplicates elsewhere" make sure "Also preserve (unmark) files elsewhere" is checked and click Mark
- Go through the list and see if it looks good
What you look for when you skim through is file names and folder names so you don't lose naming changes you have done:
Example 1: if you have duplicate 1 "DCIM1234.jpg" and duplicate 2 "Summer2017.jpg", you want the generic name DCIM1234 to be tagged and deleted.
Example 2: if you have a duplicate inside folder 1 "\pictures\misc" and folder 2 "\pictures\2017\summer\beach", then you want the ones in "\pictures\misc" to be tagged.
- Don't select any more files now
- Under Delete, uncheck "delete to recycle bin". At least one copy of the original photo will always remain on your computer (unless you force Duplicate Cleaner).
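
The "Mark by location" step above can be sketched in Python as follows. This is only an illustration of the rule, not how Duplicate Cleaner works internally; `mark_by_location` and `is_inside` are hypothetical names, and `groups` is assumed to be a list of duplicate groups, each a list of file paths.

```python
import os


def is_inside(path, folder):
    """True if path is located in folder or one of its subfolders."""
    path = os.path.abspath(path)
    folder = os.path.abspath(folder)
    try:
        return os.path.commonpath([path, folder]) == folder
    except ValueError:
        # Raised on Windows when the two paths are on different drives.
        return False


def mark_by_location(groups, folder):
    """Return the paths to mark for deletion.

    In each duplicate group, files inside `folder` are marked only if
    at least one copy exists outside it, so one copy always survives.
    """
    marked = []
    for group in groups:
        inside = [p for p in group if is_inside(p, folder)]
        outside = [p for p in group if not is_inside(p, folder)]
        if inside and outside:
            marked.extend(inside)
    return marked
```

A group whose copies all sit inside the chosen folder is left untouched, which matches the "at least one copy remains" behaviour described above.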

If you still have files in the duplicate list after this, use other criteria:
- From the Selection Assistant on the left side, select chunks of files. For example:
"File name" "contains" "Copy"
"Shortest file names in each group"
"Shortest folder path in each group"
- Also right click a file and select "Mark all files in this folder"
- Also go back to Scan Criteria and select Image Mode, Exact Match, and rescan
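
As a rough illustration of two of these selection criteria, here is a Python sketch. It assumes that "Shortest file names in each group" marks the shortest name, and that a name filter should never mark every file in a group; both are my reading, not confirmed Duplicate Cleaner behaviour, and the function names are hypothetical.

```python
import os


def mark_name_contains(groups, text="Copy"):
    """Mark files whose name contains `text` (case-insensitive),
    but never mark every file in a group."""
    marked = []
    for group in groups:
        hits = [p for p in group
                if text.lower() in os.path.basename(p).lower()]
        if len(hits) < len(group):
            marked.extend(hits)
    return marked


def mark_shortest_names(groups):
    """In each group, mark the file with the shortest file name,
    leaving every other copy unmarked."""
    marked = []
    for group in groups:
        if len(group) > 1:
            marked.append(
                min(group, key=lambda p: len(os.path.basename(p))))
    return marked
```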

After each selection,
* Go through and check what has been selected as described above
* Delete (delete after each chunk)

When the duplicate list is clean, move whatever is left in \User\ProbablyDupes back to \User\Pictures

I guess this will get you started.
Hope you find this useful.
user12345
Posts: 3
Joined: Sat Jan 11, 2025 5:35 pm

Re: Best practice when dealing with millions of photos

Post by user12345 »

@bsacco

Here are a few pointers I can provide on large-scale (as in file numbers and sizes) deduplication. The whole process is onerous and time-consuming *if* you want to ensure each and every unique asset is retained and only duplicates are deleted, so the method I use is very detailed. These are my top tips, as I am now on my second recent major round of data cleansing, having deduped and then disposed of probably close to a dozen storage devices containing TBs of data.

- if possible make sure you have a backup/redundancy of all files - mistakes, software errors and hardware failures happen; also it's likely you have multiple external discs so consider how old and reliable these are before starting - take your time like I do when there is a disc issue

- segmenting the files is a starting point: sort out the gigantic files and put them together; if they are mixed in with other files and if possible, duplicate the folders and then delete all the non-gigantic files from the copy; look at how many there are and whether you can manually sort and delete them first off; you should know the content, so you can quickly assess if they are the same files

- then segment the data by file type and size to start cleansing; grab paper and write down what you've done: for example, starting with .jpeg files, set the software program to small, then medium, then large and so on; if the metadata is still intact use dates also - all aiming to create a dataset that is manageable to cleanse. By manageable I mean what your pc can manage power- and speed-wise and what your discs can manage speed-wise. On this, I just bought a new HD for this round and the process of migrating data to it still took days

- follow the process for the latter point for each of the file types and sizes (ticking them off on that bit of paper) until you begin to really reduce duplicate files and you see the TBs decreasing. Only when you get to sensible file numbers/files size can you really begin broader file deletion criteria

- consider optimising your pc along the way to empty the bin and clear the cache; I also have used file re-naming software if there are too many duplicate file names - again to ensure accuracy; remember hardware can get hot - slowing down things, so be careful; also disable sleep and other pc variables that could disrupt the process
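
The segment-by-type-and-size idea from the tips above can be sketched in Python like this. The size bands are hypothetical values chosen for illustration, so adjust them to whatever your pc and discs can cope with.

```python
import os
from collections import defaultdict

# Hypothetical size bands in bytes (upper limits, checked in order).
SIZE_BANDS = [
    ("small", 1 * 1024 * 1024),     # up to 1 MB
    ("medium", 10 * 1024 * 1024),   # up to 10 MB
    ("large", 100 * 1024 * 1024),   # up to 100 MB
]


def size_band(size):
    """Return the label of the first band that `size` fits into."""
    for label, limit in SIZE_BANDS:
        if size <= limit:
            return label
    return "gigantic"


def segment(root):
    """Map (extension, size band) -> list of file paths under root."""
    segments = defaultdict(list)
    for folder, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            ext = os.path.splitext(name)[1].lower() or "(none)"
            band = size_band(os.path.getsize(path))
            segments[(ext, band)].append(path)
    return segments
```

Printing `len(paths)` for each key of `segment(root)` gives the same tick-off list you would otherwise keep on paper, one line per file type and size band.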

Hope this helps and addresses the tumbleweed

Good luck
bsacco
Posts: 63
Joined: Sun Jan 02, 2022 9:47 pm

Re: Best practice when dealing with millions of photos

Post by bsacco »

Thank you for the much needed response.

I’m currently stuck running a dedupe on that large folder in image mode where I’ve completed 1m files of 2.7m files but it’s been chugging along for one week so far. I paused the scan to see if on the home page I could try to load and save the current scan but DCP5 seems stuck and the drop down menu for load and save scan refuses to become active for selection. Not sure if I should try to wait to see if the program will respond or if I just need to blow everything up and start over by dropping all my 2.7m photos and videos into separate folders as you suggested. Can you please advise what would be best at this point?
bsacco
Posts: 63
Joined: Sun Jan 02, 2022 9:47 pm

Re: Best practice when dealing with millions of photos

Post by bsacco »

Btw, the first thing I did was attempt to rename all my photos and videos using the prefix: YYYY-MM-DD. I used different programs to attempt to extract the date taken. If the file did not have any exif data I tried to manually name it correctly.
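
For reference, the renaming scheme can be sketched in Python as below. This is not one of the programs I used; the underscore separator and the function names are assumptions, and reading the real date taken from EXIF would need a third-party library, so the fallback here only uses the file's modification time.

```python
import re
from datetime import date
from pathlib import Path

# Matches names that already start with YYYY-MM-DD plus a separator.
DATE_PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2}[ _-]")


def prefixed_name(filename, taken):
    """Return filename with a YYYY-MM-DD prefix, unless it has one."""
    if DATE_PREFIX.match(filename):
        return filename
    return f"{taken.isoformat()}_{filename}"


def fallback_date(path):
    """Rough guess at the date when no date taken is available.

    Uses the modification time, which some copy or edit operations
    change, so treat the result with caution.
    """
    return date.fromtimestamp(Path(path).stat().st_mtime)
```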

So, as I said previously, I just need some advice on how to deal with separating folders and applying the correct procedures for deduping accurately.
user12345
Posts: 3
Joined: Sat Jan 11, 2025 5:35 pm

Re: Best practice when dealing with millions of photos

Post by user12345 »

I too have had to restart the process but nowhere near the timescales/size of what you are doing

My advice is to stop and take the hit (and maybe learning), time-wise

Noted you've renamed them; I use a shorter format, YYMMDD, FYI. Again, be careful of extending file names and paths character-wise: I'm on DOS/Windows, where there is a limit on the number of characters in a path, and exceeding it causes a system error

And on from my note above, grab paper and note down all the different file types within your folders, then segment, segment, segment. Deduping a folder with half-hour-long videos and selfies together wouldn't be the way I'd do this; likewise deduping different file types together (albeit I have NEFs converted to .jpgs and .mts to .mov, just a mare)
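
A quick way to spot paths that are close to the limit before renaming anything is sketched below in Python. The 260-character figure is the classic Windows MAX_PATH value (it includes the terminating null); `find_long_paths` and its `margin` parameter are hypothetical names for illustration.

```python
import os

# Classic Windows limit for a full path, including the terminating null.
MAX_PATH = 260


def find_long_paths(root, limit=MAX_PATH, margin=0):
    """Return (length, path) for files whose absolute path has at
    least `limit - margin` characters, longest first."""
    hits = []
    for folder, _dirs, names in os.walk(os.path.abspath(root)):
        for name in names:
            path = os.path.join(folder, name)
            if len(path) >= limit - margin:
                hits.append((len(path), path))
    return sorted(hits, reverse=True)
```

Setting `margin=11` would leave room for adding a `YYYY-MM-DD_` prefix to each file name.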

In segmenting, I try to get my head around the data first off - as I have back-ups, zip back-ups, back-ups of back-ups, half back-ups where I've given up the process, deduped and moved to another folder etc - so adding to the total mess above on my side.

I start numbering folders either manually or using a free program as appropriate

Remember there are two parts to this process that I see, deduping and filing correctly

Build a master folder (say called 00 Master) and then decide and plan how you want to file them. For family stuff I now file by year and then folder by occasion (2024/Summer Garden Party), plus catch-alls (2024/Videos in the year); I don't use month numbers as they are too techie looking. Again, tech sees this one way, users another: this is about creating memories around occasions, so the folders may well contain a variety of media formats (video, pictures, phone selfies, audio and more). It's building an experience where you can relive moments, so think about your final presentation

Then use a folder size analysis program to scan the disc (takes a while to do) and try to find the folders with the most files in them that are in a sensible order/type - copy this to the master (clean-up and dedupe if necessary), protect and dedupe against this folder by segment
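
If you don't have a folder size analysis program to hand, the same kind of ranking can be sketched in a few lines of Python. This is only an illustration with a hypothetical function name; it counts the files directly inside each folder and totals their sizes.

```python
import os


def folders_by_file_count(root, top=10):
    """Return the `top` folders under root with the most files directly
    inside them, as (file count, total bytes, folder) tuples."""
    stats = []
    for folder, _dirs, names in os.walk(root):
        if not names:
            continue
        total = 0
        for name in names:
            try:
                total += os.path.getsize(os.path.join(folder, name))
            except OSError:
                pass  # unreadable or vanished file; skip its size
        stats.append((len(names), total, folder))
    stats.sort(reverse=True)
    return stats[:top]
```

The folders at the top of the list are the candidates to review and copy to the master first.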

I'm cleaning up just over twenty-five years' worth of stuff, having started each year and then given up the process. Hoping to nail it this session

(I've spent decades helping brands/tech companies communicate - talk in an appropriate way that users both like and understand - which is different to in-company talk; there should be use-case videos covering all the above issues. That said, I have been using Duplicate Cleaner for years and years and finally upgraded to Pro; my experience was that the program was near totally different to the free version which I knew so well and loved. The UI and UX seemed totally different, and I was incredibly disappointed to have paid and been faced with what I downloaded. Again, no comms to explain this difference in advance of upgrading)

@bsacco hope this assists
bsacco
Posts: 63
Joined: Sun Jan 02, 2022 9:47 pm

Re: Best practice when dealing with millions of photos

Post by bsacco »

Thanks user12345.

I do have some simple questions.

Should I separate photos from videos, or NOT separate them? DCP5 has different modes of de-duping: Regular, Image and Video. My confusion starts with: should I only be throwing photos at DCP5 when de-duping using IMAGE MODE? And the same... should I only be throwing videos at DCP5 when de-duping using VIDEO MODE? Is it appropriate to co-mingle photos and videos together when using IMAGE MODE?

Do you see how this can get very confusing?

What is Regular Mode? Does it handle anything you throw at it? If so, does it do a worse job than if I separated all the photos out first and then threw them at DCP5 using IMAGE MODE?

These are the maddening questions I have about this program that remain unanswered.