Re: Best practice when dealing with millions of photos

Posted: Sat Jan 25, 2025 8:05 pm
by bsacco
OK, let me address your proposed solution one step at a time.

You propose putting all my photos (over 1 million of them) into (Pictures) otherwise known as C:\Users\bobsa\OneDrive.

Well, I cannot do this because I have over 1 million photos and videos and only a 500 GB SSD C Drive.

So, it is impossible to put all my photos and videos on my C Drive.

Currently, all my photos and videos are spread across multiple folders (each holding at most 75,000 files) on two large internal 4 TB HDDs.

How do I go about finding all the best quality original photos and videos among all the dupes across all the folders and drives?

Re: Best practice when dealing with millions of photos

Posted: Mon Jan 27, 2025 6:38 pm
by bsacco
Bump?

It's not very responsive here in this forum.

Am I in the right place?

Re: Best practice when dealing with millions of photos

Posted: Mon Jan 27, 2025 6:42 pm
by bsacco
Punar-

Since I'm dealing with such a large volume of files (photos and videos), shouldn't I be separating by file type first? That is, separate all the photos from the videos BEFORE deduping?

The other question is once I separate all these files what are the recommended settings in DCP5 for deduping to a Master folder which will include only one copy of the best photo available?

Any tips are appreciated.

Best, bob

---------------------------------------------------------------------------------------------------------------


punar wrote: Tue Jan 07, 2025 12:48 pm
I'm not exactly sure why this program has zero support. I mean, why have a forum if you don't respond to customer questions?
But you did get a reply from DV just two hours before...
The easiest way to dedupe photos and videos is to put them into ONE folder and run DCP5. Yes, BUT my folders are now too large and it crashes the program.
No, don't put them in one folder.
Best practice when dealing with millions of photos
Not easy to answer because it can be done in different ways and your scenario could be very different to others.

So let me answer how I would do it if it was my computer and then you can decide for yourself what is relevant.

Location:
Store all photos and videos on an internal disk and use external disks only for backup.
Preferably use the Pictures folder that is for that purpose as the main location.

From now on I will assume your main location is \User\Pictures

Clean the main location
If you have some large folders within \User\Pictures that you know are probably unsorted dupes, move them to a new folder not within \User\Pictures, for example \User\ProbablyDupes

In Duplicate Cleaner
- go to Scan Location and select \User\Pictures and \User\ProbablyDupes
(or since you have millions of files, perhaps do each one separately before selecting both of them)
- go to Scan Criteria and select Regular Mode, Same Content
- Go to Scan and click Start Scan
- Wait
- In the Scan Result, click a file that is within \User\ProbablyDupes
- In the Selection Assistant on the left side, under Mark by location, click the little button "Get selected folder name from duplicate list", and then under "Mark the files in this folder that has duplicates elsewhere" make sure "Also preserve (unmark) files elsewhere" is checked and click Mark
- Go through the list and see if it looks good
What you look for when you skim through is file names and folder names so you don't lose naming changes you have done:
Example 1: if you have duplicate1 "DCIM1234.jpg" and duplicate2 "Summer2017.jpg", you want the generic name DCIM1234 to be tagged and deleted.
Example 2: if you have a duplicate inside folder1 "\pictures\misc" and folder2 "\pictures\2017\summer\beach", then you want the ones in "\pictures\misc" to be tagged.
- Don't select any more files now
- Under Delete, uncheck "delete to recycle bin". At least one copy of the original photo will always remain on your computer (unless you force Duplicate Cleaner).

If you still have files in the duplicate list after this, use other criteria
- Use the Selection Assistant on the left side to select chunks of files. For example:
"File name" "contains" "Copy"
"Shortest file names in each group"
"Shortest folder path in each group"
- Also right click a file and select "Mark all files in this folder"
- Also go back to Scan Criteria and select Image Mode, Exact Match, and rescan

After each selection,
* Go through and check what has been selected as described above
* Delete (delete after each chunk)

When the duplicate list is clean, move whatever is left in \User\ProbablyDupes back to \User\Pictures

I guess this will get you started.
Hope you find this useful.
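For anyone who prefers to see the "Mark by location" idea as logic rather than clicks, here is a minimal Python sketch. It is not how Duplicate Cleaner works internally; it only illustrates the rule described above: mark the copies inside one folder, but only when another copy of the same file survives elsewhere. The function name mark_in_folder and the shape of the input (a list of duplicate groups, each a list of paths) are assumptions made for illustration.

```python
from pathlib import PureWindowsPath


def mark_in_folder(groups, folder):
    """Return the set of paths that are safe to mark for deletion.

    groups: list of duplicate groups; each group is a list of paths
            whose content is identical (as reported by a scan).
    folder: the folder whose copies should be removed,
            for example \\User\\ProbablyDupes.

    A file is marked only if its group keeps at least one copy
    outside `folder`, so one copy always remains.
    """
    folder = PureWindowsPath(folder)
    marked = set()
    for group in groups:
        # Copies that live somewhere under `folder`
        inside = [p for p in group if folder in PureWindowsPath(p).parents]
        # Copies that live anywhere else
        outside = [p for p in group if p not in inside]
        if outside:
            marked.update(inside)
    return marked
```

A group whose copies all sit inside the folder is left untouched, which mirrors the "Also preserve (unmark) files elsewhere" safeguard.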

Re: Best practice when dealing with millions of photos

Posted: Fri Jan 31, 2025 1:45 pm
by MegMac
Organizing digital photo and video collections is a very complicated process if one wants to do it accurately, thoroughly, and efficiently.
De-duping is usually the biggest and most complicated task, especially if information stored in folder names or added as metadata needs to be preserved.

I have been collaborating with DCP5's developers since version 5 was in beta, and have contributed to many of the changes and improvements made since then.

I offer courses for professional photo organizers, and my Legacy Workflow Series courses rely on DCP5 for de-duping and for most other tasks as well.
DCP5 is very versatile and can be used for so much more than duplicate removal.

My courses are endorsed by the lead developer at Digital Volcano.

Some DIYers have purchased my courses, but they are designed for pros. I won't try to 'sell' you my courses, but I suggest you watch the course preview videos. They will provide insight into many topics in this discussion.

Photo Organizing Stuff YouTube Channel: https://www.youtube.com/channel/UCW17gL ... hVQ2WswIrA

Courses.PhotoorganizingStuff.com: https://courses.photoorganizingstuff.com/

Re: Best practice when dealing with millions of photos

Posted: Fri Jan 31, 2025 2:05 pm
by MegMac
bsacco wrote: Sun Jan 05, 2025 7:46 pm I'm not exactly sure why this program has zero support. I mean, why have a forum if you don't respond to customer questions?

I am very disappointed and feel betrayed as a paying customer.
De-duping photos accurately and thoroughly is incredibly complicated. There are so many situations and needs and goals.
If the admin of this forum were to provide step-by-step instructions, it would be a full-time job.
And DCP5 is very complicated - it needs to be because de-duping is so complicated (to do well). I know, because teaching duplicate removal and digital organizing IS my full-time job.

Re: Best practice when dealing with millions of photos

Posted: Fri Jan 31, 2025 2:16 pm
by MegMac
user12345 wrote: Mon Jan 13, 2025 12:38 pm I too have had to restart the process but nowhere near the timescales/size of what you are doing

My advice is to stop and take the hit (and maybe learning), time-wise

Noted you've renamed them, I use a shorter format - YYMMDD fyi. Again be careful of extending file names and paths, number of character-wise, as (I'm on DOS/Windows) for me there is a limit on number of characters, again which causes a system error

And on from my note above, grab paper and note down all the different file types within your folders, then segment, segment, segment. Deduping a folder with half hour long videos and selfies wouldn't be the way I'd do this; deduping different file types likewise (albeit I have NEFs converted to .jpgs and .mts to .mov, just a mare)

In segmenting, I try to get my head around the data first off - as I have back-ups, zip back-ups, back-ups of back-ups, half back-ups where I've given up the process, deduped and moved to another folder etc - so adding to the total mess above on my side.

I start numbering folders either manually or using a free program as appropriate

Remember there are two parts to this process that I see, deduping and filing correctly

Build a master folder (say called 00 Master) and then decide and then plan how you want to file them - by year/month (for family stuff I now use by year and then folder by occasion (2024/Summer Garden Party) and then catch all's - 2024/Videos in the year); I don't use month numbers - too techie looking); again tech sees this one way, users another - this is about creating memories around occasions, so the folders may well contain a variety of media formats (video, pictures, phone selfies, audio and more - it's building an experiential experience where you can relive moments so think about your final presentation

Then use a folder size analysis program to scan the disc (takes a while to do) and try to find the folders with the most files in them that are in a sensible order/type - copy this to the master (clean-up and dedupe if necessary), protect and dedupe against this folder by segment

I'm cleaning up just over twenty five years worth of stuff having each year started and then given up the process. Hoping to nail it this session

(I've spent decades helping brands/tech companies communicate - talk in an appropriate way that users both like and understands - which is different to in-company talk; there should be user case videos covering all the above issues. That said I have been using Duplicate Cleaner for years and years and finally upgraded to Pro; my experience was that the program was near totally different to the free version which I knew so well and loved, the UI and UX seemed totally different and was incredibly disappointed to have paid and been faced with what I downloaded. Again no comms to explain this difference in advance of upgrading)

@bsacco hope this assists
Dear User12345,
Your suggestions reflect common practices, but I disagree with most common practices for duplicate removal.

Keeping the original folder structure intact helps make duplicate removal easier and more efficient.

Keeping the original filenames is also important. A file's name is a very useful piece of information and can help in many ways, including helping to identify 'Undated' (no EXIF DateTimeOriginal) versions of dated photos.

I do rename photos, but only when all de-duping is complete. Do not rename by Date Taken unless all 'Undated' photos have been separated out. Many photos have a 'false' Date Taken because they have a date in a field like EXIF Modified date, but not an EXIF DateTimeOriginal.
https://courses.photoorganizingstuff.co ... ted-photos

Re: Best practice when dealing with millions of photos

Posted: Sun Feb 02, 2025 9:44 pm
by bsacco
OK, to document the process, I have created this method for organizing a LARGE family photo/video collection AFTER a computer CRASH and/or CORRUPTION using easy-to-use plug-and-play software (no coding or scripts).

My method is built out of old-school hard knocks as I failed to find any useful information online on how to approach de-duping large unorganized projects. I am seeking to post this as a living document to be improved upon as I am not an expert in DCP5 software or any other software.

Please feel free to add efficiency or any better methods as you see fit.

1) Create two separate sets of numbered folders to hold your LOOSE image and video files. Example:

IMAGES_1
IMAGES_2
IMAGES_3…all the way up to IMAGES_45 depending on how many files you have. I had 2.7M images. Each folder was holding about 30k images.

Same for Videos:

VIDEOS_1
VIDEOS_2
VIDEOS_3…all the way up to VIDEOS_45

Then, create two MASTER FOLDERS for later….

MASTER_PHOTOS (notice the name change from image to photos here because now we are dealing with original family photos)
MASTER_VIDEOS
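The method above needs no scripts, but for readers who do want to automate the bucketing in step 1, a rough Python sketch follows. It only plans which numbered folder each file would go to; it moves nothing. The extension lists, the function name plan_buckets, and the 30,000-per-folder default are assumptions taken from the numbers in this post, not settings of any particular program.

```python
from pathlib import Path

# Assumed extension lists; extend them to match your own collection.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tif", ".tiff", ".heic", ".nef"}
VIDEO_EXTS = {".mp4", ".mov", ".avi", ".mts", ".wmv", ".mkv"}


def plan_buckets(paths, per_folder=30_000):
    """Map each image/video path to a folder name such as IMAGES_1 or VIDEOS_2.

    Files with other extensions are skipped. Nothing is moved on disk;
    the result is just a plan you can review first.
    """
    plan = {}
    counts = {"IMAGES": 0, "VIDEOS": 0}
    for p in paths:
        ext = Path(p).suffix.lower()
        if ext in IMAGE_EXTS:
            kind = "IMAGES"
        elif ext in VIDEO_EXTS:
            kind = "VIDEOS"
        else:
            continue
        # Start a new numbered folder every `per_folder` files.
        index = counts[kind] // per_folder + 1
        plan[p] = f"{kind}_{index}"
        counts[kind] += 1
    return plan
```

Reviewing the plan before moving anything keeps the original folder structure intact until you are sure you want to flatten it.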


2) As far as I know, NO SOFTWARE EXISTS that can differentiate Screenshots, logos, etc… from photos. So, I use DCP5 to separate these.

How?
Open DCP5. For photos run IMAGE MODE. For Videos run VIDEO MODE.

In Selection Assistant I use the following settings:
Mark by Group = All but one file in each group
Preference marking =
Smallest image dimensions
Smallest file in each group
Longest file names in each group
File name contains “copy” (option here, I have lots of files that have “copy” in file name)
File name contains “duplicate” (option here, I have lots of files that have “duplicate” in file name)
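To make the effect of these preferences concrete, here is a small Python sketch of "all but one file in each group". It is my own illustration, not DCP5's algorithm: the Candidate record, the function names, and especially the order in which the preferences are applied are assumptions. The keeper is the file the preferences above would mark last: not named "copy" or "duplicate", largest dimensions, largest file size, shortest name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One file in a duplicate group (hypothetical record for illustration)."""
    name: str
    width: int
    height: int
    size_bytes: int


def keep_rank(c):
    """Higher tuple = better candidate to keep."""
    lowered = c.name.lower()
    flagged = "copy" in lowered or "duplicate" in lowered
    # Tuple comparison applies the preferences in this (assumed) order.
    return (not flagged, c.width * c.height, c.size_bytes, -len(c.name))


def mark_all_but_one(group):
    """Return (keeper, marked): one file to keep and the rest to mark."""
    keeper = max(group, key=keep_rank)
    marked = [c for c in group if c is not keeper]
    return keeper, marked
```

For a group holding "IMG_0001 copy.jpg", "IMG_0001.jpg" and a downsized "IMG_0001_small.jpg", this sketch keeps "IMG_0001.jpg" and marks the other two.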


Start first de-dupe pass running EXACT MATCH. Under View, select Thumbnail View. Inspect each group of images. If the group features a Screenshot, logo, or a non-family photo you can mark the entire group for deletion or transfer by moving to another folder of your choice.

Rinse and repeat…now using VERY CLOSE MATCH…. Delete or transfer selected images….

Rinse and repeat…now using GOOD MATCH…. Delete or transfer selected images….

When complete, you should have separated all IMAGES (screenshots, logos) from PHOTOS (family photos). Repeat same for Videos.

Be patient. This is a long process, but you will end up with a significantly pared-down amount of work ahead of you.

3) Now, you should have a bunch of folders filled with loose images and videos
Example:
IMAGES_1 (approx. 30k images.)
IMAGES_2 (approx. 22k images.)
IMAGES_3 (approx. 27k images.)
Etc..

Same for Videos:

VIDEOS_1 (approx. 17k videos.)
VIDEOS_2 (approx. 26k videos.)
VIDEOS_3 (approx. 21k videos.)

4) NAMING the Files with DATE prefixes and EXIF data will help you organize your photos and videos automatically.


Step 1

Run NamEXIF. It is a bulk renamer utility that searches your files in bulk/quickly with one click and renames to your PREFIX naming convention of choice.

I use the prefix naming convention of YYYY_MM_DD_ . Many people don’t understand the advantages of using underscores.

Key benefits of using underscores in file names:
Compatibility:
Most operating systems and software can handle underscores without issues, unlike spaces which can sometimes cause errors when processing filenames.
Clarity and Organization:
Underscores clearly separate words in a filename, making it easier to understand the file's content at a glance.
Coding Efficiency:
When working with scripts or programs that interact with files, using underscores can simplify file path manipulation and identification.

Here’s the great thing about NamEXIF. If your file does not have any EXIF data it skips that file and leaves it unchanged.

Step 2

For the remaining files that NamEXIF leaves alone, I run a program called Bulk Rename Utility. It will load up to about 30k files in bulk (without crashing), and you can fill in the rest of your date information (in bulk) using the following settings:

Go to Auto Date (8).
Mode = Prefix
Type = Item Date
Fmt = Custom
Custom = %Y_%m_%d_ (this produces the following prefix format of YYYY_MM_DD_)

Now in theory, all your images and videos should have date prefixes.
Example: 2001_08_22_Olivias_first_Bday or 2006_10_28_File_032821
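For readers comfortable with a little scripting, the same YYYY_MM_DD_ prefix can be produced in Python with the identical format string. This is only a sketch of the naming rule, not a replacement for either renamer. The function name date_prefixed is made up, and using the file's modified time as the "item date" is an assumption; as noted earlier in this thread, file dates can be 'false', so treat the result with care for photos that lack EXIF dates.

```python
from datetime import datetime


def date_prefixed(name, timestamp):
    """Return `name` with a YYYY_MM_DD_ prefix built from `timestamp`.

    timestamp: seconds since the epoch, e.g. os.path.getmtime(path)
               (assumed here to stand in for the file's "item date").
    A name that already starts with that exact prefix is returned
    unchanged, so running the rename twice does not double the prefix.
    """
    prefix = datetime.fromtimestamp(timestamp).strftime("%Y_%m_%d_")
    return name if name.startswith(prefix) else prefix + name
```

With a timestamp that falls on 28 October 2006, "File_032821.jpg" becomes "2006_10_28_File_032821.jpg".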

5) I have not yet proceeded to this point, but I assume you would use DCP5 to COMPARE image and video Folders for Dupes? This would leave you with deduped original prefix dated Photos and Videos.

If anyone knows how to do this with the proper settings, or has a better way or software please let me know.
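In case it helps anyone answering, this is roughly what a cross-folder "same content" comparison does, sketched in Python. It is not DCP5's implementation and the function names are made up. It walks every folder you give it, groups files by size first (cheap), then confirms matches with a SHA-256 hash of the content. It only finds byte-identical files; resized or re-saved versions of a photo would still need an image-comparison mode.

```python
import hashlib
import os
from collections import defaultdict


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large videos do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_groups(roots):
    """Return groups of byte-identical files found anywhere under `roots`."""
    by_size = defaultdict(list)
    for root in roots:
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                by_size[os.path.getsize(path)].append(path)

    by_hash = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size cannot have a byte-identical twin
        for path in paths:
            by_hash[(size, sha256_of(path))].append(path)

    return [sorted(group) for group in by_hash.values() if len(group) > 1]
```

Because every folder is indexed in one pass, a duplicate is caught no matter which two of the numbered folders its copies sit in.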

6) After completing step 5, I plan to use a program called PhotoMove Pro 2.5. It automatically organizes your loose photos/videos into a hierarchical folder structure by YEAR, with subfolders by month. Example:
2006
2006_01
2006_02
2006_03 etc….
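As a sanity check on the target layout, here is a short Python sketch that derives the year/month destination straight from the YYYY_MM_DD_ prefix created in step 4. It is not how PhotoMove works (that program is a separate tool); the function name destination and the "Undated" fallback folder are hypothetical names chosen for this example.

```python
import re
from pathlib import PurePath

# Matches the YYYY_MM_DD_ prefix at the start of a file name.
PREFIX = re.compile(r"^(\d{4})_(\d{2})_\d{2}_")


def destination(filename, undated="Undated"):
    """Return the relative folder path for a date-prefixed file name.

    Files without the prefix go to a separate `undated` folder
    (a hypothetical name) instead of being guessed at.
    """
    match = PREFIX.match(filename)
    if not match:
        return PurePath(undated) / filename
    year, month = match.groups()
    return PurePath(year) / f"{year}_{month}" / filename
```

A file named 2006_10_28_File_032821.jpg would land in 2006, subfolder 2006_10.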

CONCLUSION: As I said before, I have created this process using pre-existing plug-and-play software. I tried to build in as much efficiency and as little work as possible for this large project.
If anyone has any ideas about how to make this process simpler with less work, please add your comments below.

Re: Best practice when dealing with millions of photos

Posted: Mon Feb 03, 2025 3:06 am
by bsacco
Hi Punar,

Yes, your post was helpful in getting me started.

I documented my ongoing process so far with the last post in this thread. Hopefully, that will give you an idea on where I'm headed and perhaps you can provide more guidance as to working with DCP5?

The one area that confuses me is the LAST STEP, where you have to COMPARE weeded-out dupes vs. clean files across multiple folders (I'm dealing with 50 folders' worth). I'm UNSURE that I'm actually catching ALL the dupes across all these folders. What settings in DCP5 guarantee it?

If you have any tips on how to proceed doing this, please let me know.

-bob