feature suggestion: comparison source/destination

Post by Emerson »

At the moment, it seems DC2 (and DC1) check for duplicates among every file in the selected paths. Being able to select a source (the files you want to find duplicates of) and a destination (the files you want to compare the source files against) would considerably speed up the file comparison phase when dealing with very large datasets.

Consider the following scenario:
- you have a datastore of 1 million images
- you run DC2 on the full datastore to dedupe or hardlink the images (you only need to do this once).
- you add a new picture folder from your latest photoshoot to the datastore with 10 images

Rather than running DC2 on the *entire* set of 1M + 10 images, if you could specify that you only want DC to compare the 10 *new* images against the 1M *already* deduped images, the comparison algorithm could run much, much faster by dropping any images whose filesizes don't match any of the ten images being compared.
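Just to sketch what I mean (a rough Python illustration only; the folder paths and function names are made up, and I'm not claiming this is how DC works internally): index the already-deduped destination by file size once, then only content-compare the handful of new source files that actually hit a size match.

[code]
# Hypothetical sketch of the suggested source/destination pre-filter.
# Paths and function names are illustrative, not anything from DC itself.
import os
from collections import defaultdict

def index_by_size(root):
    """Map file size -> list of paths for every file under root."""
    index = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            index[os.path.getsize(path)].append(path)
    return index

def candidate_matches(source_root, destination_index):
    """Yield (new_file, existing_files) pairs that share a file size and so
    need a full content comparison; everything else is skipped outright."""
    for dirpath, _dirnames, filenames in os.walk(source_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            existing = destination_index.get(os.path.getsize(path))
            if existing:
                yield path, existing

# Build the big index once, then only the 10 new files are walked:
# dest = index_by_size("/datastore")
# for new_file, possible_dupes in candidate_matches("/new_photoshoot", dest):
#     ...hash or byte-compare new_file against possible_dupes only...
[/code]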

That being said, thanks for the great program and for being responsive to your users.

Post by DV »

Thanks Emerson, this is something I've thought about before but never tried to implement - I will revisit it.

Post by NetNomad »

I have a similar issue with a large multi-terabyte collection of video files housed on multiple drives. Files range from 1 MB to 2 GB, with files of 200 MB to 1 GB comprising the majority by space. My task is to identify any unique files among hundreds of duplicates acquired in batches from multiple sources.

What I want is a way to save the hash values from the main collection. This should significantly speed up the comparison process.

Take a look at Microsoft SyncToy v2.1. SyncToy creates a hidden database file that retains the hash, modified date & time, file length, path, & file name. I use this tool to sync archives with local backups; that comparison takes a small fraction of the time DC does.

What I would like to see in DC is an option to save the calculated hash values for an associated drive or directory tree. Unlike SyncToy, this file should not be saved in the drive or directory path itself, but rather in a user-defined location. I have an issue with SyncToy's hidden in-place files confusing other tools.

In use, this file would be picked up automatically by DC when working with a stored path OR a sub-path of a stored path. DC would check each file's path, name, length, and date and time. If anything had changed it would re-calculate the hash, as it would for any new files, and it would delete any missing files from the database. That logic would handle user renames and moves simply by calculating a new entry and deleting the old one.
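To illustrate the kind of check I have in mind (a rough Python sketch only; the cache layout, field names, and hash choice are my own assumptions, not anything DC actually does):

[code]
# Minimal sketch of a persistent hash cache, assuming a simple dict keyed by
# path that is loaded from / saved to a small file in a user-chosen location.
import hashlib
import os

def file_stats(path):
    st = os.stat(path)
    return {"length": st.st_size, "mtime": st.st_mtime}

def cached_hash(path, cache):
    """Return the stored hash if the file's length and modified time still
    match; otherwise re-calculate, update the cache entry and return it."""
    entry = cache.get(path)
    stats = file_stats(path)
    if entry and entry["length"] == stats["length"] and entry["mtime"] == stats["mtime"]:
        return entry["hash"]
    digest = hashlib.md5()                       # hash choice is illustrative
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    cache[path] = {"hash": digest.hexdigest(), **stats}
    return cache[path]["hash"]

def prune_missing(cache):
    """Drop entries whose files were renamed, moved or deleted."""
    for path in [p for p in cache if not os.path.exists(p)]:
        del cache[path]

# Load the cache dict at the start of a scan, call cached_hash() per file,
# call prune_missing() at the end, then save the dict back to disk.
[/code]

A rename or move would simply show up as one missing entry to prune and one new entry to calculate, which is the behaviour described above.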

For the interface: a simple check option to automatically retain values for common paths and a box to enter those paths in, as well as an option to specify a storage directory other than the default (a sub-directory in the DC program folder). There should also be an on-screen notice so the user realizes the retained values are being used.

I hope you understand what I am asking. Anything to shorten my multi-hour overnight checks.

myname at ymail

Post by Emerson »

Long time no post. I hope everyone's holidays have been going well, and everyone enjoys the new year!

That said, I was curious if you've given any further consideration to this functionality. I think for people that have to routinely merge files/folders from disparate sources, being able to specify that they only want to compare the *new* content to the already deduped *old* content would make a huge difference.

Post by Fool4UAnyway »

This looks more like a two-folder comparison than a one-to-many folder comparison. Perhaps just using a folder comparison tool is a good way, or at least a start, to ultimately get what you want.

When trying to merge differences or different versions, you will have to look at the contents, anyway. So you will have to actually perform the file comparisons as well.

Post by Emerson »

Most folder comparison utilities compare folders based on the folders having similar/identical folder hierarchies. What I'm talking about is being able to point at a folder (or folders) and say: take everything in this set of folders, including subfolders, and find any duplicates between this set of files and anything in this other set of folders, including subfolders.

I'm not just interested in the contents of identical folders, or merging identical sources. If I'm merging a large music collection from some computer, spread across various folders and subfolders, with a centralized music collection that's already been fully deduped, my suggestion is to avoid the redundant and unnecessary work of comparing the known-to-be-deduped files against each other, and to only take into consideration the "new" files that have not been scanned already. Does that make more sense with that example?
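To put the reporting rule in concrete terms (a rough Python sketch; the set labels and the idea of a precomputed content key are just assumptions for the example), the point is that a duplicate group only matters if it spans both sets, so the deduped library is never flagged against itself:

[code]
# Minimal sketch: group files by a content key as usual, but only report
# groups that contain files from more than one set. Labels are made up.
from collections import defaultdict

def cross_set_groups(files_by_set):
    """files_by_set maps a set label ("new", "library") to {path: content_key}.
    Returns only the duplicate groups containing files from both sets."""
    groups = defaultdict(lambda: defaultdict(list))
    for set_name, files in files_by_set.items():
        for path, key in files.items():
            groups[key][set_name].append(path)
    return {key: members for key, members in groups.items() if len(members) > 1}

# Example: only the "h2" group would be reported, because it spans both sets.
# cross_set_groups({
#     "new":     {"a.mp3": "h1", "b.mp3": "h2"},
#     "library": {"music/x.mp3": "h2", "music/y.mp3": "h3"},
# })
[/code]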

Post by Fool4UAnyway »

Yes, this example is a good demonstration of how a specific adjustment could improve processing speed.

Post by DV »

The problem as Emerson described it is something I hope to add to the next gen of Duplicate Cleaner. :)

Post by Fool4UAnyway »

Perhaps this feature could be extended to apply to any of the folders that are included in the search for duplicates.

One may have multiple libraries that they want to keep complete, while wishing to remove any duplicates from other, less complete or smaller libraries.