Duplicate Cleaner FAQ

The best solution for finding and removing duplicate files.
DV
Posts: 78
Joined: Fri Jun 10, 2011 9:00 am

Duplicate Cleaner FAQ

Post by DV »

The Duplicate Cleaner FAQ is below. Please read it before posting! It is a work in progress, so if anyone has ideas for it, please post them here.

Where to scan - Should I scan my entire hard drive?
It's generally not a good idea to run Duplicate Cleaner over your system files, applications, or anything whose purpose you aren't sure of.
Duplicate Cleaner is at its most useful when used on your own data - documents, photos, music, movies, code files, etc. That way you can make an informed decision about what to delete! Your My Documents folder is a good place to start.

Why is it taking a long time to scan?
Duplicate Cleaner is generally limited by the speed of your hard drives/network drives. If there are a lot of identically sized files the software will have to check each one down to the byte level - this will take time.

Lots of duplicate files have been found - which ones should I delete?
As a general rule you should only delete a file if you know what it is. If you think you might need it then keep it! It is always a bad idea to delete files from your Windows and Program Files folders unless you know what you are doing.

I can't delete a file because it is protected?
This is because you have ticked one of the settings in the Options menu - Protect Windows/Program Files, or Protect DLL/EXE Files. You should only delete these types of files if you know what you are doing.

I get access or permission errors when deleting a file?
Try running Duplicate Cleaner as administrator. In Vista or Windows 7 you can right-click the Duplicate Cleaner icon and select "Run as Administrator".

What is an (MD5, SHA-1) hash?
A hash is like a fingerprint for a file. Duplicate Cleaner compares these 'fingerprints' to help it identify duplicate files.
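For the technically curious, the fingerprinting step can be pictured with a short Python sketch (illustrative only, not Duplicate Cleaner's actual code; the function and file names here are made up):

```python
import hashlib

def file_fingerprint(path, algorithm="md5", chunk_size=1024 * 1024):
    """Return a hex digest ('fingerprint') of a file's contents."""
    h = hashlib.new(algorithm)              # e.g. "md5" or "sha1"
    with open(path, "rb") as f:
        # Read in chunks so even huge files don't need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Two files with the same fingerprint almost certainly have identical content:
# file_fingerprint("holiday1.jpg") == file_fingerprint("copy of holiday1.jpg")
```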
daveme7
Posts: 2
Joined: Sun May 20, 2012 4:16 am

Re: Duplicate Cleaner FAQ

Post by daveme7 »

Better than me - knowing I have duplicates, and guess what, it always comes back with no duplicates found. I guess I'm getting spoiled, expecting a piece of software I download to work as intended.
dijitul
Posts: 4
Joined: Thu Oct 18, 2012 1:53 am

Re: Duplicate Cleaner FAQ

Post by dijitul »

I'd like to better understand how the search/filter procedure works in this application, as it seems to take MUCH longer than competing products on the same set of files.

My theory? Based on what I see happening...

A duplicate search should only calculate hashes AFTER it matches a file's size -- files without matching sizes are certainly not going to have matching hashes. The 1st phase of searching should be generating a file list with all relevant file data (size, path, properties, etc.), and the 2nd phase should be filtering out files without matching file sizes. This ought to leave a much, much smaller set to generate hashes against. (Naturally, this situation doesn't apply where you're comparing image similarities, but that isn't the case here.)
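In rough Python terms, the idea looks something like this (purely illustrative; the function name is mine and this is not a claim about how the application is actually coded):

```python
import os
from collections import defaultdict

def hash_candidates(root):
    """Phase 1: list files with their sizes. Phase 2: keep only size collisions."""
    by_size = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                by_size[os.path.getsize(path)].append(path)
            except OSError:
                pass                      # unreadable entry, skip it
    # Only files sharing a size with another file ever need hashing;
    # everything with a unique size is dropped outright.
    return [p for group in by_size.values() if len(group) > 1 for p in group]
```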

As an example, in a list of 255,000 files it's taking over an hour to process, and at less than 10% into the list it reports having found only 3,500 matches. It shouldn't even be counting any duplicates yet, because there definitely WILL NOT be another 250,000 duplicates' worth in there.

So, either the efficiency of the search/filter process is poor, or the efficiency of the matching algorithm is poor. Whatever it is, it needs revamping. How long should comparing 1 million files containing only 5,000 duplicates take? Only as long as comparing 1 million file sizes and 5,000 hashes!
DigitalVolcano
Site Admin
Posts: 1864
Joined: Thu Jun 09, 2011 10:04 am

Re: Duplicate Cleaner FAQ

Post by DigitalVolcano »

The hashes are only calculated upon a size match. They are cached as the scan goes along (and are only calculated once!). The percentage can be misleading because the scan speeds up as it progresses (more hashes are cached as it goes). The real time killer is populating the results list at the end.
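In simplified terms the behaviour is something like this Python sketch (an illustration of the size-check-then-cached-hash idea, not the program's actual source):

```python
import hashlib
from functools import lru_cache

@lru_cache(maxsize=None)                 # each path is hashed at most once
def cached_digest(path, chunk_size=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def is_duplicate(path_a, size_a, path_b, size_b):
    # The cheap size check comes first; hashing only happens on a size match,
    # and repeat comparisons against the same file reuse the cached digest.
    return size_a == size_b and cached_digest(path_a) == cached_digest(path_b)
```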

At any rate, Duplicate Cleaner 3.1 is currently being re-engineered. List population will be almost instant, memory usage will be lower, and it should cope with huge file sets.
dijitul
Posts: 4
Joined: Thu Oct 18, 2012 1:53 am

Re: Duplicate Cleaner FAQ

Post by dijitul »

I really have to disagree, even knowing nothing about your coding. The hashes are most definitely being calculated for every file regardless.

Currently it's scanning 314,100 of 629,314 files (49.9%) and it's been several hours (I don't even remember what time I started it at this point). The hard drive is going crazy with drive-head movement, and the number is increasing like molasses. It's only found 9,100 duplicate sets so far. This should have been done in 10 minutes, but I'd even settle for 30 minutes.

I won't pretend I've coded something like this before, but if I were to, I'd probably use some sort of multi-phase process like the following (using proven, efficient sorting and data-storage algorithms); a rough sketch of how it might fit together follows the list:

Phase 1:
COLLECT file data (requested file types, file names, folder paths, file sizes, dates, etc.)

Phase 2:
SORT file data by the primary parameter (most likely the file size, but could be dates, or paths or folders, or whatever parameter is key to the following phases).

Phase 3 (splitting into two parts might reduce overall operations, but that would need testing):
Part A) EXAMINE all files sequentially and flag each as a "potential duplicate" when the primary parameter matches the PREVIOUS item in the list.
Part B) REPEAT the examination for ONLY the remaining unflagged items, flagging any as "potential duplicate" when the primary parameter matches the NEXT item in the list.
OR combine the comparisons against both neighbours into one pass, though that might be less efficient because some tests would be duplicated.

Phase 4:
REDUCE flagged items by comparing any secondary or tertiary parameters but NOT hashes, and un-flag when the test fails.

Phase 5:
If content checking is enabled, GENERATE hashes only on remaining flagged items and repeat Phases 2 & 3 against the hashed items only. Un-flag any with unmatched hashes.

Phase 6:
ORGANIZE and PRESENT the remaining flagged items to the user, grouped into duplicate sets.
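Put together, the phases might look roughly like this toy Python sketch (assuming MD5 for the content check; all names are made up, only file size is used as the primary parameter, and this is not a claim about how Duplicate Cleaner itself is built):

```python
import hashlib
import os
from itertools import groupby

def collect(root):
    """Phase 1: gather (size, path) records; dates etc. could be added here."""
    records = []
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                records.append((os.path.getsize(path), path))
            except OSError:
                pass
    return records

def digest(path, chunk_size=1 << 20):
    """Full-content MD5, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def duplicate_groups(root):
    records = sorted(collect(root))                          # Phase 2: sort by size
    groups = []
    for _size, run in groupby(records, key=lambda r: r[0]):  # Phases 3/4: flag size matches
        flagged = [path for _, path in run]
        if len(flagged) < 2:
            continue                                         # unique size, never hashed
        by_hash = {}
        for path in flagged:                                 # Phase 5: hash flagged items only
            by_hash.setdefault(digest(path), []).append(path)
        groups.extend(g for g in by_hash.values() if len(g) > 1)
    return groups                                            # Phase 6: grouped duplicate sets
```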

Just my two cents worth...

Otherwise, thanks for a useful product. It certainly has great selection aids compared to many other competing applications.
DigitalVolcano
Site Admin
Posts: 1864
Joined: Thu Jun 09, 2011 10:04 am

Re: Duplicate Cleaner FAQ

Post by DigitalVolcano »

This is basically what it does - and there are further efficiency improvements in the next version as well.

You must have a lot of similar size (large) files. I'd be interested in the results, though once you hit 500,000 files or so there can be memory problems with the list. Again, all being addressed in 3.1!
dijitul
Posts: 4
Joined: Thu Oct 18, 2012 1:53 am

Re: Duplicate Cleaner FAQ

Post by dijitul »

I didn't perform a count of duplicates based on file-size alone, so I can't answer that question this time.

I would like to add one more consideration for your algorithms - when large files are involved (perhaps over 100 MB, or some configurable limit), allow hashing only small portions of each file (maybe 1 MB from the start, middle, and end) before hashing the entire file. This should increase speed, because if any of those portions differ the program can skip reading potentially gigabytes of data. An example data set that would benefit is a folder of DVD "VOB" files, which tend to be exactly 1 GB in length.
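Something along these lines (an illustrative Python sketch; the 100 MB threshold and all names are just placeholders):

```python
import hashlib
import os

SAMPLE_SIZE = 1 << 20          # 1 MB per sample
LARGE_FILE = 100 * (1 << 20)   # the configurable "large file" threshold

def quick_digest(path):
    """Hash 1 MB from the start, middle and end of a large file.

    A mismatch here rules a pair out without reading gigabytes; a match
    would still need a full-content hash to confirm a duplicate.
    """
    size = os.path.getsize(path)
    if size <= LARGE_FILE:
        return None                        # small file: just full-hash it as usual
    h = hashlib.md5()
    with open(path, "rb") as f:
        for offset in (0, size // 2, size - SAMPLE_SIZE):
            f.seek(offset)
            h.update(f.read(SAMPLE_SIZE))
    return h.hexdigest()
```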

Thank you.