Duplicate Cleaner v1.3 features / wishlist

Post by DigitalVolcano »

Implemented so far:
- XP control styles
- Updated folder tree-style interface
- Display/ignore hardlinked files
- Un-hardlink files
- Export all files to CSV
- Rename files
- Invert selection
- Date sort bug fixed

To Do / Requests:
- Localization (in progress), e.g. Spanish.
(Please let me know if you are interested in translations!)
- Selection Assistant option to 'select all but one in each group, by path'
- It would be even better if I could select a "master folder tree", and possibly a master folder from which nothing could be deleted, with any duplicates in the other folders selected automatically (see the sketch after this list).
- Fix bug where the protected folder name check is case sensitive.
- Remove 'orphan' files from the list when the rest of their duplicate group has been dealt with.
- Add a quick right-click option on files in the duplicate list to "select all duplicate files in the same folder". Once you realize that a particular folder has a lot of duplicate files in it (such as photos), it would be nice to be able to just right-click on one file and have it select all duplicates in the same place.

- After you have run a scan and have a list of duplicate groups on the screen, it would be much easier and less confusing if files that have been deleted disappeared from the screen, along with the ones in their group that were not deleted. Or at least grey them out, so that it is easier to see what you have left to go through. I like to go through the list incrementally, delete some, and then continue, but once files have been moved to the Recycle Bin they aren't duplicates any more and should disappear.

- It would be a nice feature if there were a "just to be sure" option before you emptied the Recycle Bin that would scan each file recently added to it to make SURE that there is another copy of it elsewhere on the computer. Right now, I am scared to empty the Recycle Bin, so a quick scan to make sure I don't delete the last copy of a file by accident would be a nice feature.
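
A minimal sketch of that 'master folder' selection rule, purely as an illustration (Python; the function name, example paths and tie-break are assumptions, not how Duplicate Cleaner actually selects):

# Sketch of the requested rule: never mark anything under the master folder;
# if a master copy exists, mark every copy outside it; otherwise keep one
# copy per group. Illustrative only - names and layout are made up.
import os

def select_for_deletion(group, master_folder):
    """Given one group of duplicate paths, return the ones safe to mark."""
    master = os.path.normcase(os.path.abspath(master_folder)) + os.sep
    inside = [p for p in group
              if os.path.normcase(os.path.abspath(p)).startswith(master)]
    if inside:
        # A protected master copy exists, so everything else can go.
        return [p for p in group if p not in inside]
    # No master copy: keep one file (the first when ordered by path).
    return sorted(group)[1:]

group = [r"C:\photos\2006\img001.jpg", r"D:\backup\img001.jpg"]
print(select_for_deletion(group, r"C:\photos"))   # ['D:\\backup\\img001.jpg']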

Post by Mark Cramer »

Apart from the change of colours in the group column when changing the sort from name to size, and saving the column widths between searches (both of which I posted about before), showing the number of times each file is already hardlinked would be good.
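
(As a rough illustration of where that count could come from: NTFS keeps a per-file hardlink count, which a stat call exposes. Python sketch only - the path is a placeholder.)

# Sketch: st_nlink is the file's hardlink count on NTFS; 1 means no extra links.
import os
print(os.stat(r"C:\some\folder\file.txt").st_nlink)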

BTW, I just did a test and there is a situation that should be tested for, or at least a warning put in the doco.

Suppose \temp\test\a\test.txt exists. An NTFS junction (essentially a directory hardlink, see http://www.microsoft.com/technet/sysint ... ction.mspx) of \temp\test\a is created at \temp\test\d. Run a duplicate file scan, delete one of the two listed "copies", and BOTH are gone, because both paths point at the same underlying file.

DFC either shouldn't scan beyond junction points, or should throw up a warning that you could delete all copies of a file if you delete (or try to hardlink) where a junction exists.
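
A rough sketch of the first option - refusing to descend into junction/reparse points while scanning. Illustrative Python assuming Windows, not how Duplicate Cleaner itself works:

# Sketch only: skip NTFS junctions (and other reparse points) during the walk
# so the same physical file is never listed twice.
import os
import stat

def is_reparse_point(path):
    """True if 'path' is a junction, symlink or other reparse point (Windows)."""
    try:
        attrs = os.lstat(path).st_file_attributes
    except (OSError, AttributeError):
        return False
    return bool(attrs & stat.FILE_ATTRIBUTE_REPARSE_POINT)

def walk_skipping_junctions(root):
    """Yield file paths under 'root' without descending into reparse points."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune junctioned subdirectories in place so os.walk skips them.
        dirnames[:] = [d for d in dirnames
                       if not is_reparse_point(os.path.join(dirpath, d))]
        for name in filenames:
            yield os.path.join(dirpath, name)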

Thanks again.

Mark

Post by Mark Cramer »

And is there any chance of using MD5 sums rather than a CRC, at least as an option? Or do you think a file size and CRC match is reliable enough?

Mark

Post by Diggo »

Thanks for pointing out the above - I'll look into it and see what can be done.
I've not implemented MD5s yet (CRCs are faster). I think they are a bit more reliable than CRC, but I might skip them and look into a byte-by-byte comparison solution instead. Not in the next version though (maybe 1.4?).

Post by Mark Cramer »

?? A 'bit' more reliable? (sorry, Math geek mode ON)

CRCs are 32-bit. If you generate random files, all different, you would need about 77,162 files before you had a 50% chance of two files having different contents but a matching CRC (not a 50% chance of a file matching the last one, but of some two files in the group of 77,162 colliding).

MD5 sums are 128-bit (3.4x10^38 possible values). If you compared the MD5 sums of random, different files, you would need ~2x10^19 different files before some two of them had a 50% chance of matching hashes - 20 million million million. I doubt that many different files will ever be created by mankind; that's about 3 billion files for every person on earth.

Files needed for an x% chance of a 32-bit CRC collision:
30,084 - 10%
43,781 - 20%
55,352 - 30%
66,241 - 40%
88,718 - 60%
101,695 - 70%
117,579 - 80%
140,636 - 90%
160,414 - 95%
198,890 - 99%
2^32+1 - 100%

Ain't Excel and a bit of maths wonderful?
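
(For anyone who wants to reproduce the table: it's the standard birthday-bound approximation n ≈ sqrt(2 * 2^bits * ln(1/(1-p))). A small Python sketch - the function name is made up:)

# Birthday bound: roughly how many random files are needed before the chance
# of at least one b-bit hash collision reaches probability p.
import math

def files_needed(p, bits):
    return math.sqrt(2 * (2 ** bits) * math.log(1 / (1 - p)))

for p in (0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99):
    print(f"{p:4.0%}  {files_needed(p, 32):10,.0f}")      # the CRC-32 table

print(f"MD5 (128-bit), 50%: {files_needed(0.50, 128):.2e} files")   # ~2.2e19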

Post by DigitalVolcano »

You're right...

I've sourced a couple of MD5 algorithms. What I might try and do is add them as an (optional) secondary check for all CRC32 matches. That way there shouldn't be too much of a speed hit.
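
A minimal sketch of that two-stage idea (illustrative Python only - the function names and chunk size are assumptions, not Duplicate Cleaner's code). The MD5 is only computed for files that already agree on size and CRC32, so the extra reads are limited to the candidate groups:

# Sketch: group same-sized files by CRC32, then confirm each group with MD5.
import hashlib
import zlib

def crc32_of(path, chunk=64 * 1024):
    crc = 0
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            crc = zlib.crc32(block, crc)
    return crc & 0xFFFFFFFF

def md5_of(path, chunk=64 * 1024):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def duplicate_groups(same_size_paths):
    """Two-stage check for files already known to share a size."""
    by_crc = {}
    for p in same_size_paths:
        by_crc.setdefault(crc32_of(p), []).append(p)
    confirmed = []
    for candidates in by_crc.values():
        if len(candidates) < 2:
            continue                      # unique CRC - no possible duplicate
        by_md5 = {}
        for p in candidates:              # MD5 pass only for CRC matches
            by_md5.setdefault(md5_of(p), []).append(p)
        confirmed.extend(g for g in by_md5.values() if len(g) > 1)
    return confirmed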

Post by Mark Cramer »

Actually, that sounds like a great idea.

And not bothering to do even the CRC check if the files are already hardlinked together might be a worthwhile speedup too, i.e. only do the CRC if the files match in size and aren't hardlinked. Mind you, since computing the CRC and the MD5 sum both require you to read the files, you'd want to hope they stayed in the cache.
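
A sketch of that shortcut (illustrative Python, placeholder paths): two paths reporting the same volume and file index are already the same file on disk, so there is nothing to hash or link.

# Sketch: hardlinked paths on NTFS share a volume serial number and file index,
# which Python exposes as (st_dev, st_ino).
import os

def already_hardlinked(path_a, path_b):
    a, b = os.stat(path_a), os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)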

And I shouldn't have put this discussion in the feature request thread, sorry.

Post by DigitalVolcano »

No problem :)

Post by Jan Odendaal »

This is a great product!
Q: is there a Vista version?

Post by DigitalVolcano »

Thanks.
It's reported that it works fine on Vista, though I haven't tested it myself.