
Re-Scan MD5 Duplicates, byte-to-byte

Posted: Wed May 31, 2017 5:48 pm
by Duplicate
Hi

I'm using the latest version of Pro and was wondering whether it is possible to scan for duplicates using the fast MD5 method, and then have the option to re-scan the identified duplicates using the byte-to-byte method?

Thanks

Re: Re-Scan MD5 Duplicates, byte-to-byte

Posted: Thu Jun 01, 2017 3:06 am
by therube
You can change the scan method: MD5 or other hash methods, or byte-to-byte.
But you cannot do an MD5 compare & then directly send that list to be scanned byte-to-byte.

In any case, IMO, if a file compares MD5 or SHA-1 or SHA-256...
While there are potential clashes...
The chance of, much less the importance of...

Sending a spaceship to the moon is one thing.
Getting rid of duplicate MP3s is another.

Also, benchmark the various comparison methods.
There might be times where one method works out better than another, depending on the source files.

viewtopic.php?p=6128#p6128


Some time back, I ran some "hash speed tests" (not with DC).

With many thousands (80K) of small files < 400KB, all hashes (& byte-to-byte) benchmarked essentially identically.

With a smaller number (~200) of larger files (4-20 MB range), a hash comparison was much faster than byte-to-byte.
(In the hash, SHA-1, case much of the work was done in RAM, rather than the hugely expensive "disk access" that byte-to-byte was requiring.)
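
For anyone who wants to run a similar speed test on their own files, here is a rough sketch in Python. It is not how DC measures anything and not the tool used for the tests above; the function names `hash_file` and `time_comparisons` are made up for illustration. Note that the operating system's file cache will favour whichever method reads the files second, so treat the numbers as a rough guide only.

```python
import filecmp
import hashlib
import time


def hash_file(path, algorithm="sha1", chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large files are never fully loaded into RAM."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def time_comparisons(path_a, path_b, algorithms=("md5", "sha1", "sha256")):
    """Return {method: (seconds, files_match)} for each hash and for byte-to-byte."""
    timings = {}
    for name in algorithms:
        start = time.perf_counter()
        same = hash_file(path_a, name) == hash_file(path_b, name)
        timings[name] = (time.perf_counter() - start, same)

    # filecmp remembers earlier results; clear them so the files are really re-read.
    filecmp.clear_cache()
    start = time.perf_counter()
    same = filecmp.cmp(path_a, path_b, shallow=False)  # shallow=False compares contents
    timings["byte-to-byte"] = (time.perf_counter() - start, same)
    return timings
```

Calling `time_comparisons("a.mp3", "b.mp3")` (hypothetical file names) returns one timing per method, which you can repeat over a batch of small files and a batch of large files to see whether the pattern described above holds on your hardware.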

Re: Re-Scan MD5 Duplicates, byte-to-byte

Posted: Sat Jun 03, 2017 9:21 am
by Duplicate
Thanks, great advice.
Much appreciated :-)

Re: Re-Scan MD5 Duplicates, byte-to-byte

Posted: Fri Nov 10, 2023 11:25 am
by wwcanoer
Old post, but I wish that I could do that too.

One way is to move the duplicate files from source 2 to a new "deleted" folder and then re-run the scan, comparing source 1 to the "deleted" folder with the byte-to-byte method. The result should be that everything in the deleted folder is a duplicate, so you select all duplicates in that folder, delete them, and the folder should be empty. If there are any files left, it would be really interesting to see the files with the same size and MD5 that aren't the same.
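
Outside of DC, the same two-stage idea (group by size and MD5 first, then confirm each match byte-to-byte) can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not a DC feature; `md5_of` and `find_duplicates` are hypothetical helper names, and it only reports pairs, it does not delete anything.

```python
import filecmp
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """MD5 of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Stage 1: group files by (size, MD5). Stage 2: verify each group byte-to-byte.

    Returns (confirmed, suspicious): pairs that match byte-to-byte, and pairs
    with the same size and MD5 whose contents nevertheless differ.
    """
    by_key = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            by_key[(path.stat().st_size, md5_of(path))].append(path)

    confirmed, suspicious = [], []
    for paths in by_key.values():
        first, rest = paths[0], paths[1:]
        for other in rest:
            # shallow=False forces a real content comparison, not just a stat check.
            if filecmp.cmp(first, other, shallow=False):
                confirmed.append((first, other))
            else:
                suspicious.append((first, other))
    return confirmed, suspicious
```

On a normal fileset the `suspicious` list should come back empty; anything that does land there is a same-size MD5 collision and worth a closer look before deleting.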

Re: Re-Scan MD5 Duplicates, byte-to-byte

Posted: Mon Nov 13, 2023 10:33 am
by DigitalVolcano
It would be super rare to have an MD5 collision in a normal fileset. I do have a couple of JPGs that contain different pictures but have the same MD5. They were deliberately created using a supercomputer though!