Byte to Byte vs MD5 ; Regular vs Image mode

Post by **dcoberlin** » Tue Jun 17, 2014 7:25 pm

I am new to Duplicate Cleaner. I have been using a different program for some years. I am scanning duplicate image files. In regular mode, using MD5 content comparison, I have identified 149,000 duplicate files. I am ready to delete 81,000 of them, i.e. 166 GB, but want to ask if there is any reasonable risk that I am about to delete files erroneously identified as duplicates. What say? Would image mode or byte to byte offer any benefit?

Post by **therube** » Tue Jun 17, 2014 7:40 pm

> is any reasonable risk I am about to delete files erroneously identified as duplicate

There will always be some risk with MD5, but not any reasonable risk IMO.
MD5 is known to allow collisions, & you can find test files that show as dups but in fact are not. But the chance that you or I have any "real" files that are false dups? I'd say none.

I wouldn't bother at this point, but you could run some benchmarks & see whether MD5 or byte-to-byte (or other types) are actually faster or slower than one another.

> Would image mode ... offer any benefit?

Knowing that you have true dups, a Regular Mode scan will be far more efficient than any other type.
So use Regular Mode first, to get rid of the "easy" stuff faster. Then, if you wish to dig further, I'd look at some of the other modes (Image in this case).

Post by **jackThom** » Tue Jun 17, 2014 8:49 pm

The answer to your question requires a little explanation. This will also help you determine which mode to use for specific tasks in the future.

The difference between image mode and regular mode is that regular mode simply compares the files themselves. It will only find duplicates according to things like filename and date OR alternatively, if you specify, by actual byte content.

Image Mode actually looks at the picture and can help you find duplicates of the same image, even if the files themselves are different (e.g. same picture but: different size/resolution, different orientation (upside down, mirror image, etc.), different color scheme, etc.).

So the question you have to answer is, what are you concerned about:

-Exact-exact files (i.e. files that are no different at all. Exact same, byte for byte)
-Files that simply have the same name, timestamp, and size
-Images that are the same picture, but may have different characteristics (e.g. one is black & white, one is color....or one is a thumbnail size, and another is full resolution size)

As for MD5 and byte-to-byte:

MD5 is a hash algorithm. Basically what happens is the file is read byte by byte and run through the algorithm to spit out a fixed-length value (usually shown as a string of hexadecimal digits). This value is supposed to function like a fingerprint. It is intended to be unique to every specific file. The idea is, if even the slightest piece of one file is different from another file (say the two files are exactly the same except for one single byte), the resulting values will be totally different.
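To make the fingerprint idea concrete, here is a minimal sketch using Python's standard `hashlib` module. The two byte strings are made-up stand-ins for file contents that differ in a single byte; nothing here is specific to Duplicate Cleaner.

```python
import hashlib

# Two made-up "file contents" that differ in exactly one byte (d -> c).
original = b"The quick brown fox jumps over the lazy dog"
altered = b"The quick brown fox jumps over the lazy cog"

digest_a = hashlib.md5(original).hexdigest()
digest_b = hashlib.md5(altered).hexdigest()

print(digest_a)
print(digest_b)

# An MD5 digest is always 128 bits (32 hex digits), whatever the input size.
print(len(digest_a) * 4, "bits")

# Count how many of the 32 hex digits differ between the two digests.
differing = sum(1 for x, y in zip(digest_a, digest_b) if x != y)
print(differing, "of", len(digest_a), "hex digits differ")
```

The two digests have the same length but bear no visible resemblance to each other, which is the behaviour described above.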

Of course, as you might imagine, because hash algorithms are designed to spit out a value with a specific bit length no matter how big the file itself is (MD5 is 128-bit), you run the risk of a "collision": two different data sources that spit out the same value. The odds of this happening by accident are very low, but once someone finds a practical way to produce collisions on purpose (as has happened with MD5), the hash is said to be "broken" and is no longer considered safe to use for cryptographic purposes and the like. (Accidental collisions also become less likely as the value gets longer, which is one reason hash functions with longer values are considered more secure than shorter ones.)

Of course, for the purpose of deduplication, the risk is even less of a problem, for one, because you're not trying to secure data, just determine if you have duplicates...but also, the odds that you'll find a collision in your specific file search set are basically zero.
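For a rough sense of "basically zero", here is a back-of-the-envelope sketch. It assumes MD5 outputs behave like uniformly random 128-bit values and that nobody has deliberately crafted colliding files; under those assumptions the chance of any collision is at most the number of file pairs divided by 2^128. The function name is made up for illustration, and 149,000 is the file count from the original question.

```python
from fractions import Fraction

def accidental_collision_bound(file_count: int, hash_bits: int = 128) -> Fraction:
    """Upper bound on the chance that any two of `file_count` distinct files
    share a hash, assuming uniformly random, non-adversarial hash values."""
    pairs = file_count * (file_count - 1) // 2  # number of distinct file pairs
    return Fraction(pairs, 2 ** hash_bits)      # union bound over all pairs

bound = accidental_collision_bound(149_000)
print(float(bound))  # on the order of 1e-29
```

The bound does not cover files that were built on purpose to collide, which is the case the "broken" label refers to.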

However, unless you really need a hash value for some reason, you might as well just run a byte-to-byte comparison: it eliminates the risk of a collision, and when two files are compared directly it does less work too.

Byte-to-byte:-
- iteratively reads all bytes from File A
- iteratively reads all bytes from File B
- compare read bytes from A and B

MD5 hashing algorithm:-
- iteratively reads all bytes from File A
- Computes File A Hash
- iteratively reads all bytes from File B
- Computes File B Hash
- Compares File A and File B hash
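The two procedures above can be sketched side by side in Python. This is a naive illustration, not how Duplicate Cleaner is implemented: the function names are made up, and both versions read each file fully into memory for simplicity.

```python
import hashlib
import os
import tempfile

def same_by_bytes(path_a, path_b):
    """Byte-to-byte: read both files and compare the bytes directly."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return fa.read() == fb.read()

def md5_of(path):
    """Read a file and compute its MD5 digest."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def same_by_md5(path_a, path_b):
    """MD5: hash each file, then compare only the two digests."""
    return md5_of(path_a) == md5_of(path_b)

# Small demo on throwaway files: a and b are identical, c differs in its last byte.
with tempfile.TemporaryDirectory() as tmp:
    contents = {"a": b"x" * 1000, "b": b"x" * 1000, "c": b"x" * 999 + b"y"}
    paths = {}
    for name, data in contents.items():
        paths[name] = os.path.join(tmp, name + ".bin")
        with open(paths[name], "wb") as f:
            f.write(data)

    bytes_ab = same_by_bytes(paths["a"], paths["b"])
    bytes_ac = same_by_bytes(paths["a"], paths["c"])
    md5_ab = same_by_md5(paths["a"], paths["b"])
    md5_ac = same_by_md5(paths["a"], paths["c"])

print(bytes_ab, bytes_ac)
print(md5_ab, md5_ac)
```

Both approaches read every byte of both files; the MD5 version adds the hashing step on top before it can compare anything.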

Not only does the hash computation consume more CPU power, but hash comparison is also not a 100% guarantee that the files are equal, since a hash collision is a possibility.

That being said, there are a few things that can be done to ensure better or faster byte-by-byte data checking:-

-Have it stop checking between two files on first discovery of inequality
-Read more bytes per block
-If the compared file sets are on different physical disks, multithread reads

(It's possible these have already been implemented in Duplicate Cleaner, and the optimum read bytes per block is already in place, but you get the idea.)
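A minimal sketch of the first two ideas (stop at the first difference, read bigger blocks) might look like this in Python. The function name and the 1 MiB default block size are assumptions for illustration, not Duplicate Cleaner's actual settings. The up-front size check is an extra shortcut that is not in the list above, and multithreaded reads are left out to keep the sketch short.

```python
import os
import tempfile

def files_identical(path_a, path_b, block_size=1024 * 1024):
    """Compare two files block by block, stopping at the first mismatch."""
    # Extra shortcut (an assumption, not from the list above):
    # files of different sizes can never be identical.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a = fa.read(block_size)
            block_b = fb.read(block_size)
            if block_a != block_b:
                return False   # first inequality found: stop reading
            if not block_a:
                return True    # both files ended with no difference

# Demo on throwaway files; the tiny block size just forces several iterations.
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "first.bin")
    copy = os.path.join(tmp, "copy.bin")
    changed = os.path.join(tmp, "changed.bin")
    for path, data in [(first, b"abc" * 500), (copy, b"abc" * 500),
                       (changed, b"abd" + b"abc" * 499)]:
        with open(path, "wb") as f:
            f.write(data)

    identical_copy = files_identical(first, copy, block_size=64)
    identical_changed = files_identical(first, changed, block_size=64)

print(identical_copy, identical_changed)
```

In the `changed` case the difference sits in the first block, so the function returns after reading 64 bytes from each file instead of all 1,500.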

So just to reiterate:

-Use byte-to-byte to find exact exact duplicates of files.
-Use image mode (and the various parameters) to find duplicates of pictures (even if the pictures are different sizes or different color schemes, etc.)

You can even set just how similar the images should be (e.g. find duplicate images that are 85% similar). You might play around with those settings in the Image mode and see how their results differ.
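To give a feel for how a "percent similar" setting can work, here is a toy sketch of one common technique, an average hash. This is not Duplicate Cleaner's actual image algorithm, which this thread does not describe. The 4x4 grids are made-up grayscale thumbnails (0 to 255), and all names are hypothetical; a real tool would first decode each image and shrink it to a small grid like this.

```python
def average_hash(pixels):
    """One bit per pixel: 1 if the pixel is brighter than the image's mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]

def similarity(hash_a, hash_b):
    """Fraction of matching bits between two equal-length hashes."""
    matches = sum(1 for a, b in zip(hash_a, hash_b) if a == b)
    return matches / len(hash_a)

# A made-up thumbnail with alternating dark and bright columns.
image = [
    [10, 200, 30, 220],
    [15, 210, 25, 230],
    [12, 190, 35, 215],
    [18, 205, 28, 225],
]
brighter = [[min(255, v + 20) for v in row] for row in image]  # same picture, lighter
inverted = [[255 - v for v in row] for row in image]           # photographic negative

threshold = 0.85  # e.g. "85% similar"
score_brighter = similarity(average_hash(image), average_hash(brighter))
score_inverted = similarity(average_hash(image), average_hash(inverted))

print(score_brighter, score_brighter >= threshold)
print(score_inverted, score_inverted >= threshold)
```

The brightened copy has completely different bytes from the original but the same light/dark pattern, so it passes the threshold. The negative does not, which shows why results change as the similarity settings and matching options are adjusted.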

Post by **dcoberlin** » Wed Jun 18, 2014 3:11 am

Thanks jackThom. I was anticipating slower function with byte to byte. As for image vs regular, I am certainly safe in regular mode, as I might want to save varied versions of the same image.
I am ready to push the delete key!

DigitalVolcano Software Support
