question: why is md5 default and not byte-to-byte



Post by Emerson »

Hey DV, I notice that md5 is set as the default option for how files are compared. Isn't byte-to-byte typically faster AND more accurate, since it doesn't involve any of the hashing overhead and you can use tricks (like reading from somewhere in the middle or end of the file to short-circuit non-identical files)? Anyway, I'm just reporting bugs/questions/observations as I use DC2 because it's by far my favorite dupe checker, and it really is the little things that make a good program great (and DC2 is well on its way to being great) :) Thanks again for the great program you've provided.

Post by Fool4UAnyway »

For any number of files to be compared for duplicates, I think it is easier to (first) check if their MD5s are the same. If they are not, their contents definitely are not the same.

The same tricks could be applied to MD5s: just scan parts of the files and compare the MD5s of those parts. Again, this requires less storage and fewer re-reads than comparing the complete sets of bytes the MD5s are based on.
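
Something like this rough Python sketch, for instance (just an illustration of the idea, not anything DC2 actually does; the chunk size is arbitrary):

Code:

import hashlib
from collections import defaultdict

CHUNK = 64 * 1024  # arbitrary size of the partial read

def partial_md5(path, size=CHUNK):
    # MD5 of only the first `size` bytes of the file
    h = hashlib.md5()
    with open(path, "rb") as f:
        h.update(f.read(size))
    return h.hexdigest()

def full_md5(path, block=1 << 20):
    # MD5 of the whole file, read in blocks
    h = hashlib.md5()
    with open(path, "rb") as f:
        for data in iter(lambda: f.read(block), b""):
            h.update(data)
    return h.hexdigest()

def narrow_down(same_size_paths):
    # Group same-sized files by partial MD5; only groups with more than
    # one member need the full-file MD5 (or a byte-to-byte check).
    groups = defaultdict(list)
    for p in same_size_paths:
        groups[partial_md5(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]

Only the files that survive narrow_down() would ever need a full read.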

Post by Emerson »

Which is why I'm asking why he chose md5. The best option would be an algorithm that adaptively chooses the fastest comparison mode... so, for example, if you have a large bucket of same-sized files it could begin by doing partial md5s to avoid having to read the same file multiple times, and then go to byte-to-byte only for a full comparison of files with matching partial md5 fingerprints. If there are only two same-sized files, then a byte-to-byte comparison will always be faster and more accurate.
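
Roughly this kind of thing, in Python (just a sketch of the strategy I have in mind, not how DC2 does it; the fingerprint size is arbitrary):

Code:

import filecmp
import hashlib
from collections import defaultdict

def partial_fingerprint(path, size=64 * 1024):
    # MD5 of just the first `size` bytes - a cheap pre-filter
    h = hashlib.md5()
    with open(path, "rb") as f:
        h.update(f.read(size))
    return h.hexdigest()

def dupes_in_bucket(same_size_paths):
    # Two files only: a straight byte-to-byte comparison wins.
    if len(same_size_paths) == 2:
        a, b = same_size_paths
        # shallow=False forces an actual content comparison
        return [(a, b)] if filecmp.cmp(a, b, shallow=False) else []

    # Bigger bucket: cheap partial fingerprints narrow things down
    # first, then byte-to-byte only for files whose fingerprints match.
    groups = defaultdict(list)
    for p in same_size_paths:
        groups[partial_fingerprint(p)].append(p)

    dupes = []
    for candidates in groups.values():
        for i in range(len(candidates)):
            for j in range(i + 1, len(candidates)):
                if filecmp.cmp(candidates[i], candidates[j], shallow=False):
                    dupes.append((candidates[i], candidates[j]))
    return dupes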

I'm not sure how the overhead of doing cryptographic calculations to generate file hashes compares to the time required to do file reads (particularly if you have a file cached in memory). Would love to hear back from DV on how he implements the hash comparison options at the moment, and what he thinks about an "adaptive" or "smart" hash/byte-to-byte hybrid option.

Post by Fool4UAnyway »

You would _only_ have a benefit if you are _only_ comparing two files when using the byte to byte comparison method.

If you only have two files to compare, you might as well run the FC DOS command. I guess that's even faster.

Post by Emerson »

"You would _only_ have a benefit if you are _only_ comparing two files when using the byte to byte comparison method."

Implicit in your statement is that the overhead from calculating a hash (which is not exactly trivial: http://en.wikipedia.org/wiki/MD5#Pseudocode) is less time-consuming than reading the contents of a file into an array (if you cache files or chunks of files in memory, the I/O bottleneck of reading a file is severely mitigated).

Hash comparisons typically require that, for every file, you read the *entire* file AND perform a series of nontrivial *hash calculations* before you can tell whether two files are identical or not. Byte-to-byte comparison requires that you run down two file streams instead of just one, but it can short-circuit for non-identical files at the first non-matching byte (rather than having to read to EOF and perform considerable additional calculations, as you do with a hash comparison).
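
The short-circuit I mean is basically this (a quick Python sketch, not anyone's actual code; the chunk size is arbitrary):

Code:

def identical(path_a, path_b, chunk=1 << 20):
    # Byte-to-byte comparison that bails out at the first differing
    # chunk, so non-identical files never have to be read to EOF.
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a = fa.read(chunk)
            block_b = fb.read(chunk)
            if block_a != block_b:
                return False   # short-circuit on the first mismatch
            if not block_a:    # both streams reached EOF together
                return True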

If two same-sized but non-identical files differ before the 50% point, then you have read *less* data in total (< 50% from each file) than you would have with a hash comparison.

Without knowing the time cost of doing the hash calculations vs. the time cost of doing a file read (and keeping in mind that memory caching is going on), I don't think it's a given that byte-to-byte is *only* faster when comparing two files. I'll do some benchmarking later. :P

Post by Fool4UAnyway »

I think you are not reading, or not trying to understand, what I wrote before. This so-beloved byte-to-byte comparison "cheat" of yours is also applicable to hash calculations: you can perform them on any part of the file.

Next to that, it seems your interest is mostly in comparing two individual files instead of finding duplicates within any number of files. You can easily store a large number of MD5s in memory, as opposed to the complete contents of a couple of files.

You may rethink your logic or just look around in the Duplicate Cleaner documentation to see why it works as it works... and it works!

Post by Fool4UAnyway »

Did some re-search myself. Just read this short thread in this same forum and perhaps you'll be amazed:

"Suggestion for md5/size and links"
http://www.digitalvolcano.co.uk/forum/v ... 8&topic=55

Post by DV »

DC does optimize the hash checking (as linked above). I think there is a lot of scope with byte-to-byte checking however, and it could potentially be a lot faster, as Emerson mentioned.
The byte-to-byte option was dropped into DC2 relatively late in development and isn't particularly optimised yet. It caches one file in memory and checks it against the others of the same size, dropping out when a different byte is met.
The hybrid approach sounds interesting - will investigate.
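
For the curious, the current byte-to-byte pass amounts to something like this (Python, purely to illustrate the idea - not the actual DC2 code):

Code:

def matches_against(reference_path, same_size_paths, chunk=1 << 20):
    # Cache the reference file's contents in memory, then stream each
    # same-sized candidate against it, dropping out at the first
    # differing chunk.
    with open(reference_path, "rb") as f:
        reference = f.read()

    matches = []
    for path in same_size_paths:
        offset = 0
        is_match = True
        with open(path, "rb") as f:
            while True:
                block = f.read(chunk)
                if not block:
                    break
                if reference[offset:offset + len(block)] != block:
                    is_match = False   # drop out on the first mismatch
                    break
                offset += len(block)
        if is_match and offset == len(reference):
            matches.append(path)
    return matches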

Post by Fool4UAnyway »

How much of a file is cached in memory for the byte-to-byte comparison? Files can be very large, of course...

In case of a mismatch, you might even "branch" the original and the different file's contents at the point of the first difference, to allow a quicker (and smaller) comparison of variations of similar files.

You could also reduce the number of comparisons between pairs of files by first sorting them according to their first (however many) bytes. This is a very simple approach, so I would definitely not be surprised if DC already performs it as a first step.
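
Something as simple as this, for instance (a Python sketch, just to illustrate; the prefix length is arbitrary, and I have no idea whether DC does anything like it):

Code:

from collections import defaultdict

def bucket_by_prefix(same_size_paths, prefix_len=4096):
    # Group same-sized files by their first `prefix_len` bytes; full
    # comparisons are then only needed within each bucket.
    buckets = defaultdict(list)
    for path in same_size_paths:
        with open(path, "rb") as f:
            buckets[f.read(prefix_len)].append(path)
    return [b for b in buckets.values() if len(b) > 1]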