size vs. content

Post by **deckard** » Thu Oct 20, 2011 9:03 pm

Hi - forgive me if this is a silly question, but why does choosing size instead of content produce more results? What criteria are used for content?

Also, why doesn't every file have a hash associated with it? I know I can create hash files (SFV, MD5SUM, etc.), but it seems some files have a hash already embedded.

Would it be better to embed a hash into the files I want to scan (how does one do that anyway?), or would that defeat the purpose as then they would all be unique?

Thanks!

Post by **therube** » Fri Oct 21, 2011 5:28 pm

> why does choosing size instead of content produce more results?

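Just for that very reason: size is a looser test than content. Files with identical content always have the same size, but files of the same size need not have the same content.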

You can have an Apple & a Pear that both weigh the same (same size), yet the contents, Apple (applesauce) & Pear (pear juice), are different.
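To make the Apple & Pear point concrete, here is a minimal Python sketch. It is only an illustration, not how Duplicate Cleaner works internally; the file names, the sample data, and the `group_by` helper are all made up for the example.

```python
from collections import defaultdict

def group_by(items, key):
    """Group (name, data) pairs by key(data); keep only groups with 2+ members."""
    groups = defaultdict(list)
    for name, data in items:
        groups[key(data)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]

# Hypothetical in-memory "files": all three are 10 bytes long.
files = [
    ("apple.txt", b"applesauce"),
    ("pear.txt", b"pear juice"),        # same size, different content
    ("apple_copy.txt", b"applesauce"),  # same size, same content
]

by_size = group_by(files, key=len)                   # match on size only
by_content = group_by(files, key=lambda data: data)  # match on the bytes themselves

print(by_size)     # all three files land in one group
print(by_content)  # only the two applesauce files are grouped
```

Matching on size puts the pear in with the apples, which is why a size scan reports more results than a content scan.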

> What criteria are used for content?

Depending upon the method selected, either a hash of some sort (MD5, or whatever) or a byte-by-byte comparison. A hash is a fixed-length value computed from the file's data. Collisions (two different files producing the same hash, i.e. a false match) are known to occur for some methods, though they are still very unlikely & probably not an issue for the intended usages in Duplicate Cleaner. Byte-by-byte, by contrast, compares the respective bytes from each file to prove equality.
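The two content methods can be sketched in Python roughly as follows. This is an assumption-laden illustration using the standard `hashlib` and `filecmp` modules, not Duplicate Cleaner's actual code; the helper names (`md5_of`, `same_by_hash`, `same_byte_by_byte`) and the temporary files are invented for the example.

```python
import filecmp
import hashlib
import os
import tempfile

def md5_of(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def same_by_hash(a, b):
    """Hash method: equal digests are treated as equal content."""
    return md5_of(a) == md5_of(b)

def same_byte_by_byte(a, b):
    """Byte-by-byte method: shallow=False makes filecmp compare the contents."""
    return filecmp.cmp(a, b, shallow=False)

# Hypothetical demo files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    paths = {}
    for name, data in [("apple", b"applesauce"),
                       ("pear", b"pear juice"),
                       ("copy", b"applesauce")]:
        paths[name] = os.path.join(tmp, name)
        with open(paths[name], "wb") as fh:
            fh.write(data)

    hash_match = same_by_hash(paths["apple"], paths["copy"])
    hash_mismatch = same_by_hash(paths["apple"], paths["pear"])
    bytes_match = same_byte_by_byte(paths["apple"], paths["copy"])
    bytes_mismatch = same_byte_by_byte(paths["apple"], paths["pear"])
```

A hash has to be computed only once per file and can then be compared against many others, while a byte-by-byte check rereads both files for every pair but cannot be fooled by a collision.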

Post by **deckard** » Fri Oct 21, 2011 6:46 pm

Thank you very much for the clarification. While the answer does appear obvious, I was confused because I have used other duplicate scanners that perform a search by "size", yet they really must have been doing some form of content scan, as their results are different from Duplicate Cleaner's.

DigitalVolcano Software Support
