size vs. content

Posted: Thu Oct 20, 2011 9:03 pm
by deckard
Hi - forgive me if this is a silly question, but why does choosing size instead of content produce more results? What criteria are used for content?

Also, why doesn't every file have a hash associated with it? I know I can create hash files (SFV, MD5SUM, etc.), but it seems some files have a hash already embedded.

Would it be better to embed a hash into the files I want to scan (how does one do that anyway?), or would that defeat the purpose as then they would all be unique?

Thanks!

Re: size vs. content

Posted: Fri Oct 21, 2011 5:28 pm
by therube
> why does choosing size instead of content produce more results?

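For exactly that reason: size is a much looser test than content. Any two files with identical content also have identical size, but two files of identical size need not have identical content.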

You can have an apple & a pear that both weigh the same (same size), yet what is inside them (applesauce in one, pear juice in the other) is different.
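
To make the analogy concrete, here is a rough Python sketch. The two byte strings are made-up stand-ins for file contents, not anything Duplicate Cleaner itself uses; they simply happen to be the same length.

```python
import hashlib

# Hypothetical "file contents": two different byte strings of equal length.
apple = b"applesauce"
pear = b"pear juice"

# A size-only check sees the same number of bytes in both.
same_size = len(apple) == len(pear)

# A content check (here via MD5 digests) sees that the bytes differ.
same_content = hashlib.md5(apple).hexdigest() == hashlib.md5(pear).hexdigest()

print("same size:   ", same_size)
print("same content:", same_content)
```

A size-only scan would report these two as a match, while a content scan would not, which is why the size option lists more results.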

> What criteria are used for content?

Depending upon the method selected, either a hash of some sort (MD5, or whatever) or a byte-by-byte comparison. A hash is a number computed from the file's data. Collisions (two different files producing the same hash, i.e. a false match) are known to occur for some methods, though they are still very unlikely & probably not an issue for the intended usage in Duplicate Cleaner. A byte-by-byte comparison checks the respective bytes from each file against each other to prove equality.
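
For anyone curious what the two approaches look like in practice, below is a minimal Python sketch. It is only an illustration of the general techniques, not how Duplicate Cleaner is actually implemented; the function names `file_digest` and `same_bytes` and the chunk size are my own assumptions.

```python
import hashlib
import os

CHUNK = 64 * 1024  # read size in bytes; an arbitrary choice for this sketch


def file_digest(path, algorithm="md5"):
    """Return the hex digest of a file's contents, reading it in chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            h.update(block)
    return h.hexdigest()


def same_bytes(path_a, path_b):
    """Byte-by-byte comparison: True only if both files are identical."""
    # Different sizes can never have identical content, so stop early.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(CHUNK)
            b = fb.read(CHUNK)
            if a != b:
                return False
            if not a:  # both reads came back empty: end of both files
                return True
```

Two files with equal digests are almost certainly identical; `same_bytes` removes the "almost" at the cost of reading both files in full side by side.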

Re: size vs. content

Posted: Fri Oct 21, 2011 6:46 pm
by deckard
Thank you very much for the clarification. While the answer does appear obvious, I was confused because I have used other duplicate scanners that perform a search by "size", yet they really must have been doing some form of content scan, as their results are different from Duplicate Cleaner's.