size vs. content

deckard

Post by deckard »

Hi - forgive me if this is a silly question, but why does choosing size instead of content produce more results? What criteria are used for content?

Also, why doesn't every file have a hash associated with it? I know I can create hash files (SFV, MD5SUM, etc.), but it seems some files have a hash already embedded.

Would it be better to embed a hash into the files I want to scan (how does one do that anyway?), or would that defeat the purpose as then they would all be unique?

Thanks!
therube
Posts: 615
Joined: Tue Jun 28, 2011 4:38 pm

Re: size vs. content

Post by therube »

> why does choosing size instead of content produce more results?

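Because size is the looser test: any two files with the same byte count match on size, whether or not their contents are identical. A content match requires the data itself to be the same, so it returns fewer results.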

You can have an apple & a pear that weigh the same (same size), yet their contents differ: one gives you applesauce, the other pear juice.
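
To make the apple & pear point concrete, here is a minimal Python sketch. It is not how Duplicate Cleaner works internally; the file names, the in-memory `files` dictionary, and both helper functions are made up for illustration.

```python
from collections import defaultdict

def group_by_size(files):
    """Group names by byte length only; the contents are never inspected."""
    groups = defaultdict(list)
    for name, data in files.items():
        groups[len(data)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]

def group_by_content(files):
    """Group names only when the bytes themselves are identical."""
    groups = defaultdict(list)
    for name, data in files.items():
        groups[data].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]

# Hypothetical files: all three are 10 bytes long, but only two are true copies.
files = {
    "apple.txt": b"applesauce",
    "pear.txt": b"pear juice",
    "apple_copy.txt": b"applesauce",
}

print(group_by_size(files))     # one group containing all three names
print(group_by_content(files))  # one group containing only the two apple files
```

Matching on size puts all three in one group, while matching on content drops the pear, which is why a size scan reports more "duplicates".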

> What criteria are used for content?

Depending upon the method selected, either a hash of some sort (MD5, or whatever) or a byte-by-byte comparison. A hash is a fixed-length value computed from the file's data. Collisions (different files producing the same hash, i.e. a false match) are known to occur for some methods, though they are very unlikely & probably not an issue for the intended uses of Duplicate Cleaner. A byte-by-byte comparison instead checks the corresponding bytes of each file against each other to prove equality.
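
For the curious, the two content methods can be sketched in Python with the standard library alone. This is only an illustration of the general idea, not Duplicate Cleaner's actual code; the function names and the 64 KiB chunk size are my own choices.

```python
import filecmp
import hashlib

def file_md5(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def same_by_hash(path_a, path_b):
    """Treat two files as duplicates when their MD5 digests are equal."""
    return file_md5(path_a) == file_md5(path_b)

def same_by_bytes(path_a, path_b):
    """Compare the actual file contents; shallow=False disables the stat-only shortcut."""
    return filecmp.cmp(path_a, path_b, shallow=False)
```

The hash route pays off when many files have to be compared against each other, since each file is read once and only the short digests are compared. The byte-by-byte route has no collision risk, but it has to read both files for every pair it checks.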
deckard

Re: size vs. content

Post by deckard »

Thank you very much for the clarification. While the answer does appear obvious, I was confused because I have used other duplicate scanners that search by "size", yet they must really have been doing some form of content scan, as their results differ from Duplicate Cleaner's.