Hash cache doesn't notice when files are replaced with different data in filename collision

Callistemon
Posts: 85
Joined: Fri Jun 25, 2021 5:15 am

Hash cache doesn't notice when files are replaced with different data in filename collision

Post by Callistemon »

Hash caching uses Date Modified to determine whether files have changed. If the date modified has not changed, a file is not hashed again, on the assumption that its contents couldn't have changed without affecting the date modified. That isn't necessarily true: silent corruption can change a file's contents without changing its date. And even if there is little opportunity for corruption during the period the cache is in use, replacing a file in a filename collision is a quick way to end up with changed data that the cache never notices.
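To illustrate, something along these lines is presumably what the cache is doing (a rough sketch in Python, not Duplicate Cleaner's actual code): the stored hash is reused whenever the path and date modified match, so a replacement with the same name and same date keeps returning the stale hash.

```python
import hashlib
import os

# Hypothetical mtime-based hash cache, keyed by path.
# Illustrates the pitfall, not Duplicate Cleaner's real implementation.
_cache = {}  # path -> (mtime, sha1_hex)

def cached_sha1(path):
    mtime = os.path.getmtime(path)
    entry = _cache.get(path)
    if entry is not None and entry[0] == mtime:
        # Same name, same date modified: the stored hash is trusted even if
        # the file's contents were silently replaced or corrupted.
        return entry[1]
    with open(path, "rb") as f:
        digest = hashlib.sha1(f.read()).hexdigest()
    _cache[path] = (mtime, digest)
    return digest
```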

In my latest scan I had two unique files, one version of the same image on each disk, because one copy had silent corruption (this time not enough to be noticeable except by hash). After replacing one copy with the other and scanning again in Duplicate Cleaner, the same two unique files still show up, with different SHA-1 hashes. Sending both to Hash Tool shows they are actually identical. This is because replacing one file with another of the same name did not affect the date, since both copies had the same date modified.
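That situation is easy to reproduce outside Duplicate Cleaner: a copy that preserves metadata leaves the date modified identical, so a cache keyed on path and date keeps returning the old hash. A minimal sketch, reusing the hypothetical cache above and made-up paths:

```python
import shutil

# Assume "a/image.jpg" is the good copy and "b/image.jpg" is the silently
# corrupted one, and both carry the same date modified.
old = cached_sha1("b/image.jpg")            # hash of the corrupted copy gets cached
shutil.copy2("a/image.jpg", "b/image.jpg")  # replace it; copy2 preserves the timestamp
new = cached_sha1("b/image.jpg")            # cache hit: still the old, wrong hash
assert old == new  # the replacement went completely unnoticed
```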

The solution to this is to have two hash modes. On the initial scan, use SHA-1; on later scans, use the quicker MD5 when the cache does not show a change in date modified, and SHA-1 when the date has changed. Both levels of hash strength should be adjustable. Of course I deserve to be affected after posting about Shift + Delete, but this could affect good customers as well.
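A rough sketch of that two-mode idea (names and structure hypothetical, with MD5 standing in for the "quick" verification hash; both algorithms would be user-selectable):

```python
import hashlib
import os

# Hypothetical two-tier cache entry: path -> (mtime, strong_sha1, quick_md5)
_cache = {}

def _hash_file(path, algo):
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def scan(path):
    mtime = os.path.getmtime(path)
    entry = _cache.get(path)
    if entry is None or entry[0] != mtime:
        # New file or changed date: take the strong hash.
        sha1 = _hash_file(path, "sha1")
        md5 = _hash_file(path, "md5")
    else:
        # Unchanged date: re-read with the quick hash to catch silent changes,
        # and only fall back to SHA-1 if the quick hash no longer matches.
        md5 = _hash_file(path, "md5")
        sha1 = entry[1] if md5 == entry[2] else _hash_file(path, "sha1")
    _cache[path] = (mtime, sha1, md5)
    return sha1
```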
therube
Posts: 615
Joined: Tue Jun 28, 2011 4:38 pm

Re: Hash cache doesn't notice when files are replaced with different data in filename collision

Post by therube »

On the initial scan, use SHA-1; on later scans, use the quicker MD5 when the cache does not show a change in date modified.
Just babbling...

MD5 need not necessarily be quicker than SHA-1.
There is xxhash - which is probably the quickest hash out there.

There are caveats.

On a slow device, the hash chosen most likely will make no difference whatsoever.
(I've got a 30MB/s external HDD - a SLUG, & hash matters not.)

And there may be cases, perhaps depending on file size, where the faster xxhash may only do "as well" as (a slower) MD5/SHA-1...
And of course whatever method is used, the coding of such (or library used) has to be efficient.
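For what it's worth, the relative speeds are easy to measure on your own hardware. A minimal benchmark sketch (assumes the third-party xxhash package is installed; on a slow disk the I/O, not the hash, will dominate anyway):

```python
import hashlib
import time

import xxhash  # third-party package: pip install xxhash

def throughput(name, hasher, data, rounds=10):
    start = time.perf_counter()
    for _ in range(rounds):
        h = hasher()
        h.update(data)
        h.hexdigest()
    elapsed = time.perf_counter() - start
    print(f"{name}: {rounds * len(data) / elapsed / 1e6:.0f} MB/s")

data = b"\0" * (32 * 1024 * 1024)  # 32 MB of dummy data, hashed in memory
throughput("xxh64", xxhash.xxh64, data)
throughput("md5", hashlib.md5, data)
throughput("sha1", hashlib.sha1, data)
```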


As far as switching from SHA-1 to a "quicker" MD5 goes, I'm thinking that's not really going to be of benefit, because the only way to know for sure (that you don't have bitrot) is to scan both ends again & compare them to the existing (stored) hash value.
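In other words, detecting bitrot means a full re-read of both copies no matter which algorithm is used; something along these lines (a sketch, with a hypothetical stored_hashes mapping):

```python
import hashlib

def sha1_of(path):
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def check_pair(path_a, path_b, stored_hashes):
    """Re-read both copies and compare against the previously stored hash."""
    a, b = sha1_of(path_a), sha1_of(path_b)
    expected = stored_hashes.get(path_a) or stored_hashes.get(path_b)
    return a == b == expected  # False means one side has rotted (or was replaced)
```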


(Note that xxhash is not intended to be a cryptographic hash, which really doesn't matter in a use case like here.)

See also:

voidhash
voidhash, the ramble
Detecting and preventing HFS Plus/NTFS bit rot
Callistemon
Posts: 85
Joined: Fri Jun 25, 2021 5:15 am

Re: Hash cache doesn't notice when files are replaced with different data in filename collision

Post by Callistemon »

Maybe the hash algorithm wouldn't have an effect on speed if Duplicate Cleaner used a normal amount of CPU, but Duplicate Cleaner is so restricted with CPU that even with a very slow HDD, the CPU ends up determining how long the scan takes.