Hash cache doesn't notice when files are replaced with different data in filename collision
Posted: Fri Feb 04, 2022 1:31 am
Hash caching uses Date Modified to decide whether a file has changed. If the date modified is unchanged, the file is not hashed again, on the assumption that its contents cannot change without the date changing too. That assumption does not always hold: silent corruption can alter a file without touching the date. Even if there is little opportunity for corruption while the cache is in use, replacing a file in a filename collision is a quick way to change its contents silently, because the date may stay the same.
My latest scan showed two unique files, one copy of the same image on each disk, because one copy had silent corruption (this time detectable only by hash). After I replaced one copy with the other and scanned again in Duplicate Cleaner, the same two files still showed as unique, with different SHA-1 hashes. Sending both to Hash Tool shows they are actually identical. This happens because replacing one file with another of the same name did not change the date: both copies had the same date modified, so the cached hash was reused.
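To make the failure concrete, here is a minimal Python sketch of a cache keyed on Date Modified. It is not Duplicate Cleaner's actual code; `cached_sha1`, `demo`, the file name and the timestamp are all made up for illustration. It only assumes the behaviour described above: an unchanged date means the stored hash is reused.

```python
import hashlib
import os
import tempfile


def sha1_of(path):
    """Hash the file contents directly, ignoring any cache."""
    with open(path, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest()


def cached_sha1(path, cache):
    """Hypothetical cache lookup: rehash only if Date Modified changed."""
    mtime = os.stat(path).st_mtime_ns
    entry = cache.get(path)
    if entry is not None and entry[0] == mtime:
        return entry[1]  # trusted without reading the file
    digest = sha1_of(path)
    cache[path] = (mtime, digest)
    return digest


def demo():
    cache = {}
    stamp = 1_600_000_000 * 10**9  # arbitrary fixed Date Modified, in ns
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "image.jpg")

        # First scan: the corrupted copy is hashed and cached.
        with open(path, "wb") as f:
            f.write(b"corrupted copy")
        os.utime(path, ns=(stamp, stamp))
        first = cached_sha1(path, cache)

        # Replace it with the good copy, which carries the same date.
        with open(path, "wb") as f:
            f.write(b"good copy")
        os.utime(path, ns=(stamp, stamp))

        stale = cached_sha1(path, cache)  # second scan, served from cache
        actual = sha1_of(path)            # what a fresh hash reports
        return first, stale, actual
```

In this sketch the second scan returns the hash of the old, corrupted contents, while hashing the file directly gives a different value. That matches what I saw between the scan results and Hash Tool.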
A solution would be to have two hash modes. On the initial scan, use SHA-1. On later scans, use the quicker MD5 when the cache shows no increase in date modified, and SHA-1 when the date has increased. Both levels of hash strength should be adjustable. Of course I deserve to be affected after posting about Shift + Delete, but this could affect good customers as well.
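Here is a rough Python sketch of one way the two modes could work. It is only my interpretation, not anything Duplicate Cleaner does. It assumes the cache stores both a strong and a quick hash from the initial scan, and it treats any change in date as a reason to rehash, not only an increase. The names `scan`, `digests` and `demo` are hypothetical.

```python
import hashlib
import os
import tempfile


def digests(path, algorithms):
    """Compute several hashes of a file in a single read pass."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            for h in hashers.values():
                h.update(chunk)
    return {name: h.hexdigest() for name, h in hashers.items()}


def scan(path, cache, strong="sha1", quick="md5"):
    """Return (strong_digest, status); both algorithms are adjustable."""
    mtime = os.stat(path).st_mtime_ns
    entry = cache.get(path)
    if entry is not None and entry["mtime"] == mtime:
        # Date unchanged: verify with the quick hash instead of trusting it.
        if digests(path, [quick])[quick] == entry["quick"]:
            return entry["strong"], "verified"
        status = "changed-same-date"
    else:
        status = "new" if entry is None else "date-changed"
    # Initial scan, changed date, or failed quick check: full rehash.
    d = digests(path, [strong, quick])
    cache[path] = {"mtime": mtime, "strong": d[strong], "quick": d[quick]}
    return d[strong], status


def demo():
    cache, statuses = {}, []
    stamp = 1_600_000_000 * 10**9  # arbitrary fixed Date Modified, in ns
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "image.jpg")

        def write(content, when):
            with open(path, "wb") as f:
                f.write(content)
            os.utime(path, ns=(when, when))

        write(b"corrupted copy", stamp)
        statuses.append(scan(path, cache)[1])  # initial scan
        statuses.append(scan(path, cache)[1])  # rescan, nothing changed
        write(b"good copy", stamp)             # replaced, same date
        statuses.append(scan(path, cache)[1])
        write(b"good copy", stamp + 10 * 10**9)  # same data, later date
        statuses.append(scan(path, cache)[1])
    return statuses
```

One caveat with this approach: the quick check still has to read the whole file, so the saving is only in hash computation, not disk I/O. Whether that is worthwhile would depend on how much faster the quick algorithm is on the machine in question.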