Suggestions: faster algorithm and cache

Post by boyallo »

Duplicate Cleaner is one of the most powerful duplicate finders, and also one of the easiest to use. The one area where it falls behind is speed: some other duplicate finders are faster because they compare files byte by byte and can stop at the first difference, instead of hashing every file in full.
Rather than reading the entire file at once to generate a hash, compare candidate files in smaller blocks and stop as soon as a block differs. If you have several large files to compare, this saves a lot of time whenever the files turn out to be different.
Reading in small blocks can also slow a scan down if the drive has to seek too much, so start with a small block, say 1MB, and double the block size as long as the data keeps matching: 2MB, 4MB, 8MB, and so on.
Another trick is to read the last block of each file first. If two files are different, the difference is most likely near the beginning or the end of the file.
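
Roughly what I mean, as a Python sketch (the function name, block sizes and the 64MB cap are just illustration, not how Duplicate Cleaner works internally):

[code]
import os

def files_equal(path_a, path_b, start_block=1 << 20, max_block=1 << 26):
    """Compare two files block by block, bailing out at the first difference.

    Checks the final block first (differences often show up at the start or
    the end), then walks forward from the beginning with 1MB, 2MB, 4MB, ...
    blocks, so large files that differ need only a handful of reads.
    """
    size = os.path.getsize(path_a)
    if size != os.path.getsize(path_b):
        return False

    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        # Check the tail of the file first.
        tail = min(start_block, size)
        fa.seek(-tail, os.SEEK_END)
        fb.seek(-tail, os.SEEK_END)
        if fa.read(tail) != fb.read(tail):
            return False

        # Then walk forward from the start with a doubling block size.
        fa.seek(0)
        fb.seek(0)
        block = start_block
        remaining = size
        while remaining > 0:
            chunk = min(block, remaining)
            if fa.read(chunk) != fb.read(chunk):
                return False
            remaining -= chunk
            block = min(block * 2, max_block)
    return True
[/code]

The forward pass re-reads the tail block at the very end, which wastes one read but keeps the logic simple.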

My second suggestion is to build a cache or index of scanned files, so that on the next scan the hash can simply be loaded from the cache instead of hashing the file again.
Like synchronization software does, check the file size and 'last modified' date to decide whether the file has changed. If it has changed, rescan it and update the cache.
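
A rough Python sketch of the kind of cache I mean (the JSON file and SHA-256 are just placeholder choices; any on-disk format and hash would do):

[code]
import hashlib
import json
import os

CACHE_FILE = 'hash_cache.json'   # hypothetical location for the cache


def load_cache(path=CACHE_FILE):
    """Load the cache from disk, or start with an empty one."""
    try:
        with open(path, 'r', encoding='utf-8') as f:
            return json.load(f)
    except (OSError, ValueError):
        return {}


def save_cache(cache, path=CACHE_FILE):
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(cache, f)


def cached_hash(file_path, cache):
    """Return the file's SHA-256, reusing the cache when size and mtime match."""
    st = os.stat(file_path)
    entry = cache.get(file_path)
    if entry and entry['size'] == st.st_size and entry['mtime'] == st.st_mtime:
        return entry['hash']          # unchanged since the last scan: skip re-reading it

    h = hashlib.sha256()
    with open(file_path, 'rb') as f:
        for block in iter(lambda: f.read(1 << 20), b''):
            h.update(block)
    digest = h.hexdigest()
    cache[file_path] = {'size': st.st_size, 'mtime': st.st_mtime, 'hash': digest}
    return digest
[/code]

An SQLite database would work just as well for large collections; the key point is that size plus 'last modified' is a cheap way to detect changes without re-reading the file.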

Post by DV »

Thanks for the info - byte-by-byte checks are slated for the next generation of Duplicate Cleaner (2.0).
Caching is also an interesting idea if there is demand for it.

Post by citizen »

Interesting suggestion boyallo, that could make it much faster for some file sets.
Perhaps a hybrid: if the sizes match, compare a CRC32 of just the last block (or the first and last blocks), and only if those match continue with a full hash (MD5, optionally followed by SHA-256).
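
Something like this rough Python sketch of the staged approach (the 1MB block and the MD5 confirmation step are just example choices):

[code]
import hashlib
import os
import zlib
from collections import defaultdict

BLOCK = 1 << 20  # 1MB block for the cheap partial check (arbitrary choice)


def partial_crc(path):
    """CRC32 over the first and last block only - cheap early rejection."""
    size = os.path.getsize(path)
    with open(path, 'rb') as f:
        crc = zlib.crc32(f.read(min(BLOCK, size)))
        if size > BLOCK:
            f.seek(-BLOCK, os.SEEK_END)
            crc = zlib.crc32(f.read(BLOCK), crc)
    return crc


def full_md5(path):
    """Full-file MD5, only computed for files that survive the earlier stages."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(BLOCK), b''):
            h.update(block)
    return h.hexdigest()


def find_duplicates(paths):
    # Stage 1: group by file size.
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    dupes = []
    for group in by_size.values():
        if len(group) < 2:
            continue
        # Stage 2: within each size group, group by CRC32 of first/last block.
        by_crc = defaultdict(list)
        for p in group:
            by_crc[partial_crc(p)].append(p)
        # Stage 3: confirm the survivors with a full hash.
        for candidates in by_crc.values():
            if len(candidates) < 2:
                continue
            by_hash = defaultdict(list)
            for p in candidates:
                by_hash[full_md5(p)].append(p)
            dupes.extend(g for g in by_hash.values() if len(g) > 1)
    return dupes
[/code]

Most non-duplicates would be eliminated by the size and CRC32 stages, so only genuine candidates ever get read in full.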