Suggestions: faster algorithm and cache
Posted: Sat Sep 19, 2009 12:26 am
Duplicate Cleaner is one of the most powerful duplicate finders, yet also one of the easiest to use. The one area where it falls behind is speed. That's because some other duplicate finders do byte-by-byte comparisons, which lets them stop reading as soon as two files differ.
Rather than scanning the entire file at once to generate a hash, scan the file in smaller blocks. If you have several large files to compare, this can save a lot of time when the files turn out to be different, because you can stop at the first block that doesn't match.
This can also slow down scans if the drive has to seek too much. So start with a small block, say 1MB, and double the block size as the data continues to match: 2MB, 4MB, 8MB, etc.
Another trick is to read the last piece of the file first. If the files are different, you're most likely going to find a difference at the beginning or end of the file.
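Here's roughly what I mean, as a quick Python sketch. It compares two candidate files directly rather than hashing them, and the block sizes and function names are just placeholders, not how Duplicate Cleaner actually works:

[code]
import os

def files_probably_equal(path_a, path_b, start_block=1024 * 1024, max_block=64 * 1024 * 1024):
    """Compare two files block by block, bailing out at the first difference."""
    size = os.path.getsize(path_a)
    if size != os.path.getsize(path_b):
        return False

    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        # Check the last block first: differences often show up at the ends.
        tail = min(start_block, size)
        fa.seek(size - tail)
        fb.seek(size - tail)
        if fa.read(tail) != fb.read(tail):
            return False

        # Then walk from the start, doubling the block size while data keeps matching.
        fa.seek(0)
        fb.seek(0)
        block = start_block
        remaining = size
        while remaining > 0:
            chunk = min(block, remaining)
            if fa.read(chunk) != fb.read(chunk):
                return False
            remaining -= chunk
            block = min(block * 2, max_block)

    return True
[/code]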
My second suggestion is to build a cache or index of scanned files. This way, instead of scanning the file again, you can simply load the hash from the cache.
Like synchronization software does, check the file size and 'last modified' date to determine whether a file has changed. If the file has changed, rescan it and update the cache.
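Again, just a rough sketch of the idea. The JSON cache file, the MD5 hash, and the names are all made up for illustration:

[code]
import hashlib
import json
import os

CACHE_FILE = "hash_cache.json"  # hypothetical cache location

def load_cache():
    try:
        with open(CACHE_FILE) as f:
            return json.load(f)
    except (OSError, ValueError):
        return {}

def save_cache(cache):
    with open(CACHE_FILE, "w") as f:
        json.dump(cache, f)

def cached_hash(path, cache):
    """Return the file's hash, rescanning only if its size or mtime changed."""
    stat = os.stat(path)
    entry = cache.get(path)
    if entry and entry["size"] == stat.st_size and entry["mtime"] == stat.st_mtime:
        return entry["hash"]

    # File is new or has changed: rescan it and update the cache entry.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1024 * 1024), b""):
            h.update(block)
    cache[path] = {"size": stat.st_size, "mtime": stat.st_mtime, "hash": h.hexdigest()}
    return h.hexdigest()
[/code]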