Slowness when 'Same Content' is not used.
Posted: Fri Sep 03, 2010 3:03 pm
Dear Duplicate Cleaner developer!
I would like to know why a given operation is so slow. I selected some folders and told the program that duplicates are files with the same name, same size and same date. Duplicate Cleaner scans all the folders and files in less than 20 seconds, finding about 9,500 folders and about 400,000 files. It then takes more than 20 minutes to show the duplicates, consuming 100% of the CPU during this whole period!
I don't understand why this has to be so slow and CPU-intensive, since there is no need to read file contents. I wrote a small program that works as follows (a rough sketch is shown after the list):
1. Create a list where each entry is a record for one file found.
2. Each record has the following fields: 'name without path', date, size, 'path without name', and group. The group field is initially zero.
3. Sort the list by 'name without path', date, size, 'path without name'.
4. Iterate over the list to assign group numbers. The first group is 1, and each time the next record has a different value for the key ('name without path', date, size), the group number is incremented.
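Here is a minimal sketch of that approach. The folder argument and field names are my own placeholders, and I am using the file modification time as the date:

import os

def find_duplicate_groups(root):
    records = []
    # Steps 1-2: collect one record per file (name, date, size, path, group).
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                stat = os.stat(os.path.join(dirpath, name))
            except OSError:
                continue  # skip files that cannot be read
            records.append({
                "name": name,
                "date": int(stat.st_mtime),
                "size": stat.st_size,
                "path": dirpath,
                "group": 0,
            })

    # Step 3: sort by name, date, size, then path.
    records.sort(key=lambda r: (r["name"], r["date"], r["size"], r["path"]))

    # Step 4: assign group numbers, incrementing whenever the
    # (name, date, size) key changes.
    group = 0
    previous_key = None
    for record in records:
        key = (record["name"], record["date"], record["size"])
        if key != previous_key:
            group += 1
            previous_key = key
        record["group"] = group

    return records

if __name__ == "__main__":
    for r in find_duplicate_groups("."):
        print(r["group"], r["name"], r["size"])

Any record sharing a group number with another record is a duplicate candidate; no file contents are ever read.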
Doing so, I can find the duplicates in less than 4 seconds! It takes more time to scan the folders than to find the duplicates. Am I missing something important?