New duplicate criteria
Posted: Sun Dec 17, 2017 3:17 pm
Hi,
Sometimes I need to find similar but not identical files, and none of the criteria currently available in Duplicate Cleaner can help.
Therefore I suggest adding 2 new criteria.
1) Similar size: let the user decide a tolerance for sizes. Let's say the user decides to tolerate a 5% difference; then a file 1.000 bytes long can match files from 950 to 1.050 bytes. Of course this is mutually exclusive with same size.
2) Similar name: this is trickier. You could use a similarity algorithm (fuzzy search), like this https://en.wikipedia.org/wiki/Approxima ... g_matching and let the user decide how similar names must be to be considered identical. This is mutually exclusive with same name.
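To make the two criteria concrete, here is a minimal sketch in Python. It is only an illustration, not how Duplicate Cleaner works: the function names, the percentage parameter and the 0.8 threshold are my own assumptions, and difflib's SequenceMatcher is just one of many possible fuzzy matchers.

```python
from difflib import SequenceMatcher


def sizes_match(reference_size: int, other_size: int, tolerance_percent: float = 5) -> bool:
    """Criterion 1: other_size lies within +/- tolerance_percent of reference_size.

    Integer arithmetic is used so the boundary values (950 and 1050 for a
    1000-byte reference at 5%) are included exactly.
    """
    return abs(reference_size - other_size) * 100 <= tolerance_percent * reference_size


def names_match(name_a: str, name_b: str, threshold: float = 0.8) -> bool:
    """Criterion 2: case-insensitive similarity ratio (0.0 to 1.0) reaches the threshold.

    The threshold is a hypothetical user setting, not a recommended value.
    """
    ratio = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    return ratio >= threshold
```

Note that the tolerance here is measured against the first file, as in the 950 to 1.050 example; a real implementation would have to decide which file of a pair is the reference, or use the larger of the two.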
The actual use case is having tons of files created with different zip compression levels (so slightly different sizes) and with different naming conventions.
Let's say that I have
01-kittens.zip 1.000 bytes
1-Kittens.zip 998 bytes
They appear different, but in my context I should consider them equal, so I would use a size tolerance of 3% and a suitable similarity index (it depends on the actual algorithm you would implement for fuzzy search) in order to find this "duplicate".
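As a rough check of this example, here is a sketch that uses normalized Levenshtein edit distance as the similarity index. That choice, the names probable_duplicate and name_similarity, and the 0.9 minimum similarity are all assumptions for illustration; any other fuzzy algorithm could be swapped in.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions or substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        curr = [i]
        for j, ch_b in enumerate(b, 1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def name_similarity(name_a: str, name_b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical names."""
    a, b = name_a.lower(), name_b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest


def probable_duplicate(file_a, file_b, size_tolerance_percent=3, min_similarity=0.9):
    """Combine both criteria; each file is a (name, size_in_bytes) tuple."""
    (name_a, size_a), (name_b, size_b) = file_a, file_b
    # Tolerance is taken relative to the larger file so the check is symmetric.
    size_ok = abs(size_a - size_b) * 100 <= size_tolerance_percent * max(size_a, size_b)
    return size_ok and name_similarity(name_a, name_b) >= min_similarity


print(probable_duplicate(("01-kittens.zip", 1000), ("1-Kittens.zip", 998)))
```

After lowercasing, the two names differ by a single edit (the leading "0") and the sizes differ by 0.2%, so with these settings the pair is reported as a probable duplicate.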
Of course it is up to the user to apply those criteria with responsibility and combine them with others in order to actually find duplicates, but I think they would really help to make this great program even better.
Regards
Luca