Similar Content Search Criteria

The best solution for finding and removing duplicate files.
david15151500
Posts: 2
Joined: Fri Mar 23, 2012 6:50 pm

Similar Content Search Criteria

Post by david15151500 »

I am testing the trial version of Duplicate Cleaner 3.0.4 and set up some test files to see how the program works. I saved a Word 2010 document as a .docx and made a copy of it in the same folder. I opened the copy and swapped a single 'a' with a single 'e' so that the files would be marginally different (the file sizes differ by less than 1 KB). I then saved the copy as a Word 97-2003 document so that it has the .doc extension.

When I run a search for files with similar content, the program never identifies those two documents as similar. I've tried both 90% and 60% similarity, and neither setting finds them. To test whether .docx versus .doc mattered, I copied the .docx file again and renamed the copy with an .xlsx extension without changing the content. That file is found as a duplicate at both 90% and 60%, so I don't think the file extension is the cause.

Are there any known bugs in the similar content percentage function, or do you have a way to fix this?

Thanks.
DigitalVolcano
Site Admin
Posts: 1731
Joined: Thu Jun 09, 2011 10:04 am

Re: Similar Content Search Criteria

Post by DigitalVolcano »

Docx and doc are totally different formats internally. Docx is basically a zip file, so I'm not surprised there is no similarity. I'm not sure there is a way around this when comparing at the binary level.
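
To illustrate the point about the two formats, here is a minimal Python sketch that looks at the leading bytes of a file rather than its extension. It is not part of Duplicate Cleaner and says nothing about how the program compares files. The helper names `sniff_container` and `make_zip` are made up for this example, and the in-memory archive is only a stand-in for a real .docx.

```python
import io
import zipfile

# Well-known leading "magic" bytes of the two container formats.
ZIP_MAGIC = b"PK\x03\x04"                         # ZIP local file header (.docx, .xlsx)
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE2 compound file (.doc, .xls)


def sniff_container(data: bytes) -> str:
    """Guess the container format from the first bytes, ignoring the extension."""
    if data.startswith(ZIP_MAGIC):
        return "zip"
    if data.startswith(OLE2_MAGIC):
        return "ole2"
    return "unknown"


def make_zip(member_name: str, payload: bytes) -> bytes:
    """Build a tiny in-memory ZIP archive (hypothetical stand-in for a .docx)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(member_name, payload)
    return buf.getvalue()


fake_docx = make_zip("word/document.xml", b"<w:document>hello</w:document>")
# Only the 8-byte signature plus padding; this is not a valid .doc file.
fake_doc_header = OLE2_MAGIC + b"\x00" * 16

print(sniff_container(fake_docx))        # zip
print(sniff_container(fake_doc_header))  # ole2
```

This would also explain why renaming a .docx copy to another extension still matches: the bytes inside are unchanged, only the name differs.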
david15151500
Posts: 2
Joined: Fri Mar 23, 2012 6:50 pm

Re: Similar Content Search Criteria

Post by david15151500 »

I see, so finding similarities between .docx files is difficult because the file contents are compressed.

I created two very similar .docx files, and the 60% filter did not find them as duplicates. I then resaved each file as a .doc without changing the content, and the 60% filter found them.
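
One possible workaround outside the program, sketched here under stated assumptions: unpack the archive and compare the decompressed main part instead of the raw file bytes. The Python below builds two tiny in-memory archives that stand in for .docx files (a real .docx has many more parts), changes one 'a' to 'e' in the second, and prints a byte-level similarity ratio for the raw archives and for the extracted XML. The ratio comes from Python's `difflib` and is only an assumption for illustration; it is not the measure Duplicate Cleaner uses. The function names are hypothetical.

```python
import difflib
import io
import random
import zipfile

MEMBER = "word/document.xml"  # main part of a real .docx; the archive here is a stand-in


def make_fake_docx(body: str) -> bytes:
    """Build an in-memory ZIP with one deflated member (hypothetical mini .docx)."""
    info = zipfile.ZipInfo(MEMBER, date_time=(2012, 3, 23, 18, 50, 0))  # fixed timestamp
    info.compress_type = zipfile.ZIP_DEFLATED
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(info, "<w:document>" + body + "</w:document>")
    return buf.getvalue()


def extract_main_part(archive: bytes) -> bytes:
    """Return the decompressed bytes of the main document part."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return zf.read(MEMBER)


def similarity(a: bytes, b: bytes) -> float:
    """Byte-level similarity in [0, 1] using difflib (an assumed measure)."""
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


# Deterministic filler text so the example is repeatable.
rng = random.Random(0)
words = ["duplicate", "cleaner", "similar", "content", "search", "criteria",
         "folder", "document", "character", "percent", "filter", "archive"]
original = " ".join(rng.choice(words) for _ in range(400))
edited = original.replace("a", "e", 1)  # change a single character

docx_a = make_fake_docx(original)
docx_b = make_fake_docx(edited)

raw_ratio = similarity(docx_a, docx_b)
text_ratio = similarity(extract_main_part(docx_a), extract_main_part(docx_b))
print(f"archive bytes: {raw_ratio:.3f}")
print(f"extracted XML: {text_ratio:.3f}")
```

The extracted XML differs by a single byte, so its ratio stays very close to 1. The raw archives also differ in their checksums and compressed data, so their ratio comes out lower. How much lower depends on the content and the compressor, and documents saved by Word may differ in more places than this toy example.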