failure to detect duplicates
Does Duplicate Cleaner have any known bugs or issues with xpt files? I have a set of xpt files that I'm fairly certain are duplicates. A byte-level scan reports all of them as duplicates, but an MD5 hash scan reports none. Any thoughts? Thanks!
Re: failure to detect duplicates
File type should not matter, particularly.
MD5 is known to have collisions.
Byte level is going to be the most thorough, & if that says they are dups, then you would think they are.
If you ... oops. That's backwards.
---
If byte level shows them as dups, then you would think that MD5 would too, unless there is a bug in the MD5 implementation?
What do SHA-1 & SHA-256 show?
Are these relatively small files (from Mozilla?)? If so, zip them up & upload them somewhere.
(Actually looks like you may be able to upload directly here to the board.)
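For anyone wanting to cross-check the result outside Duplicate Cleaner, here is a minimal Python sketch of the comparison discussed above. It is not how Duplicate Cleaner works internally; the function names `file_digest` and `compare_files` are made up for illustration, and it only uses the standard `filecmp` and `hashlib` modules.

```python
import filecmp
import hashlib


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Hash a file's raw bytes in chunks; the file extension plays no role."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_files(path_a, path_b):
    """Return byte-level and per-hash verdicts for one pair of files."""
    # shallow=False forces a real content comparison instead of a stat() check
    result = {"bytes": filecmp.cmp(path_a, path_b, shallow=False)}
    for algorithm in ("md5", "sha1", "sha256"):
        result[algorithm] = (
            file_digest(path_a, algorithm) == file_digest(path_b, algorithm)
        )
    return result
```

Calling `compare_files("a.xpt", "b.xpt")` (hypothetical file names) on two of the suspect files returns one True/False per method. Files with identical bytes always produce identical digests, so if this reports a byte-level match the hashes will match too, which would point at the scan settings or the tool rather than at the files.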
Re: failure to detect duplicates
Thanks, Rube. I ran SHA-1 & SHA-256 and the results are the same as MD5: no duplicates. This is puzzling, since the byte-level scan says they match, the properties are exactly the same, and the content appears the same on thorough inspection. I've only seen this occur with xpt files thus far.
Has anyone else used Duplicate Cleaner on any xpt files (or other statistical modeling program output files)?
- DigitalVolcano
- Site Admin
Re: failure to detect duplicates
The file type shouldn't really make a difference - DC just treats everything as binary data.
Do you have any other options specified (e.g. same date, same filename, etc.)?
How many files are affected? What size are the files? I'd be interested in screenshots from scans on both sets, if that's possible.
thanks!
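To answer the "how many files, what size" questions independently of the tool, a rough Python sketch like the one below could group a folder's files by size and SHA-256 digest. The helper names and the `*.xpt` pattern are assumptions for illustration only, and it reads each file whole, so it suits small files.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def group_by_content(folder, pattern="*.xpt"):
    """Map (size in bytes, SHA-256 digest) -> list of paths with that content."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).rglob(pattern)):
        if path.is_file():
            data = path.read_bytes()  # whole-file read; chunk this for large files
            key = (len(data), hashlib.sha256(data).hexdigest())
            groups[key].append(path)
    return groups


def duplicate_groups(folder, pattern="*.xpt"):
    """Keep only the groups that contain more than one file."""
    return {
        key: paths
        for key, paths in group_by_content(folder, pattern).items()
        if len(paths) > 1
    }
```

Each key in the result of `duplicate_groups(...)` carries the file size, and each value lists the files sharing that content, which gives the counts and sizes asked about.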
Re: failure to detect duplicates
MaxG, if you don't find duplicates with MD5, don't use MD5. Fairly simple.