Bit-for-bit identical files not seen as duplicates.

HiTechHiTouch
Posts: 15
Joined: Sun Jul 29, 2012 4:33 pm

Bit-for-bit identical files not seen as duplicates.

Post by HiTechHiTouch »

I've got a couple of LARGE trees of .jpg files, one 38G and the other 14G. They contain numerous binary duplicate files in a generally matching folder structure.

Another program, Beyond Compare*, finds 10G bytes -- 7500+ files -- with duplicate content and duplicate file/path names (from the top of each tree downward). The ONLY difference is that timestamps in the two trees differ by 1 hour (due to the way Windows handles daylight saving time).

(Many, many of these duplicates were created by my spouse copying files instead of moving them as she sorted pictures. They are true duplicate photos, courtesy of Windows Explorer drag and drop between different disk volumes.)

So I fired up DC, added the root of each tree to Scan Locations, deselected "scan against self". On the Search Criteria tab, I selected the "Image Mode" sub-tab and set 100% similar, un-checking all boxes.

In particular "Same Created Date" and "Same Modified Date" are unchecked, suggesting that the file timestamps will NOT be considered in the comparison.

After "Scan Now", the summary shows "32227/32227 Files Scanned (40.5 GB)" with "0 Groups of duplicated" and "0 Files have duplicates (0 Bytes)"

So what am I doing wrong here, please?


If I change the % to 99%, I get the zillion duplicates I expect, but ...

At 99%, groups now include pictures which were saved at reduced jpg quality levels to make the file sizes smaller. I was hoping that reduced quality/smaller sizes would not be considered duplicates until I drop the match percentage a few more points.

-----
* Beyond Compare by Scooter Software is aimed at source code control, does not recognize the content of audio or pictures as such, and requires the folder tree structures to match. But for bit-to-bit compares, it's unimpeachable.
therube
Posts: 615
Joined: Tue Jun 28, 2011 4:38 pm

Re: Bit-for-bit identical files not seen as duplicates.

Post by therube »

> So what am I doing wrong here, please?

AFAICT, nothing*.

I'll confirm that a 100% image scan, with or without any other selections, never finds any dups.
I guess you need an understanding of what these "metrics" are and how they play into the image duplicate finding process. (I don't know that it is documented anywhere.)

* If you know ahead of time, or even just believe, that the files have identical content, then using Image (or Audio) Mode is going to be highly inefficient compared to Normal Mode.

Better to knock out what you can, quickly & efficiently, using Normal Mode, then fine-tune your search criteria using other "lossy" methods to weed out "other dups" (like same except for tag or ...).
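
For anyone wanting to see what a content-only exact pass looks like outside the program, here is a minimal Python sketch. It is not how Duplicate Cleaner's Normal Mode is implemented (I don't know its internals); the function names `file_digest` and `find_exact_duplicates` are made up for illustration. It groups files by size first, then by SHA-256 of the content, and never looks at names or timestamps, so a 1 hour DST offset makes no difference.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_exact_duplicates(roots):
    """Group files under `roots` by content only (names and timestamps ignored)."""
    by_size = defaultdict(list)
    for root in roots:
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a byte-identical twin
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        # keep only digests shared by two or more files
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Calling `find_exact_duplicates([tree_a, tree_b])` returns a list of groups, each a sorted list of paths with identical bytes. The size pre-filter means most files in a photo tree are never hashed at all.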
HiTechHiTouch
Posts: 15
Joined: Sun Jul 29, 2012 4:33 pm

Re: Bit-for-bit identical files not seen as duplicates.

Post by HiTechHiTouch »

I'd definitely like some documentation on what the %ages are.

The compare problem is multidimensional:
1. How close is the metadata? Which has 'better' metadata?
2. How do the images differ from editing -- Cropping? Resizing? Compression?
3. How similar are the images, for example multiple snaps of the same scene?

The matching algorithm seems to find case #3 pretty well -- and that's the hardest case. Case 2 could still be a lot of work given the various resizing algorithms. Case 1 is easy until you try to merge conflicting information.

Bottom line, at this point I need to find pixel identical (unedited) images and, where the metadata differs, to know which duplicate is the preferred one (e.g. it has captions and/or people but the other doesn't). Later I will want similar images (same scene, camera/subject slightly moved), ignoring metadata.

So how do I ask for the kind of comparison I want?

Bit by bit falls short because .jpg (and others) are package files. Files could have identical payloads stowed in different order, which requires a "logical" comparison. And music has the same problems as photos...
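
To make the "logical" comparison idea concrete, here is a rough Python sketch that hashes a JPEG while skipping the metadata segments. Assumptions, stated loudly: it treats every APPn segment (Exif, XMP, JFIF, ICC profile, ...) and every COM segment as metadata, which is a simplification since a few of those can affect how the picture is rendered; it does not handle the same segments stored in a different order; and the name `jpeg_image_digest` is hypothetical, not part of any existing tool.

```python
import hashlib
import struct

# Assumption: APP0..APP15 (0xE0-0xEF) and COM (0xFE) carry only metadata.
METADATA_MARKERS = set(range(0xE0, 0xF0)) | {0xFE}


def jpeg_image_digest(data):
    """Hash a JPEG byte string, ignoring APPn and COM segments before the scan."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream (missing SOI marker)")
    h = hashlib.sha256()
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        marker = data[pos + 1]
        if marker == 0xFF:
            pos += 1  # fill byte before the real marker
            continue
        if marker == 0xDA:
            # Start of scan: hash everything from here to the end of the data
            # (compressed image data, any later scans, EOI, trailing bytes).
            h.update(data[pos:])
            return h.hexdigest()
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if length < 2:
            raise ValueError("malformed segment length at offset %d" % pos)
        if marker not in METADATA_MARKERS:
            # Quantization/Huffman tables, frame header, etc. define the image.
            h.update(data[pos:pos + 2 + length])
        pos += 2 + length  # the length field counts itself but not the marker
    raise ValueError("no SOS marker found")
```

Two files whose digests match have the same compressed image data and tables, even if one carries captions or people tags and the other does not. It says nothing about which copy has the "better" metadata; that would still need a separate look at the skipped segments.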
DigitalVolcano
Site Admin
Posts: 1731
Joined: Thu Jun 09, 2011 10:04 am

Re: Bit-for-bit identical files not seen as duplicates.

Post by DigitalVolcano »

The % match is very fuzzy. It works by comparing reduced, normalized versions of the images against each other, so it is great and fast for finding similar images, but not ideal if you want exact matches only.
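
As a rough illustration of the general idea only (this is not Duplicate Cleaner's actual algorithm, and the 8x8 grid, the min/max normalization and all function names are assumptions chosen for the sketch), the following Python shrinks two grayscale images to a small grid, normalizes them, and scores how close they are. Images are plain lists of rows of 0-255 values and must be at least `size` pixels in each dimension.

```python
def shrink(pixels, size=8):
    """Block-average a grayscale image (rows of 0-255 values) to size x size."""
    height, width = len(pixels), len(pixels[0])
    small = []
    for by in range(size):
        y0, y1 = by * height // size, (by + 1) * height // size
        row = []
        for bx in range(size):
            x0, x1 = bx * width // size, (bx + 1) * width // size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(block) / len(block))
        small.append(row)
    return small


def normalize(small):
    """Stretch the reduced image to the 0..1 range and flatten it."""
    flat = [v for row in small for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid dividing by zero on a flat image
    return [(v - lo) / span for v in flat]


def similarity_percent(a, b, size=8):
    """Score two images from 0 to 100 by mean difference of their reduced forms."""
    na, nb = normalize(shrink(a, size)), normalize(shrink(b, size))
    diff = sum(abs(x - y) for x, y in zip(na, nb)) / len(na)
    return 100.0 * (1.0 - diff)
```

In a scheme like this, fine detail is averaged away before anything is compared, so a copy with small pixel-level noise, or a half-size copy of the same picture, can score at or very near the top of the scale. That is why a percentage of this kind cannot separate "identical" from "recompressed" or "resized".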

A new option has been added in 3.0.8: 'Match Resolution'. This will prevent it from grouping full resolution images with low resolution thumbnails.

A gap in Duplicate Cleaner, as you have pointed out, is exact matching of image data (independent of metadata, format, etc.) - something that may be added in a future update.