Duplicate Folders Problem/Bug

DigitalVolcano
Site Admin
Posts: 1725
Joined: Thu Jun 09, 2011 10:04 am

Re: Duplicate Folders Problem/Bug

Post by DigitalVolcano »

I suspect the odd results are from the 'Single file' groups being removed. Note they are dropped from the database, not just hidden, which is why it doesn't change again when toggling the setting.
I need to check this behavior and see how it affects the duplicate folder grouping when these files are dropped.

Note - a grey file means it isn't in the DC database.

Re: suggestions
1 - I'll take this into account - I'm working on updating the text to try and make some of the concepts less confusing. Your rewrites sound better :)

2 - I'm reworking this for v5

3 - Will probably split System folders and hidden files into two options (with hidden on by default).
StefanM
Posts: 10
Joined: Wed Sep 18, 2019 8:41 am

Re: Duplicate Folders Problem/Bug

Post by StefanM »

DigitalVolcano wrote: Sat Sep 21, 2019 5:03 pm
German Translation
After taking a closer look at the existing German translation, I found quite a few severe mistakes, including factually wrong content.
The second half reads like a Google translation. :(

So I decided to revise the language file.

I am sending you a PDF with all changes tracked, along with the revised language file, by mail to software@digitalvolcano.co.uk.
Hope it reaches you there :-)
StefanM

Re: Duplicate Folders Problem/Bug

Post by StefanM »

DigitalVolcano wrote: Sat Sep 21, 2019 5:03 pm 3- Will probably split System folders and hidden files into two options (with hidden on by default).
Having 'hidden' on by default is, in my eyes, not such a good idea:
Folders with pictures in them, for example, often also contain a hidden database file. If you let DCP compare such folders and the user deletes a supposed duplicate folder, an orphaned hidden database file will remain in the not-fully-deleted duplicate folder.
The majority of users will probably never change those default settings.
DigitalVolcano wrote: Sat Sep 21, 2019 5:03 pm Note - a grey file means it isn't in the DC database.
A few questions
As long as you are not giving away any company secrets, maybe you could answer a few questions for me. On the one hand, I am very interested in how DCP works; on the other hand, this information would make it a bit easier for me to find bugs.

But first, a question about a use case of mine:
Let's say I have folders with (duplicate) pictures. Folder A contains only 3 pictures, folder B contains 20. All 3 pictures in folder A have duplicates in folder B.

Of course, I want to delete the duplicates in folder A, the one with only 3 pictures.

If I know all of the above, it's easy to decide that folder B is the one to keep, and easy to mark the duplicates using 'Mark by location' in the Selection Assistant.

But is there a way to let DCP assist me in making that decision?
Of course, I could use 'Show Folder in Windows Explorer', but that is not practical, as I would have to do it for every duplicate.

Is there any way to achieve my goal with any settings/config?


And now, here are my questions on how DCP works.
Maybe it will save you some time if you just comment on my observations (right or wrong).

In the following, I assume that I am just searching for identical files, i.e. files with identical MD5 hashes.

Step 1:
DCP creates a list of all files and folders, including size, creation, and modification times.

Step 2:
It checks which files have identical sizes.

Step 3:
For all of those files, quick hashes are calculated to exclude files that already turn out to be different after this quick-hash comparison. In parallel, already in this step, a full MD5 hash is calculated for a number of files (those smaller than…?)

May I ask how the quick hash is calculated? Is it similar to the ed2k hash calculation, where larger files are split into chunks and a hash is calculated for each chunk separately?
I would do it that way, defining maybe just 3 small chunks per file (close to the start of the file, in the middle, and close to the end).
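A minimal sketch of the three-chunk scheme proposed above, written in Python. The chunk size, the choice of MD5, and mixing the file size into the hash are my own assumptions for illustration, not anything DCP actually does:

```python
import hashlib
import os

CHUNK = 4096  # bytes per sample; an arbitrary choice for this sketch

def three_chunk_quick_hash(path):
    """Hash three small samples of a file: near the start, in the
    middle, and near the end. Files whose quick hashes differ cannot
    be identical; files that match still need a full hash to confirm."""
    size = os.path.getsize(path)
    h = hashlib.md5()
    # Mix in the size so equal samples of different-sized files differ.
    h.update(str(size).encode())
    with open(path, "rb") as f:
        for offset in (0, max(0, size // 2 - CHUNK // 2), max(0, size - CHUNK)):
            f.seek(offset)
            h.update(f.read(CHUNK))
    return h.hexdigest()
```

The point of sampling the middle and end, rather than just the start, is to catch files that share a common header (many media formats do) but diverge later.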


Step 4:
For all remaining files that 'passed' the quick-hash comparison, a full MD5 hash is calculated.

Step 5:
According to the result, the duplicate file list and the duplicate folder list are populated.
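The steps above can be sketched in Python. This is my own minimal reconstruction of the described pipeline (size grouping, then a first-chunk quick hash, then a full MD5), not DCP's actual code:

```python
import hashlib
import os
from collections import defaultdict

def md5_of(path, first_chunk_only=False, chunk=65536):
    """MD5 of a whole file, or just its first chunk (the 'quick hash')."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        if first_chunk_only:
            h.update(f.read(chunk))
        else:
            for block in iter(lambda: f.read(chunk), b""):
                h.update(block)
    return h.hexdigest()

def find_duplicates(paths):
    """Return groups of files with identical content, narrowing the
    candidates by size first, then by quick hash, then by full MD5."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate
        by_quick = defaultdict(list)
        for p in same_size:
            by_quick[md5_of(p, first_chunk_only=True)].append(p)
        for candidates in by_quick.values():
            if len(candidates) < 2:
                continue  # quick hash already ruled these out
            by_full = defaultdict(list)
            for p in candidates:
                by_full[md5_of(p)].append(p)
            groups.extend(g for g in by_full.values() if len(g) > 1)
    return groups
```

Each stage only ever hashes files that survived the cheaper stage before it, which is why the quick hash saves so much I/O on large collections.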

Some additional questions on that:
As I read (and found out myself), the database file is encrypted, which is too bad, because that way I cannot edit it. :(

1. Can you tell me how DCP saves the hashes it has computed?
Is the key the exact path information? And if I change just one single letter in the path, will DCP no longer 'know' that file?

2. There is probably a size limit for the database file? What is this limit, and what happens once it has been reached (first in, first out)?

And I learned that there will be different database files in version 5 :) QUOTE: "to split the databases (caches, settings and scans) up and generate a new one for each scan."


OK, I hope that was not too many questions…
DigitalVolcano

Re: Duplicate Folders Problem/Bug

Post by DigitalVolcano »

re: Hidden files - I meant search for hidden files ON by default.


re: Use case (selecting the folder with fewer files). You can't do this at the moment, but I'm working on making the Selection Assistant smarter (e.g. mark all but one in each group, preferring X and then Y).

re: Quick hash - it's just a small chunk at the start of the file in v4, nothing too fancy. Then a full hash if it's a match.

DC stores hashes (the cache) by full path + filename, along with size, created and modified dates. If anything changes, the hash is recalculated.
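The caching rule just described can be sketched as a simple lookup keyed on path, size, and both timestamps; this is an illustration of the behavior, not DC's actual storage format:

```python
import hashlib
import os

# Cache keyed the way described: full path plus size, created and
# modified times. Any change to any component invalidates the entry.
cache = {}

def cached_md5(path):
    st = os.stat(path)
    key = (os.path.abspath(path), st.st_size, st.st_ctime, st.st_mtime)
    if key not in cache:
        h = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(65536), b""):
                h.update(block)
        cache[key] = h.hexdigest()
    return cache[key]
```

This also answers the earlier question about changing a single letter in the path: the content hash comes out the same, but the cache key no longer matches, so the file is re-hashed (exactly the drive-letter scenario mentioned below).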

There is no size limit and the database is never pruned - though even a DB with a million files won't be particularly large. You can clean the cache manually from the Options tab.

Yes, the scan, options and cache are split into separate files in V5. Also the scan files won't be encrypted so advanced users can edit/mark files via SQL if they like :)
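If the v5 scan files are plain SQLite, bulk-marking via SQL might look something like the following. The table and column names here are invented for illustration; the actual v5 schema isn't documented in this thread:

```python
import sqlite3

# Hypothetical scan-file layout -- names are assumptions, not v5's real schema.
con = sqlite3.connect(":memory:")  # a real scan file would be opened by its path
con.execute("CREATE TABLE scan_files (path TEXT, hash TEXT, marked INTEGER DEFAULT 0)")
con.executemany(
    "INSERT INTO scan_files (path, hash) VALUES (?, ?)",
    [
        ("D:/photos/A/img1.jpg", "9e107d9d"),
        ("D:/photos/B/img1.jpg", "9e107d9d"),
        ("D:/photos/B/img2.jpg", "e4d909c2"),
    ],
)
# Mark every file under folder A in one statement -- the kind of
# bulk operation that is tedious to do click-by-click in a GUI.
con.execute("UPDATE scan_files SET marked = 1 WHERE path LIKE 'D:/photos/A/%'")
```

A pattern like this would cover the drive-letter scenario mentioned below: a single `UPDATE ... SET path = replace(path, 'E:/', 'F:/')` instead of re-hashing everything.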
StefanM

Re: Duplicate Folders Problem/Bug

Post by StefanM »

DigitalVolcano wrote: Fri Sep 27, 2019 8:20 pm Also the scan files won't be encrypted so advanced users can edit/mark files via SQL if they like :)
Thanks for your answers.
And I really like your decision to let us edit/mark files via SQL.
I already have one scenario for that:
Just the drive letter of an external HDD had changed, so DC had to re-calculate all the hashes.
This would be one simple scenario where editing would be a very useful option.

And I had one more question which is important for me:
Is there an 'official' way to run DC portable?

Of course, I can also try myself to create such a version ;)

Thanks once again!
DigitalVolcano

Re: Duplicate Folders Problem/Bug

Post by DigitalVolcano »

If you copy or install it to a USB stick, it should run.

You can also modify the database.ini file in the program directory to store the program's database and settings on your USB drive.

The only thing is that the program will write licence information to the local machine when registered - it will ask for registration on each new machine.
StefanM

Re: Duplicate Folders Problem/Bug

Post by StefanM »

DigitalVolcano wrote: Sat Sep 28, 2019 12:21 pm The only thing is that the program will write licence information to the local machine when registered - it will ask for registration on each new machine.
Writing license information to the local machine when registered,
and
asking for registration on each new machine,
is not really 'portable'.

And, from what I can see, it does write a lot of information to the registry, e.g. last folders used, ...

I will send you some additional information on a 'real' portable version by mail...