Hash Database For Offline Duplicate Cleaning

Ryan
Posts: 11
Joined: Fri May 16, 2014 3:25 am

Post by Ryan »

First I want to take a moment to congratulate the digitalvolcano team (however many that is) for what I believe is the best duplicate checking software ever on the market. As a bit of an aficionado of this category for many years, I have to say that your software is the most powerful I have ever seen (except for one feature that I will get to in a moment). Especially impressive is the Status, Protected, Master, Scan Again, Find uniques, Scan subfolders options, not only for the innovative nature of a few of these, but also for the beautiful way they are implemented. It's truly brilliant!

I thought I would rehash (if you will excuse my pun) a feature that I have suggested a couple of times in the past, in the hope that it might now be something that could be put on the feature list if you are looking for more innovative new features to add. Very few duplicate finders that I know of have ever had this feature; I am only aware of two. I think I may have actually sent you the trial version of one of them a couple of years ago when we initially discussed it via email.

Here is what I am talking about.

Currently, you can only add paths to the "Folders to search" list if the storage device with the folder is actually attached to your computing device.

But what if you could add virtual storage devices that represent files that are not physically attached? That way you could dupe "live" files against files that are not currently present, or even "offline" files against other offline files and generate a list of duplicates (or if you want to dream of really far out stuff, generate an executable that you could run on the remote storage device to delete the duplicate files... just dreaming :)

Think of it... you could dupe against zillions of files and you don't even need to have the actual files present. What you have instead is a database of their hashes.

So DCP would allow you to create named databases of file hashes (containing say filename, date, size and hash... for my taste an SQLite database would be awesome). For example you could have a database of files stored on a DVD called "My Vacation Photos." Then you could load and unload the named databases in the same way you include a path in the "Folders to search" pane and use them the same way you use the live files. But you don't even need to have the media they are stored on attached! It's a virtual path. So in the example, you could say, delete any duplicate files on your attached hard drive that you already have on the "My Vacation Photos" DVD. And you don't even have to connect the DVD!
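A minimal sketch of what building such a named hash database might look like, assuming SQLite and SHA-256. The table name "files", its columns, and the function build_hash_db are hypothetical choices for illustration; they are not Duplicate Cleaner's actual format.

```python
import hashlib
import os
import sqlite3


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_hash_db(root, db_path):
    """Walk `root` and record name, relative path, size, date and hash per file.

    Hypothetical schema; keep `db_path` outside `root` so the database
    file is not hashed into itself.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " rel_path TEXT PRIMARY KEY, name TEXT, size INTEGER,"
        " mtime REAL, sha256 TEXT)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_files_hash ON files (sha256)")
    for folder, _subfolders, names in os.walk(root):
        for name in names:
            full = os.path.join(folder, name)
            info = os.stat(full)
            conn.execute(
                "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?, ?)",
                (os.path.relpath(full, root), name, info.st_size,
                 info.st_mtime, file_sha256(full)),
            )
    conn.commit()
    return conn
```

Run once against the DVD while it is in the drive, the resulting file could then stand in for the disc in later comparisons.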

Of course some actions would not be possible such as fuzzy image searches and some of the options would not be applicable ... and I know it is a huge task to implement this. But I think the immense power this gives you would take DCP to the next level! I hope you agree!

Thanks again for considering it. I can't even begin to tell you how useful this would be for me. And I'm hoping others can see the value in it for their own use as well.
DigitalVolcano
Site Admin
Posts: 1717
Joined: Thu Jun 09, 2011 10:04 am

Re: Hash Database For Offline Duplicate Cleaning

Post by DigitalVolcano »

Thanks for the suggestion, and glad you like Duplicate Cleaner.

It's a good idea, being able to compare hashes "offline". As of v4.0 the hashes can be cached, so I guess being able to compare them wouldn't be a stretch - it would just need some careful design to make it useful (and safe).


I'm thinking you could have a method of adding cached paths to the 'Paths to scan' list, based on what has been cached in the past. (It could present you with cached folder trees for selection). The only problem currently is that the cache only holds hashes for files which have had a 'size' match in the past - so you'd need a method of harvesting complete hashes for an entire drive/folder for use later, outside of the normal scanning process.
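To illustrate why a cache built up during normal scans would be incomplete, here is a rough sketch of a size-first scan. It assumes, based only on the description above, that a file is hashed only when some other file shares its size; the function name and the returned dictionary are hypothetical and do not reflect Duplicate Cleaner's internals.

```python
import hashlib
import os
from collections import defaultdict


def scan_with_size_prefilter(paths):
    """Return {path: sha256} for only those files whose size matches another file."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    cache = {}
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # unique size: cannot be a duplicate, so it is never hashed
        for path in same_size:
            # Whole-file read for brevity; a real scanner would read in chunks.
            with open(path, "rb") as handle:
                cache[path] = hashlib.sha256(handle.read()).hexdigest()
    return cache
```

Any file with a unique size never reaches the cache, which is why a separate pass that hashes every file in a folder tree would be needed for offline use.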

Re: Hash Database For Offline Duplicate Cleaning

Post by Ryan »

Thank you for your reply. Yes, I noticed that DC now can create a cache of certain previously scanned files (I assume for purposes of speeding up future scans). I also thought about this in the context of the "offline" feature I am suggesting.

I think that what would be needed though is some new functionality to allow the user to select a folder tree and then create a named database of hashes for all the files in that tree.

If names, sizes, paths and dates are also included in the database, it could serve several purposes. One is better identification of the offline file in the event that there are two files on the offline source with the same hash (size is not needed for this purpose, but can be used for other purposes). Another is letting the user see which files on the offline source were matched when there are matches to the live files.

Regarding safety... I think you are right. That is so important with a duplicate cleaner... rule #1 is that you don't ever want any unintended data loss!

Certainly, at least in its initial implementation, you could have a one-way-only deletion system: you can only delete the online files if they match an offline hash, and you can't delete the offline files. Of course it would be up to the user to be sure the offline files actually exist if they want to retain those files. I used this system for many years and never had any problems.
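The one-way rule could be sketched roughly like this. It assumes a hypothetical offline database with a table "files" holding rel_path, size and sha256 columns; the function name is made up for illustration. The sketch only reports candidates and never deletes anything, so the user stays in control of the final step.

```python
import hashlib
import os
import sqlite3


def find_removable_live_files(live_root, offline_db):
    """List live files whose size and hash both appear in the offline database.

    The offline database is only read. Nothing is deleted here; the caller
    reviews the returned list before removing any live file.
    """
    matches = []
    for folder, _subfolders, names in os.walk(live_root):
        for name in names:
            live_path = os.path.join(folder, name)
            # Whole-file read for brevity; chunked hashing suits large files.
            with open(live_path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            rows = offline_db.execute(
                "SELECT rel_path FROM files WHERE sha256 = ? AND size = ?",
                (digest, os.path.getsize(live_path)),
            ).fetchall()
            if rows:
                # Keep the offline paths so the user can see what was matched.
                matches.append((live_path, [row[0] for row in rows]))
    return matches
```

Each result pairs a live file with the offline entries it matched, which also covers the case of two offline files sharing one hash.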

But you are so right that it has to be very carefully designed so that it is rock solid and can be trusted!

I've never seen a program that allows you to, say, generate an executable file that you can run on the offline source after running a scan to delete the duplicate offline files! That's just sort of a science fiction dream I've had... maybe for some future thought. Right now, the ability to delete only the online files based on matching the offline hash database is what I have in mind.

I'm not underestimating the challenge involved here...it's quite a task! But it can be done, and it's immensely powerful!

There was a program that did this wonderfully that is no longer available or in development. I know that it ran on Windows XP, but not sure if it will run on any OS past that. I have a copy of it somewhere. If you would be interested in taking a look, please let me know and I will try to locate it and send it to you (please tell me where to send it).

Thanks again for considering this! If I can do anything to help, please call on me!

Re: Hash Database For Offline Duplicate Cleaning

Post by Ryan »

Bump. :)

Just wanted to mention I'm still around and would still love to have this feature.

My initial attempt to get the only program I know of that can do this running on Windows 7 was not successful. And that program is out of development. The last Windows version I had it working on was Windows XP. So there is probably nothing out there that can do this on current Windows OSs. At least I have not been able to find anything after a pretty extensive search.

I dearly miss this feature. Once you use it, you won't be able to live without it.

Re: Hash Database For Offline Duplicate Cleaning

Post by Ryan »

What the heck! I hope once a year isn't too often to bump this. I'm still dearly missing this capability. There is simply no other way to accomplish the same thing with anything I know of that's currently out there. Would be an awesome feature! Hope you don't mind if I bring it up again. :)

Re: Hash Database For Offline Duplicate Cleaning

Post by DigitalVolcano »

You'll be pleased to hear that this feature is scheduled for Duplicate Cleaner 5:

 0000306: [Scan Engine] Create virtual folders/drives for offline scanning

I can't give you a date for this though!

Note:
Another similar request here:
http://www.digitalvolcano.co.uk/board/v ... tual#p6423

Re: Hash Database For Offline Duplicate Cleaning

Post by Ryan »

WOW! THANK YOU! That is fantastic! :o

I think this feature is a game changer in the Duplicate File management space. I don't know of any other program, currently still in development, that can do this. It is a tremendously powerful capability that has no good substitute.

I'm looking forward to trying it out! :D

Re: Hash Database For Offline Duplicate Cleaning

Post by Ryan »

Some additional thoughts on this ....

An important feature of the hash database system I have in mind is to have a separate and discrete database *file* for each named database collection.

Also, these discrete named database files should be importable and exportable to and from DCP. This is important since the databases should be portable to use at any physical location where the offline data resides.

For example:

You have an external hard drive. You create a hash database of this drive and name it "My External Hard Drive."

You have a DVD. You create a hash database of your DVD and name it "My DVD."

Each of these hash databases is in a separate, transportable file. So you have two files, one named "My External Hard Drive" and one named "My DVD."

You can export these files from DCP to be used or stored elsewhere, and you can also import any hash database files into DCP to be used for the duplicate checks.

When you import a named database file it becomes a "virtual path" of its own, just as if it were a real path on local storage.
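If the databases were plain SQLite files, importing one as a virtual path could be sketched with SQLite's ATTACH DATABASE statement. This is only an illustration under assumptions: each file is assumed to contain a hypothetical table "files" with rel_path and sha256 columns, and the function names and aliases are invented here.

```python
import sqlite3


def attach_virtual_paths(conn, databases):
    """Attach each hash database file under an alias, like loading a virtual path."""
    for alias, db_file in databases.items():
        if not alias.isidentifier():
            raise ValueError(f"alias must be a plain identifier: {alias!r}")
        # The file name is a bound parameter; the alias was validated above.
        conn.execute(f"ATTACH DATABASE ? AS {alias}", (db_file,))


def find_offline_matches(conn, aliases, sha256):
    """Return (alias, rel_path) for every attached database containing the hash."""
    found = []
    for alias in aliases:
        for (rel_path,) in conn.execute(
            f"SELECT rel_path FROM {alias}.files WHERE sha256 = ?", (sha256,)
        ):
            found.append((alias, rel_path))
    return found
```

Because each collection lives in its own file, exporting or importing one is just a file copy, and unloading a virtual path would be a matching DETACH DATABASE.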

Just to put in another plug for SQLite, I think it might be a good choice of database format if you could consider it, but whatever database format you choose, hopefully it will not be a proprietary format so the databases can be managed with other software if necessary. I think that's also important.

I just want to be sure we are on the same wavelength about this since I think it is critical to the feature I have in mind.

Re: Hash Database For Offline Duplicate Cleaning

Post by DigitalVolcano »

This was my thinking as well - separate hash databases.

Duplicate Cleaner has used SQLite since version 3.1, but with everything in the same DB. For version 5 the idea will be to split the databases (caches, settings and scans) up and generate a new one for each scan.
The Duplicate Cleaner database is currently encrypted.

Re: Hash Database For Offline Duplicate Cleaning

Post by Ryan »

Still nothing out there that does this that I know of.

Is this still in the plans? Any progress?

Thanks. :D