bug?: loading from CSV


Post by Emerson »

So, I ran a scan on a fileset of 1M files I've been wanting to de-dupe, and although it took four days, DC2 seemed to complete the scan and load up the list using ~1.2 GB of RAM. Not wanting to have to repeat the scan in the event DC locked up at any point, I exported the list to CSV.

I needed to do some other things on my machine besides run the dedupe at the time so I closed down DC thinking I could simply reload from CSV at a later time and complete what I had started.

The counter in the status bar quickly climbed up (to 545,000), and DC2 has been stuck that way for the past 15 or so hours. It's pegging one of my processor cores at ~90% (the other cores are unused), and memory usage is fluctuating around the 1 GB mark.

Why is it taking so long for DC2 to load a list of files? I can understand that DC2 might want to check that each filepath is valid, and that the filesize checks out... but other than those calculations, what in the world could be taking so long?

Thanks for any insight or help with the issue.

Post by Fool4UAnyway »

Perhaps this is some internal malfunction in allocating room for a number of items when reading from a file, as opposed to the direct scan, which seemed to have no problem with this large amount.

You may try to split your .csv file manually, if there are a number of groups of duplicates. See if you can get it to work with a part of the file that keeps the counter below 545,000.

Post by Emerson »

I have split the CSV; it doesn't "not work". It simply takes an inordinate amount of time to load... essentially just as long to load from CSV as it does to run a scan from scratch on a similar number of files. I'm curious to know what's going on under the hood that would lead to this.

Post by Fool4UAnyway »

How small did you split? Try 10000 lines, for instance.
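For anyone splitting by hand: a small, generic Python sketch (not part of DC; the chunk-naming scheme is made up) that cuts a CSV into fixed-size pieces while repeating the header row in each piece, so every piece stays a valid standalone CSV:

```python
import csv
import itertools

def split_csv(path, lines_per_chunk=10_000):
    """Split a CSV into numbered chunk files, repeating the header in each.

    Chunk files are written next to the source as <path>.part000.csv,
    <path>.part001.csv, and so on (naming is an arbitrary choice here).
    """
    with open(path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        for i in itertools.count():
            # Pull at most lines_per_chunk data rows from the reader.
            rows = list(itertools.islice(reader, lines_per_chunk))
            if not rows:
                break
            with open(f"{path}.part{i:03d}.csv", "w", newline="") as dst:
                writer = csv.writer(dst)
                writer.writerow(header)
                writer.writerows(rows)
```

A 545,000-line export split at 10,000 lines per chunk would come out as 55 files.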

Post by Emerson »

Fool4UAnyway, my point is not that it can't be worked around; that's what I've been doing. I've been chopping the CSV into small pieces that only take around an hour each to load, and dealing with it that way. My point is that it should *not* be taking that long to load a list from a CSV file. The whole point of being able to export a list of *already* processed files is to be able to import that list again *more* quickly. The fact that importing 100, 10,000, or 100,000 files takes just as long as *scanning* 100, 10,000, or 100,000 files respectively suggests to me there are some very unnecessary calculations going on in the background. I'd love to hear from DV what is going on under the hood when DC2 (or DC1, for that matter) tries to import a filelist from CSV.

** While I realize I can just chop stuff up, F4UA, I'm working with a fairly large dataset, and having to chop a file into 50+ pieces and run DC2 as many times is definitely not ideal.

Post by DV »

Do you know how many lines total the csv has?
The process just throws the lines into the table; it doesn't do anything else. The only possible stumbling point is that it runs the 'Refresh' command after loading. This checks that every file exists and removes deleted files and redundant groups. If your scan included a drive that is perhaps not connected anymore, the time taken for each check on the non-existent drive to produce an error could create the delay you are experiencing.
Perhaps making the refresh optional could speed this up.
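DC's actual code isn't shown in the thread, but a post-load refresh of the kind DV describes might look roughly like this (the group/path data shape and the function name are assumptions for illustration). The key cost is one existence check per file, which is cheap locally but can block for seconds per file on a disconnected network drive:

```python
import os

def refresh(groups):
    """Sketch of a post-load 'Refresh' pass (hypothetical, not DC's code):
    drop files that no longer exist, then drop groups left with fewer
    than two members, since a group of one is no longer a duplicate group.

    Assumed data shape: {group_id: [path, path, ...]}.
    """
    cleaned = {}
    for gid, paths in groups.items():
        # One stat() per file; on an unreachable drive each call may
        # have to time out before returning False.
        alive = [p for p in paths if os.path.exists(p)]
        if len(alive) >= 2:
            cleaned[gid] = alive
    return cleaned
```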

Post by Emerson »

If you could make the "refresh" command optional, I'd be more than happy to test it out and report back to let you know how it impacted the loading speed.

I assumed DC2 was checking to make sure the files it imported actually existed, but the process of loading files from CSV takes me about as long as it does to run a scan on a fileset of the same size (so for a CSV with 100,000 files, it takes about as long as scanning 100,000 files).

The drive and all the files that are in the CSV are intact and available, so whatever the issue is, it's not that.

I realize large file sets can cause problems for programs, but with something like DC2, large file sets are where it's most useful :P

Post by DV »

OK, I've tested this, and the refresh is slow (I will remove it, and add a warning in the message). The main problem is that it keeps refreshing the 'checked items' count for each line added. This isn't noticeable on smaller lists, but as the list grows bigger it steadily slows. I'll fix this - I hope to have an updated beta out on Weds.
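The pattern DV describes can be sketched like this (function names and internals are hypothetical, not DC's actual code): if the 'checked items' count is recomputed from scratch on every line added, total work grows quadratically with list size; updating the count once after the batch restores linear time.

```python
def load_slow(lines):
    """Sketch of the reported bug: recompute the count on every insert,
    so total work is roughly n + (n-1) + ... + 1, i.e. O(n^2)."""
    table = []
    for line in lines:
        table.append(line)
        checked_items = sum(1 for _ in table)  # full recount per line added
    return table

def load_fast(lines):
    """Same result, but the count is updated once after loading: O(n)."""
    table = []
    for line in lines:
        table.append(line)
    checked_items = len(table)  # single update after the batch insert
    return table
```

Both produce the same table; only the slow version does list-sized work per line, which matches the symptom of loading that "steadily slows" as the list grows.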

Post by DV »

A csv of 200,000 lines (with no refresh and the bug fixed) takes about 30 seconds to load.