Multi-threading bottleneck while processing image files

Timur Born
Posts: 7
Joined: Sun Nov 12, 2017 2:27 pm

Multi-threading bottleneck while processing image files

Post by Timur Born »

Hello,

I am currently testing DC Pro against various other such applications and like what it has to offer. Before this I used AllDup, but it lacked an option to only find duplicates inside the same sub-folders of a base folder structure, which is possible in DC.

Now I am running an Image mode scan of 15280 files to find photo similarities at the "Very close match (97%)" setting. During the "Calculating image metrics" phase, CPU load varies between 6.25% (1 logical core fully utilized) and 50% (16 logical cores partially utilized by VCOMP120.DLL!vcomp_atomic_div_r8). The 6% (1 core) bottleneck seems to happen mostly when new files are read. Or rather, one specific thread keeps maxing out a single core nearly all the time, and the presence or absence of the other threads likely depends on the current status of that thread.

2116 6.24 3.649.458.312 clr.dll!GetMetaDataPublicInterfaceFromInternal+0x5330

Furthermore, I noticed that DC's average CPU load drops to only about 18% while processing ORF (Olympus raw) and JPG files, but rises to well over 30% when NEF (Nikon raw) files are processed. I don't know yet whether the bottleneck thread drags down the former's average or whether the higher-resolution NEF files simply need much more processing per file.

Overall I wonder: would it be possible to spread the work of that single thread across multiple threads, to make better use of multi-core CPUs during that reading/decoding stage, or whatever the thread is doing?
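
To illustrate what I have in mind, here is a rough sketch (my guess only, I obviously don't know DC's internals; computeMetric below is just a stand-in for the per-file decode/measure work) of how a parallel-for over the file list could look with OpenMP, the same runtime the VCOMP threads come from:

[code]
// Hypothetical sketch only, not DC's actual code. If each file's metric is
// independent, an OpenMP parallel-for spreads the per-file work over all cores.
#include <filesystem>
#include <string>
#include <vector>

// Stand-in for the real per-file work (decode the image, compute a similarity
// metric); here it just returns the file size so the sketch compiles on its own.
static double computeMetric(const std::string& path)
{
    return static_cast<double>(std::filesystem::file_size(path));
}

std::vector<double> computeAllMetrics(const std::vector<std::string>& files)
{
    std::vector<double> results(files.size());

    // Dynamic scheduling: large NEF files and small JPGs take very different
    // amounts of time, so idle threads simply grab the next unprocessed file.
    #pragma omp parallel for schedule(dynamic)
    for (long long i = 0; i < static_cast<long long>(files.size()); ++i)
        results[i] = computeMetric(files[i]);

    return results;
}
[/code]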

Thanks and best regards!
Timur Born
Posts: 7
Joined: Sun Nov 12, 2017 2:27 pm

Re: Multi-threading bottleneck while processing image files

Post by Timur Born »

I let it run for a few more hours and noticed that when it came back to NEF files it averaged between 13 and 18%, so I have no idea what I saw earlier.

When it got to lots of JPG files, processing was limited to the single clr.dll thread, with only two other threads running well below 1% and the vcomp threads lying dormant.

Once the whole image metrics process was finished, the final comparison ran quickly. The whole run took roughly 5h40m and found 1813 duplicate files. I know that there are some JPG duplicates of raw files that are already present, but I suspect that some of these "duplicates" will be images that are merely similar to each other, with minor differences.

It was a test run after all, and I can simply get rid of the duplicate JPG files via a plain file-name search (within the same sub-dir), which is what led me to test DC as an AllDup alternative in the first place.
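
In case it is useful to anyone doing the same cleanup, here is a rough standalone sketch (my own, nothing to do with DC or AllDup internals) of that same-name-within-same-sub-dir idea: it lists JPG files whose base name matches an ORF/NEF raw file in the same directory.

[code]
// Hypothetical helper (not related to DC or AllDup): print JPG files whose
// base name matches an ORF/NEF raw file in the same directory.
#include <cctype>
#include <filesystem>
#include <iostream>
#include <set>
#include <string>
#include <vector>

namespace fs = std::filesystem;

static std::string lower(std::string s)
{
    for (char& c : s)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return s;
}

int main(int argc, char** argv)
{
    const fs::path root = argc > 1 ? argv[1] : ".";
    std::set<fs::path> rawStems;          // "<directory>/<stem>" of every raw file
    std::vector<fs::path> jpgs;

    for (const auto& entry : fs::recursive_directory_iterator(root)) {
        if (!entry.is_regular_file()) continue;
        const std::string ext = lower(entry.path().extension().string());
        if (ext == ".orf" || ext == ".nef")
            rawStems.insert(entry.path().parent_path() / entry.path().stem());
        else if (ext == ".jpg" || ext == ".jpeg")
            jpgs.push_back(entry.path());
    }

    for (const auto& jpg : jpgs)          // JPGs shadowed by a raw file next to them
        if (rawStems.count(jpg.parent_path() / jpg.stem()))
            std::cout << jpg.string() << '\n';
}
[/code]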
Timur Born
Posts: 7
Joined: Sun Nov 12, 2017 2:27 pm

Re: Multi-threading bottleneck while processing image files

Post by Timur Born »

I bought a license, removed the duplicate JPG files with the same name, and am now running an image search for exact copies, because the 97% threshold still matched too many different images.

What I just noticed is that the "Calculate hashed (image mode)" step is mostly single-threaded (with a second thread doing minor work at below 1% CPU load). On an SSD this would benefit a lot from being multi-threaded, since the single-core CPU load is clearly the bottleneck.
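
Just to sketch what I mean (again only a guess; hashFile below is a simple stand-in, not DC's actual hashing): a handful of worker threads pulling file paths from a shared counter would keep both the SSD and the CPU cores busy.

[code]
// Sketch only, not DC's implementation: hash files with a pool of worker
// threads pulling paths from a shared atomic counter, so the SSD always has
// several outstanding reads while all cores hash.
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <thread>
#include <vector>

// Simple FNV-1a over the file contents, standing in for whatever hash the
// real "image mode" step uses.
static std::uint64_t hashFile(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<char> buf(1 << 16);
    std::uint64_t h = 1469598103934665603ULL;
    while (in.read(buf.data(), static_cast<std::streamsize>(buf.size())) || in.gcount() > 0)
        for (std::streamsize i = 0; i < in.gcount(); ++i)
            h = (h ^ static_cast<unsigned char>(buf[static_cast<std::size_t>(i)])) * 1099511628211ULL;
    return h;
}

std::vector<std::uint64_t> hashAll(const std::vector<std::string>& files)
{
    std::vector<std::uint64_t> hashes(files.size());
    std::atomic<std::size_t> next{0};

    // Each worker grabs the next unhashed file until the list is exhausted.
    auto worker = [&] {
        for (std::size_t i; (i = next.fetch_add(1)) < files.size(); )
            hashes[i] = hashFile(files[i]);
    };

    const unsigned threads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back(worker);
    for (auto& th : pool)
        th.join();
    return hashes;
}
[/code]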
Timur Born
Posts: 7
Joined: Sun Nov 12, 2017 2:27 pm

Re: Multi-threading bottleneck while processing image files

Post by Timur Born »

DC has been running its image metrics step for the past four hours, averaging around 16% CPU load and often dropping down to a single thread. What I noticed is that it only works on one image at a time, so maybe this is where things could be sped up by (optionally) working on several images in parallel?
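
A minimal sketch of that "several images in flight" idea (my own illustration, not DC's code): launch the per-image work for a small batch asynchronously and then collect the results, instead of handling one image from start to finish at a time.

[code]
// Sketch only: process a batch of images concurrently with std::async.
#include <filesystem>
#include <future>
#include <string>
#include <vector>

// Stand-in for decoding an image and computing its metric (file size only,
// so the sketch stands alone).
static double computeMetric(const std::string& path)
{
    return static_cast<double>(std::filesystem::file_size(path));
}

std::vector<double> processBatch(const std::vector<std::string>& batch)
{
    // Kick off the work for the whole batch (e.g. 8-16 images) at once...
    std::vector<std::future<double>> inflight;
    for (const auto& path : batch)
        inflight.push_back(std::async(std::launch::async, computeMetric, path));

    // ...then wait for each result while the other images keep processing.
    std::vector<double> results;
    for (auto& f : inflight)
        results.push_back(f.get());
    return results;
}
[/code]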
DigitalVolcano
Site Admin
Posts: 1725
Joined: Thu Jun 09, 2011 10:04 am

Re: Multi-threading bottleneck while processing image files

Post by DigitalVolcano »

Thanks for the feedback. There is some parallel processing in DC, but it can always be improved. Some of the limitation comes from using external libraries for the RAW image support, and I find that disk I/O time can be a big bottleneck as well, even with multiple threads.
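
To illustrate the trade-off (a generic sketch only, not DC's actual code): one reader thread keeps disk access mostly sequential and feeds a small bounded queue, while a pool of workers does the CPU-heavy part, so I/O and processing overlap without many threads competing for the drive.

[code]
// Generic illustration of overlapping I/O with processing via a bounded queue.
#include <condition_variable>
#include <fstream>
#include <iterator>
#include <mutex>
#include <optional>
#include <queue>
#include <string>
#include <thread>
#include <vector>

struct Job { std::string path; std::vector<char> bytes; };

class BoundedQueue {
    std::queue<Job> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool closed_ = false;
    static constexpr std::size_t kMax = 8;     // caps memory spent on buffered files
public:
    void push(Job job) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return q_.size() < kMax; });
        q_.push(std::move(job));
        cv_.notify_all();
    }
    std::optional<Job> pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;
        Job job = std::move(q_.front());
        q_.pop();
        cv_.notify_all();                      // wake the reader if it waited for space
        return job;
    }
    void close() {
        std::lock_guard<std::mutex> lk(m_);
        closed_ = true;
        cv_.notify_all();
    }
};

void processFiles(const std::vector<std::string>& files, unsigned workers)
{
    BoundedQueue queue;

    std::thread reader([&] {                   // sequential reads feed the queue
        for (const auto& path : files) {
            std::ifstream in(path, std::ios::binary);
            std::vector<char> bytes((std::istreambuf_iterator<char>(in)),
                                    std::istreambuf_iterator<char>());
            queue.push(Job{path, std::move(bytes)});
        }
        queue.close();
    });

    std::vector<std::thread> pool;
    for (unsigned i = 0; i < workers; ++i)
        pool.emplace_back([&] {
            while (auto job = queue.pop())
                (void)job->bytes.size();       // decode / hash job->bytes here (CPU-bound)
        });

    reader.join();
    for (auto& t : pool)
        t.join();
}
[/code]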