
Multi-threading bottleneck while processing image files

Posted: Sun Nov 12, 2017 2:45 pm
by Timur Born
Hello,

I am currently testing DC Pro against various other applications of this kind and like what it has to offer. Previously I used Alldup, but it lacks an option to find duplicates only within the same sub-folder of a base folder structure, which is possible in DC.

Now I am running an Image mode scan of 15280 files to find photo similarities at "Very close match (97%)". During the "Calculating image metrics" phase, CPU load varies between 6.25% (1 logical core fully utilized) and 50% (16 logical cores partially utilized by VCOMP120.DLL!vcomp_atomic_div_r8). The 6.25% (1 core) bottleneck seems to occur mostly when new files are read. Or rather, one specific thread keeps maxing out a single core nearly all the time, and the presence or absence of the other threads likely depends on the current state of that thread.

2116 6.24 3.649.458.312 clr.dll!GetMetaDataPublicInterfaceFromInternal+0x5330

Furthermore, I noticed that DC's average CPU load drops to only about 18% while processing ORF (Olympus raw) and JPG files, but rises to well over 30% while processing NEF (Nikon raw) files. I don't know yet whether the bottleneck thread drags down the average for the former or whether the higher-resolution NEF files simply need much more processing per file.

Overall I wonder: would it be possible to spread the work of that single thread across multiple threads, to make better use of multi-core CPUs during that phase of reading, decoding, or whatever the thread is doing?

Thanks and best regards!

Re: Multi-threading bottleneck while processing image files

Posted: Sun Nov 12, 2017 6:17 pm
by Timur Born
I let it run for a few more hours and noticed that when it came back to NEF files it averaged between 13% and 18%, so I have no idea what I saw earlier.

When it came to a large batch of JPG files, processing was limited to the single clr.dll thread, with only two other threads running well below 1% and the vcomp threads lying dormant.

Once the whole image metrics process was finished, the final comparison process ran quickly. The whole run took roughly 5h40m and found 1813 duplicate files. I know that there are some duplicate JPG files of raw files that are already present, but I suspect that some of these duplicates are images that are similar to each other yet still have minor differences.

It was a test run after all, and I can get rid of the duplicate JPG files with a simple file name search (within the same sub-dir), which is how I came to test DC as an Alldup alternative in the first place.

Re: Multi-threading bottleneck while processing image files

Posted: Sun Nov 12, 2017 7:54 pm
by Timur Born
I bought a license, removed the duplicate JPG files with the same name, and am now running an image search for exact copies, because 97% still matched too many different images.

What I just noticed is that the "Calculate hashes (image mode)" process is mostly single-threaded (with a second thread doing minor work below 1% CPU load). On an SSD this would benefit a lot from being multi-threaded, since the single-core CPU load is clearly the bottleneck.
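
As a rough illustration of the idea (this is not DC's code, which is a .NET application whose internals are not shown here), hashing several files at once could look like the Python sketch below. The function names, the choice of SHA-256, the 1 MiB chunk size, and the default of 8 workers are all assumptions made for the example. Threads are enough in this case because CPython's hashlib releases its global lock while hashing larger buffers, so several cores can work at the same time.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_files_parallel(paths, workers=8):
    """Hash many files concurrently; returns {path: hex digest}."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so zip pairs each path with its digest.
        return dict(zip(paths, pool.map(file_sha256, paths)))


def find_exact_duplicates(paths, workers=8):
    """Group paths by digest and keep only groups with more than one file."""
    groups = {}
    for path, digest in hash_files_parallel(paths, workers).items():
        groups.setdefault(digest, []).append(path)
    return [group for group in groups.values() if len(group) > 1]
```

Calling find_exact_duplicates() with a list of file paths returns a list of groups, each group holding the paths whose contents are byte-for-byte identical. On a spinning disk a high worker count may well hurt because of seeking, so the number of workers would ideally be a user option.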

Re: Multi-threading bottleneck while processing image files

Posted: Mon Nov 13, 2017 11:25 am
by Timur Born
DC has been running its image metrics step for the past four hours, averaging around 16% CPU load and often dropping down to a single thread. What I noticed is that it only works on one image at a time, so maybe this is where things could be sped up by (optionally) working on several images in parallel?

Re: Multi-threading bottleneck while processing image files

Posted: Tue Nov 21, 2017 12:12 pm
by DigitalVolcano
Thanks for the feedback. There is some parallel processing in DC, but it can always be improved. Part of the limitation comes from using external libraries for the RAW image support, and I find that disk I/O time can be a big bottleneck as well, even with multiple threads.