Similar Content Percentage question

The best solution for finding and removing duplicate files.
bobsage
Posts: 9
Joined: Tue Jan 10, 2017 4:49 am

Similar Content Percentage question

Post by bobsage »

I just want to clarify how this option works before I do a large scan on my PC with it.

Say I have two copies of the same video: one is 1,798,317,337 bytes and the other is 1,796,629,143 bytes.

Both are essentially the same video, with maybe 1 second added to one of them, hence the size difference. If I have Similar Content set to 99%, would this qualify as a match?

I'm not sure if this match goes by byte similarity, actual content (i.e. it checks whether the videos themselves are 99% similar), or something else.

Please clarify if possible.
DigitalVolcano
Site Admin
Posts: 1717
Joined: Thu Jun 09, 2011 10:04 am

Re: Similar Content Percentage question

Post by DigitalVolcano »

It is byte similarity. Whether you get a match at 99% will depend on the encoding similarity, headers, metadata, etc.
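
To make "byte similarity" concrete, here is a minimal sketch of one possible metric: the share of positions where two files hold the same byte, measured against the longer file. This is an assumption for illustration only; the thread does not say how Duplicate Cleaner actually computes its percentage, and the names `byte_similarity` and `size_bound` are made up here.

```python
def byte_similarity(a: bytes, b: bytes) -> float:
    """Share of positions holding the same byte, relative to the longer input."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty inputs are trivially identical
    matches = sum(x == y for x, y in zip(a, b))
    return matches / longest


def size_bound(size_a: int, size_b: int) -> float:
    """Upper limit on byte_similarity implied by the file sizes alone."""
    largest = max(size_a, size_b)
    if largest == 0:
        return 1.0
    return min(size_a, size_b) / largest


# File sizes quoted in the original question.
bound = size_bound(1_798_317_337, 1_796_629_143)
print(f"size-only upper bound: {bound:.2%}")
```

Under this assumed metric the two sizes in the question differ by 1,688,194 bytes, so size alone caps the score at roughly 99.9% and does not rule out a 99% match. Whether the bytes actually line up is a separate matter that depends on the encoding, headers and metadata.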
therube
Posts: 614
Joined: Tue Jun 28, 2011 4:38 pm

Re: Similar Content Percentage question

Post by therube »

To note...


I've got some videos where, when I muxed the audio & video, I'd experiment with an offset, say a 100ms delay in the audio.
And while the output file sizes may be identical, & while the video & audio portions are identical, the delay makes the files different; substantially so, as far as a file comparison program is concerned.

So for something like that, DC, Similar Content, will not find them as "duplicates", though they "are".


In other instances, I encoded the same content, only using different versions of ffmpeg.
ffmpeg writes the version number it used to encode the file into the output file.
So while I may have used the same encoder options, & while the output files end up being the same size (along with the content), the files themselves are different, because of the ffmpeg "header" (if you will) version number.

In this case, DC, Similar Content, should find the files to be "duplicates" - because they are, essentially.
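
The two cases above can be mimicked with a toy model. The sketch below assumes a plain position-by-position byte comparison, which may not be what DC really does, and uses random bytes in place of real video data. The header strings are made-up placeholders, not real ffmpeg output, and inserting four padding bytes is only a stand-in for what an audio offset does to a muxed file.

```python
import random


def byte_similarity(a: bytes, b: bytes) -> float:
    """Share of positions holding the same byte, relative to the longer input."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / longest


rng = random.Random(0)  # fixed seed so the run is repeatable
payload = bytes(rng.getrandbits(8) for _ in range(10_000))

# Case 1: same content, but a few bytes inserted early on shift everything
# after them (a crude stand-in for the audio-offset remux).
original = payload
shifted = payload[:100] + b"\x00" * 4 + payload[100:-4]  # same total size

# Case 2: same content, only a version string in the header differs.
# Both header strings are hypothetical placeholders.
file_v1 = b"encoder=tool 1.0.0;" + payload
file_v2 = b"encoder=tool 1.2.3;" + payload

shift_score = byte_similarity(original, shifted)
header_score = byte_similarity(file_v1, file_v2)
print(f"shifted: {shift_score:.1%}, header-only: {header_score:.1%}")
```

In this toy setup the shifted pair scores only a few percent even though the sizes are equal, while the header-only pair stays well above 99%, which is consistent with the behaviour described in both cases.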