A Problem with Non-English Text Comparison?

avinatbezeq
Posts: 7
Joined: Tue Mar 26, 2019 9:42 pm

A Problem with Non-English Text Comparison?

Post by avinatbezeq »

The following two text files, img-001200.txt and img-001300.txt, failed to match even at a 50% similarity threshold (Regular mode, Similar content). IMHO they are far more than 50% similar. Could it be that DCP has an issue with Hebrew text files?

Screen capture: https://imgur.com/HuKrzDW

1200.txt: https://file.io/g14DXz

1300.txt: https://file.io/jTnUaP
therube
Posts: 614
Joined: Tue Jun 28, 2011 4:38 pm

Re: A Problem with Non-English Text Comparison?

Post by therube »

(1200.txt is 404.)

I'll note that Notepad has several ways to save "Unicode": Unicode, Unicode big endian, and UTF-8.
On my end, saving each way, big endian & UTF-8 are exactly the same (binary).
But "plain" Unicode is different - when compared using a binary comparison method - even though the files are the same size & even though they look the same (when opened in notepad).

(I guess that makes sense, given that ordering is different.)
So while "Unicode" starts off with,
$FFFE$ $5F00$
"Unicode big endian" starts off with,
$FEFF$ $005F$
flipped.

That would make a binary compare totally different.
(A "textual" compare will match. [Though that also depends on just how the text compare goes about its business.])
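
For anyone who wants to check this outside Notepad, here is a minimal Python sketch of the point above. It assumes Notepad's "Unicode" means UTF-16 little endian with a BOM and "Unicode big endian" means UTF-16 big endian with a BOM. The sample string and the helper names encode_like_notepad and decode_with_bom are made up for illustration. For reference, FE FF is the UTF-16 big-endian BOM, while a UTF-8 BOM is EF BB BF.

```python
import codecs


def encode_like_notepad(text):
    """Encode text three ways, each with its byte-order mark prepended."""
    return {
        "Unicode (UTF-16 LE)": codecs.BOM_UTF16_LE + text.encode("utf-16-le"),
        "Unicode big endian (UTF-16 BE)": codecs.BOM_UTF16_BE + text.encode("utf-16-be"),
        "UTF-8": codecs.BOM_UTF8 + text.encode("utf-8"),
    }


def decode_with_bom(data):
    """Decode bytes by sniffing the BOM; fall back to UTF-8 if there is none.

    UTF-32 is deliberately not handled in this sketch.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    return data.decode("utf-8")


if __name__ == "__main__":
    # "_" is U+005F, followed by a short Hebrew word (hypothetical sample).
    variants = encode_like_notepad("_שלום")
    for name, raw in variants.items():
        print(f"{name:32} {raw[:4].hex(' ')}")
    # Unicode (UTF-16 LE)              ff fe 5f 00
    # Unicode big endian (UTF-16 BE)   fe ff 00 5f
    # UTF-8                            ef bb bf 5f

    raws = list(variants.values())
    print(len(set(raws)) == 1)                           # False: bytes differ
    print(len({decode_with_bom(r) for r in raws}) == 1)  # True: same text
```

So a byte-for-byte compare sees three different files, while a compare that decodes first sees one and the same text.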


(How DC goes about things, I wouldn't know.)
avinatbezeq
Posts: 7
Joined: Tue Mar 26, 2019 9:42 pm

Re: A Problem with Non-English Text Comparison?

Post by avinatbezeq »

Thank you for your answer. AFAIK both documents were encoded as UTF-8 by Google while converting a Google-Document to a text file.

Please see here the UTF-8 difference comparison as done by Total Commander:

And here is the Binary difference comparison done by Total Commander:
therube
Posts: 614
Joined: Tue Jun 28, 2011 4:38 pm

Re: A Problem with Non-English Text Comparison?

Post by therube »

(Could you correct the link to 1200.txt.)
therube
Posts: 614
Joined: Tue Jun 28, 2011 4:38 pm

Re: A Problem with Non-English Text Comparison?

Post by therube »

On my end, 54% was the highest threshold I could set that still found the most duplicates.

Altap Salamander mostly fails with Unicode (display), though it is still able to compare successfully, including different encodings (UTF-8 vs Unicode).

https://i.postimg.cc/bykj8Lz9/Hebrew-te ... mander.png
(1. text mode, 2. binary)


It would appear that DC is using a binary algorithm for its Similar Content (as it does not find "the same" UTF-8 & Unicode files as "duplicates").
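
To illustrate why a byte-level similarity measure would miss "the same" text in two encodings, here is a rough Python sketch. How DC actually computes Similar Content is not known here; difflib.SequenceMatcher is only a stand-in for "some similarity ratio", and the sample text and the similarity helper are made up.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0..1 similarity ratio for two sequences (str or bytes)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


if __name__ == "__main__":
    text = "שלום עולם"                # hypothetical sample text
    as_utf8 = text.encode("utf-8")    # 2 bytes per Hebrew letter, no BOM
    as_utf16 = text.encode("utf-16")  # BOM + 2 bytes per character

    # Compared as decoded text, the two files are identical.
    print(similarity(as_utf8.decode("utf-8"), as_utf16.decode("utf-16")))  # 1.0

    # Compared as raw bytes, the ratio drops well below 1.0 even though
    # the text is the same, because the byte sequences differ.
    print(similarity(as_utf8, as_utf16))
```

If DC works on raw bytes in a similar way, that would be consistent with it not pairing the UTF-8 and Unicode copies, but that is a guess.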


When I bumped the threshold to 55%, I lost one of my files, even though the lost file was an exact duplicate of one of the other two files that did remain listed as "duplicated"?
avinatbezeq
Posts: 7
Joined: Tue Mar 26, 2019 9:42 pm

Re: A Problem with Non-English Text Comparison?

Post by avinatbezeq »

Well, as you can see in Beyond Compare 4's comparison result, there are only four one-character differences between the two files!

Image
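
A count like the one above can be cross-checked with a few lines of Python. This is only a sketch: it compares decoded text with difflib.SequenceMatcher, which is not necessarily what Beyond Compare or DC does, and the two strings below are made-up stand-ins for the real files (which would be read with open(path, encoding="utf-8")).

```python
from difflib import SequenceMatcher


def char_differences(a, b):
    """Return (number of differing spans, similarity ratio) for two strings."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    diffs = [op for op in matcher.get_opcodes() if op[0] != "equal"]
    return len(diffs), matcher.ratio()


if __name__ == "__main__":
    # Hypothetical contents that differ in a single character.
    first = "img-001200 שלום עולם"
    second = "img-001300 שלום עולם"
    spans, ratio = char_differences(first, second)
    print(spans)          # 1
    print(f"{ratio:.0%}")
```

Two texts that differ in only a handful of characters should score far above 50% on a character-level measure like this one.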