A Problem with Non-English Text Comparison?
Posted: Sun Sep 22, 2019 10:30 pm
by avinatbezeq
The following two text files, img-001200.txt and img-001300.txt, failed to match even at 50% similarity (Regular mode, similar content). IMHO they are much more than 50% similar. Could it be that DCP has an issue with Hebrew text files?
Screen capture:
https://imgur.com/HuKrzDW
1200.txt:
https://file.io/g14DXz
1300.txt:
https://file.io/jTnUaP
Re: A Problem with Non-English Text Comparison?
Posted: Mon Sep 23, 2019 12:07 pm
by therube
(1200.txt is 404.)
I'll note that Notepad has various ways to save "Unicode": Unicode, Unicode big endian, & UTF-8.
On my end, saving each way, big endian & UTF-8 are exactly the same (binary).
But "plain" Unicode is different - when compared using a binary comparison method - even though the files are the same size & even though they look the same (when opened in notepad).
(I guess that makes sense, given that ordering is different.)
So while "Unicode" starts off with,
$FFFE$ $5F00$
UTF-8 starts off with,
$FEFF$ $005F$
flipped.
That would make a binary compare totally different.
(A "textual" compare will match. [Now that also depends on just how the text compare goes about its business.])
(How DC goes about things, I wouldn't know.)
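To illustrate the byte-order point above, here is a minimal Python sketch. The sample string is made up (an underscore followed by a Hebrew word), not the content of the files in this thread, and `decode_with_bom` is a hypothetical helper that only knows the three Notepad encodings mentioned. For reference, the byte order marks are FF FE for little-endian UTF-16 (Notepad's "Unicode"), FE FF for big-endian UTF-16, and EF BB BF for UTF-8.

```python
import codecs

# Hypothetical sample: "_" followed by the Hebrew word "shalom".
sample = "_\u05e9\u05dc\u05d5\u05dd"

# The same text saved three ways, each with its byte order mark (BOM).
utf16_le = codecs.BOM_UTF16_LE + sample.encode("utf-16-le")  # Notepad "Unicode"
utf16_be = codecs.BOM_UTF16_BE + sample.encode("utf-16-be")  # "Unicode big endian"
utf8_bom = codecs.BOM_UTF8 + sample.encode("utf-8")          # UTF-8 with BOM


def decode_with_bom(data: bytes) -> str:
    """Decode bytes by sniffing the BOM (UTF-32 is deliberately not handled)."""
    known = (
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    )
    for bom, encoding in known:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    return data.decode("utf-8")  # assumption: no BOM means plain UTF-8


# First four bytes of each file, as hex.
print(utf16_le[:4].hex(), utf16_be[:4].hex(), utf8_bom[:4].hex())

# Binary compare: the two UTF-16 files have the same size but different bytes.
print(len(utf16_le) == len(utf16_be), utf16_le == utf16_be)

# Textual compare: all three decode to the same string.
print(decode_with_bom(utf16_le) == decode_with_bom(utf16_be) == decode_with_bom(utf8_bom))
```

The two UTF-16 variants come out the same length with every byte pair swapped, which is why they look identical in Notepad yet fail a binary compare.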
Re: A Problem with Non-English Text Comparison?
Posted: Mon Sep 23, 2019 1:24 pm
by avinatbezeq
Thank you for your answer. AFAIK both documents were encoded as UTF-8 by Google while converting a Google-Document to a text file.
Please see here the UTF-8 difference comparison as done by Total Commander:
And here is the binary difference comparison done by Total Commander:
Re: A Problem with Non-English Text Comparison?
Posted: Mon Sep 23, 2019 6:58 pm
by therube
(Could you correct the link to 1200.txt.)
Re: A Problem with Non-English Text Comparison?
Posted: Tue Sep 24, 2019 12:45 pm
by avinatbezeq
Re: A Problem with Non-English Text Comparison?
Posted: Tue Sep 24, 2019 2:26 pm
by therube
On my end, 54% was the highest % that I could use that would find the most duplicates.
Altap Salamander mostly fails with Unicode (display), though it is still able to compare successfully, including different encodings (UTF-8 vs Unicode).
https://i.postimg.cc/bykj8Lz9/Hebrew-te ... mander.png
(1. text mode, 2. binary)
It would appear that DC is using a binary algorithm for its Similar Content (as it does not find "the same" UTF-8 & Unicode files as "duplicates").
When I bumped % to 55, I lost one of my files - even though the file lost was an exact duplicate of 1 of the other 2 files that did remain as "duplicated"?
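As a rough illustration of why a byte-level similarity score can differ from a text-level one, here is a small Python sketch. It is not DC's algorithm (how DC scores Similar Content is not known here); `difflib.SequenceMatcher` is only a stand-in, and the sample string and the `similarity` helper are made up for the example.

```python
from difflib import SequenceMatcher


def similarity(a, b) -> float:
    """Similarity ratio in [0, 1]; works on str and on bytes alike."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


# Hypothetical sample: two Hebrew words separated by a space.
sample = "\u05e9\u05dc\u05d5\u05dd \u05e2\u05d5\u05dc\u05dd"

# The same text stored in two different encodings.
as_utf8 = sample.encode("utf-8")
as_utf16 = sample.encode("utf-16-le")

# Comparing the raw bytes: the encodings differ, so the score drops.
byte_similarity = similarity(as_utf8, as_utf16)

# Comparing the decoded text: the content is identical.
text_similarity = similarity(as_utf8.decode("utf-8"), as_utf16.decode("utf-16-le"))

print(f"bytes: {byte_similarity:.0%}  text: {text_similarity:.0%}")
```

A tool that scores raw bytes would rate these two files as only partly similar, while one that decodes first would rate them as identical.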
Re: A Problem with Non-English Text Comparison?
Posted: Tue Sep 24, 2019 4:56 pm
by avinatbezeq
Well, as you can see in Beyond Compare 4's comparison result, there are only four one-character differences between the two files!
