DigitalVolcano Software Support

Posted: **Sun Sep 22, 2019 10:30 pm**

The following two text files, img-001200.txt and img-001300.txt, failed to match even at 50% similarity (Regular mode, Similar Content). IMHO they are much more than 50% similar. Could it be that DCP has an issue with Hebrew text files?

Screen capture: https://imgur.com/HuKrzDW

1200.txt: https://file.io/g14DXz

1300.txt: https://file.io/jTnUaP

Posted: **Mon Sep 23, 2019 12:07 pm**

(1200.txt is 404.)

I'll note that Notepad has several ways to save "Unicode": Unicode, Unicode big endian, and UTF-8.
On my end, saving each way, big endian and UTF-8 come out exactly the same (binary).
But "plain" Unicode is different when compared using a binary comparison method, even though the files are the same size and look the same when opened in Notepad.

(I guess that makes sense, given that ordering is different.)
So while "Unicode" starts off with,
$FFFE$ $5F00$
UTF-8 starts off with,
$FEFF$ $005F$
flipped.

That would make a binary compare totally different.
(A "textual" compare will match. [Now that also depends on just how the text compare goes about its business.])
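
For anyone who wants to check the byte order marks themselves, here is a minimal Python sketch. It assumes Notepad's "Unicode" means UTF-16 little endian with a BOM and "Unicode big endian" means UTF-16 big endian with a BOM; the sample text and the helper name `decode_with_bom` are made up for illustration. In this sketch, UTF-8 with a BOM starts with EF BB BF, so results may differ from what was observed above.

```python
import codecs

# Sample text (an assumption); it starts with "_" (0x5F) to mirror the bytes quoted above.
text = "_שלום"

# Notepad's three "Unicode" save options, reproduced here with explicit BOMs.
encoded = {
    "Unicode (UTF-16 LE)": codecs.BOM_UTF16_LE + text.encode("utf-16-le"),
    "Unicode big endian (UTF-16 BE)": codecs.BOM_UTF16_BE + text.encode("utf-16-be"),
    "UTF-8 (with BOM)": codecs.BOM_UTF8 + text.encode("utf-8"),
}

# Print the first four bytes of each variant.
for name, data in encoded.items():
    print(f"{name:32} {data[:4].hex(' ').upper()}")
# Unicode (UTF-16 LE)              FF FE 5F 00
# Unicode big endian (UTF-16 BE)   FE FF 00 5F
# UTF-8 (with BOM)                 EF BB BF 5F


def decode_with_bom(data: bytes) -> str:
    """Hypothetical helper: pick the decoder from the BOM, fall back to UTF-8."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    return data.decode("utf-8")


# A binary compare sees three different files; a textual compare sees one text.
binary_all_differ = len(set(encoded.values())) == len(encoded)
text_all_equal = all(decode_with_bom(d) == text for d in encoded.values())
print(binary_all_differ, text_all_equal)  # True True
```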

(How DC goes about things, I wouldn't know.)

Posted: **Mon Sep 23, 2019 1:24 pm**

Thank you for your answer. AFAIK both documents were encoded as UTF-8 by Google while converting a Google-Document to a text file.

Please see here the UTF-8 difference comparison as done by Total Commander:

And here is the Binary difference comparison done by Total Commander:

Posted: **Mon Sep 23, 2019 6:58 pm**

(Could you correct the link to 1200.txt?)

Posted: **Tue Sep 24, 2019 12:45 pm**

Now I understand your "1200 is 404" comment.

Here are the links:

1200.txt (https://drive.google.com/open?id=1Z_2TieNxO78mWCVuNd3D4opBGXKuGYJP)

1300.txt (https://drive.google.com/open?id=1L-jOLtRENp3gzfIh8KmxtVyxQy6unIt4)

Thank you!

Posted: **Tue Sep 24, 2019 2:26 pm**

On my end, 54% was the highest setting I could use that would still find the most duplicates.

Altap Salamander mostly fails with Unicode (display), though it is still able to compare successfully, including different encodings (UTF-8 vs Unicode).

https://i.postimg.cc/bykj8Lz9/Hebrew-te ... mander.png
(1. text mode, 2. binary)

It would appear that DC is using a binary algorithm for its Similar Content (as it does not find "the same" UTF-8 & Unicode files as "duplicates").
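
To illustrate why a byte-level similarity score can drop for the "same" text, here is a rough sketch using Python's `difflib.SequenceMatcher`. This is not DC's algorithm (how DC computes Similar Content is unknown here); the sample text and the `similarity` helper are assumptions for illustration only.

```python
from difflib import SequenceMatcher


def similarity(a, b) -> float:
    """Hypothetical helper: similarity as a percentage, for str or bytes input."""
    return 100.0 * SequenceMatcher(None, a, b, autojunk=False).ratio()


# Made-up sample text; any Hebrew text would do.
text = "שלום עולם, זהו קובץ בדיקה.\n" * 3

# The same text saved in two encodings (no BOM, to keep the sketch simple).
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-le")

# Comparing raw bytes: the encodings lay the characters out differently.
byte_similarity = similarity(as_utf8, as_utf16)

# Comparing decoded text: both files hold the identical string.
text_similarity = similarity(as_utf8.decode("utf-8"), as_utf16.decode("utf-16-le"))

print(f"byte-level: {byte_similarity:.1f}%")
print(f"text-level: {text_similarity:.1f}%")  # text-level: 100.0%
```

A tool that scores the raw bytes would rate these two files well below 100%, while one that decodes first would call them identical.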

When I bumped the percentage to 55, I lost one of my files, even though the lost file was an exact duplicate of one of the other two files that did remain listed as "duplicated"?

Posted: **Tue Sep 24, 2019 4:56 pm**

Well, as you can see in Beyond Compare 4's comparison result, there are only four one-character differences between the two files!

A Problem with Non-English Text Comparison?