
Regex replace results in strange characters

Posted: Sat Nov 08, 2008 4:29 pm
by Juergen
I have the following situation I cannot solve. I have:
Line1












Line2

I want to end up with:
Line1

Line2
That is, I want to collapse 2 or more empty lines down to just 1.

I tried this regular expression:
(\r\n *){3,}
and replace with:
\r\n\n
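
A minimal sketch of the same replacement, using Python's re module as a stand-in for TextCrawler's regex engine (an assumption; the two may differ in details). The function name collapse_blank_lines is made up for illustration. It shows that a replacement of \r\n\r\n keeps every line break as a CR LF pair, while \r\n\n leaves a bare LF behind, which is one possible source of the rectangles described further down.

```python
import re

# Pattern from the post: three or more CRLFs, each optionally followed by spaces.
BLANK_RUN = re.compile(r"(\r\n *){3,}")


def collapse_blank_lines(text: str) -> str:
    """Collapse each run of 2+ empty lines to one empty line, using CRLF only."""
    return BLANK_RUN.sub("\r\n\r\n", text)


# Line1, twelve empty lines, Line2, all with Windows line endings.
sample = "Line1\r\n" + "\r\n" * 12 + "Line2"
fixed = collapse_blank_lines(sample)
print(repr(fixed))  # 'Line1\r\n\r\nLine2'

# For comparison, the replacement from the post leaves a bare LF behind.
print(repr(BLANK_RUN.sub("\r\n\n", sample)))  # 'Line1\r\n\nLine2'
```

When trying this on a real file in Python, opening it with newline="" keeps the line endings exactly as they are on disk instead of translating them.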

In the Regular Expression Tester it looks as expected. When I apply it to a file, it looks OK in the TextCrawler preview window. Opening the file in 'Word' also looks as expected. However, opening it in an editor like "(Windows)Notepad" (or TEDNotepad) results in:
Line1

Line2

I investigated further with different text editors (PSPad, KDiff3, WinMerge, Crimson Editor, Plato3, TextPad, RJ TextEd, UnicEdit, WordPad), which all show it correctly. So I have to represent graphically what happens.
With '(Windows)Notepad' (or TEDNotepad) it looks like:
Line1[**][*]
Line2
Where [**] represents two rectangles, which however appear to be only one character, and the [*] represents a rectangle, but only one character. Probably [\r\n] and [\n] from the regex expression.

Obviously I want it to be looking "right" in all cases. Can you help me?

Juergen

P.S. And if this is confusing, since this forum window shows it correct as well, I can send screen dumps as pdf files, if you tell me how.

Posted: Sun Nov 09, 2008 3:00 pm
by DV
This seems to work OK for me - looks good in normal Notepad too. Not sure about having two \n's in the replace expression though.
Perhaps the file you are working on has some Unicode in it? TC doesn't handle Unicode text files properly.


Posted: Sun Nov 09, 2008 9:18 pm
by Juergen
The post was drafted in Notepad, and copy/pasted into the post. When copy/pasting the post back out into Notepad it looks OK for me also. Apparently the copy/paste fixes it.
I can send a pdf and the txt file if you send me an email address where I can attach files.
(The two \n are remains from trying "this, that and something else")

Posted: Mon Nov 10, 2008 8:17 am
by DV
Sure - send to software (at) digitalvolcano.co.uk.



Posted: Fri Nov 14, 2008 9:48 pm
by Juergen
I was sick for a few days so could not get back immediately. To narrow down the problem I made new, minimal files and had trouble reproducing it. Investigating that further, going all the way down to the hex code, I found the solution.

And, for the curious-minded, Notepad is fussy about how a new line is coded (replaced via Regex). <CR><NL> (\r\n) is OK, <NL><CR> (\n\r) is not. Who would have thought, just like a manual typewriter.
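
For anyone who wants to check a file without reading the hex by eye, here is a rough Python sketch that counts each kind of line break in raw bytes. The function name count_line_endings and the sample bytes are made up for illustration; to check a real file, read it in binary mode so nothing gets translated on the way in.

```python
import re
from collections import Counter

# Try CRLF first so that a proper pair is never counted as two stray characters.
LINE_BREAK = re.compile(rb"\r\n|\r|\n")
NAMES = {b"\r\n": "CRLF", b"\r": "lone CR", b"\n": "lone LF"}


def count_line_endings(data: bytes) -> Counter:
    """Count CRLF pairs, stray CRs and stray LFs in raw file bytes."""
    return Counter(NAMES[m.group()] for m in LINE_BREAK.finditer(data))


# Hypothetical sample: a reversed <NL><CR> pair, then a normal <CR><NL>.
sample = b"Line1\n\rLine2\r\n"
counts = count_line_endings(sample)
print(dict(counts))  # {'lone LF': 1, 'lone CR': 1, 'CRLF': 1}
```

Any non-zero count for "lone CR" or "lone LF" would point to the kind of line break that Notepad does not show as a new line.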

During copy/paste other things happen.
Having this in hex:
41 0a 0D 0A 0d 42 0D 0A 0D 0A 43
changes to:
41 0d 0a 0D 0A 0d 0a 42 0D 0A 0D 0A 43

So apparently a single 0a or 0d is completed to 0D 0A. In this example that happens twice, changing two returns to three.
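
The conversion seen in the hex dump can be imitated with a short Python sketch. This is only a guess at the rule (every stray 0a or 0d becomes 0D 0A, existing pairs stay as they are), not a description of what the clipboard really does, and the function name normalize_to_crlf is made up for illustration.

```python
import re


def normalize_to_crlf(data: bytes) -> bytes:
    """Turn every CRLF, stray CR or stray LF into exactly one CRLF."""
    # CRLF comes first in the alternation so an existing pair is matched whole.
    return re.sub(rb"\r\n|\r|\n", b"\r\n", data)


# The bytes from the hex dump above, before copy/paste.
before = bytes.fromhex("41 0a 0D 0A 0d 42 0D 0A 0D 0A 43")
after = normalize_to_crlf(before)
print(after.hex(" "))  # 41 0d 0a 0d 0a 0d 0a 42 0d 0a 0d 0a 43
```

Running the same function over a file that TextCrawler has already changed should give a result Notepad displays correctly, since only CR LF pairs remain.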