Dot Matches Newline ---- Not Working as Expected

RegEx problems forum (archived)
rymcco
Posts: 1
Joined: Fri Jan 22, 2021 3:44 am

Post by rymcco »

I have a large number of XML-based text files that need to be cleaned: each has 1 to 3 lines of garbage at the top, which prevents me from performing ETL. My ultimate goal is to load them into a SQL Server database.

Here is an example of the first 3 lines of one:
ÜTïx Hh 6
Default Table<
å¥ <?xml version="1.0" standalone="no"?>

So I am trying to simply remove EVERYTHING in these files before "<?xml".

The regex I am using in the test bench in TextCrawler 3 is:
Regular Expression: .*(xml)
Replace Text: <?xml

Now this works PERFECTLY in the test bench as long as I have "Dot Matches Newline" checked.

HOWEVER, when I actually go to perform this task, the result is not the same. The text editor only removes the text on line 3 preceding <?xml.

It is NOT removing line 1 or 2, despite toggling on and off Dot Matches Newline and/or Multi-line Anchors.

I am COMPLETELY new to regex. But since it works as intended in the test window, I am wondering if this is a bug. I really need to remove the leading 1-3 lines of garbage from these files. Some files have 1 line, some have 2, some have 3. I need all 1,000+ files to start with simply "<?xml" as the first characters on the first line.

I would really appreciate any support on this, as it is for a project I am trying to push through ASAP at work. Let me know if you need more information.

Thanks in advance

Ryan
DigitalVolcano
Site Admin
Posts: 1804
Joined: Thu Jun 09, 2011 10:04 am

Re: Dot Matches Newline ---- Not Working as Expected

Post by DigitalVolcano »

This should work (with Dot Matches Newline turned on). The only caveat is that the file must not contain any further <?xml occurrences lower down (it shouldn't if it is valid?), otherwise you may lose data.
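To illustrate that caveat, here is a small sketch using Python's re module rather than TextCrawler's own engine, so treat it as an approximation of what a replace-all does there. The sample text and the function names are made up for the example. One case where a second <?xml can legitimately appear is an <?xml-stylesheet ...?> processing instruction.

```python
import re

# The lazy pattern from this reply; re.DOTALL is Python's
# counterpart to "dot matches newline".
PATTERN = re.compile(r".*?<\?xml", re.DOTALL)

def replace_all(text: str) -> str:
    """Replace every match, as a replace-all operation would."""
    return PATTERN.sub("<?xml", text)

def replace_first(text: str) -> str:
    """Replace only the first match."""
    return PATTERN.sub("<?xml", text, count=1)

# Hypothetical sample: junk, then a declaration, then a second "<?xml".
sample = 'junk\n<?xml version="1.0"?>\n<?xml-stylesheet href="s.xsl"?>\n<root/>'

print(replace_all(sample))    # text between the two "<?xml" is lost
print(replace_first(sample))  # only the leading junk is removed
```

With replace-all, the second match starts right after the first <?xml and swallows everything up to the next one, which is where the data loss comes from. Limiting the replacement to the first match avoids it.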

Regex

.*?<\?xml
Replace

<?xml
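If the batch replace keeps behaving differently from the test bench, the same idea can be sketched outside TextCrawler. The following is a minimal Python version, not a TextCrawler feature. It assumes the files can be read as raw bytes (the encoding of the garbage is unknown), and the names strip_leading_junk and clean_file are hypothetical.

```python
import re
from pathlib import Path

# \A anchors at the very start of the data, the lazy .*? stops at the
# first "<?xml", and the lookahead leaves the declaration itself untouched.
# re.DOTALL is Python's counterpart to "dot matches newline".
LEADING_JUNK = re.compile(rb"\A.*?(?=<\?xml)", re.DOTALL)

def strip_leading_junk(data: bytes) -> bytes:
    """Drop everything before the first '<?xml'; unchanged if none is found."""
    return LEADING_JUNK.sub(b"", data, count=1)

def clean_file(path: Path) -> bool:
    """Rewrite one file in place; return True if anything was removed."""
    original = path.read_bytes()
    cleaned = strip_leading_junk(original)
    if cleaned != original:
        path.write_bytes(cleaned)
        return True
    return False
```

Something like `for p in Path("export").glob("*.xml"): clean_file(p)` would then cover the whole batch, where "export" is a placeholder folder name. Because the pattern is anchored at the start, a later <?xml further down the file is never touched. Running it on copies of the files first is the safer choice.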