removing lines that do not match criteria

Mike B · Post by **Mike B** » Sun Feb 20, 2011 7:51 pm

Hello

I'm very much a novice and would appreciate some help.

I have a number of text files, each containing multiple file paths on separate lines. Each file path is delimited by <filename> and </filename>. In addition to the file paths, each text file contains a number of other lines of text which I wish to delete completely.

I would like to scan through all my text files, automatically deleting any lines of text that don't begin with <filename>, so just leaving the file paths.

If possible, I wish to remove the delimiting words <filename> and </filename> as well.

How do I start to go about this?

Any help appreciated.

Regards

Mike B

Fool4UAnyway · Post by **Fool4UAnyway** » Sun Feb 20, 2011 10:43 pm

Let's turn the world around... Or at least, this request.

Would it be correct to say you _only_ want to _keep_ the file paths?

In that case, you may just use the Extract button to get those.

You could start to just get (keep) only the <filename>...</filename> lines:

<filename>[^>]+</filename>\r\n

You can then strip the <filename> tags using the following regex, if you have saved the list to a new file.

Find:
</?filename>

Replace by: (leave blank)

Then you should have your list of file paths.
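For anyone who wants to check the logic outside TextCrawler, here is a minimal Python sketch of the same two steps (extract, then strip the tags). The sample text is made up, since the real files are not shown in this thread, and Python's `re` module may not behave exactly like TextCrawler's regex engine. The trailing `\r\n` is left out of the extract pattern so each result holds only the tagged path.

```python
import re

# Hypothetical sample content; the real files are not shown in the thread.
sample = (
    "<header>playlist</header>\r\n"
    "<filename>C:\\music\\one.mp3</filename>\r\n"
    "<length>200</length>\r\n"
    "<filename>C:\\music\\two.mp3</filename>\r\n"
)

# Step 1: extract only the <filename>...</filename> spans.
tagged = re.findall(r"<filename>[^>]+</filename>", sample)

# Step 2: strip the opening and closing tags, as in the Find above.
paths = [re.sub(r"</?filename>", "", item) for item in tagged]

for path in paths:
    print(path)
```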

Fool4UAnyway · Post by **Fool4UAnyway** » Sun Feb 20, 2011 11:59 pm

If you want to keep those file paths separate in their original files, you may want to remove all the text from the files first.

Find:
\r\n[^<].*

Replace by: (leave empty)

This will remove any line not starting with <.

You can then remove the </?filename> tags as mentioned before.
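A small Python sketch of what this Find does, using made-up sample text. Two assumptions: Python's `re` stands in for TextCrawler's engine, and `.*` is written as `[^\r\n]*` here so the match stops cleanly before the next line ending. Note that the pattern only removes lines that do not start with `<`; if every line in a file starts with `<`, it finds nothing.

```python
import re

# "\r\n", then one character that is not "<", then the rest of that line.
pattern = re.compile(r"\r\n[^<][^\r\n]*")

# Hypothetical file where one line does not start with "<".
mixed = "<filename>a.txt</filename>\r\nsome note\r\n<filename>b.txt</filename>"
cleaned = pattern.sub("", mixed)
print(repr(cleaned))

# Hypothetical file where every line starts with "<": nothing is matched.
all_tags = "<title>x</title>\r\n<filename>a.txt</filename>\r\n<size>1</size>"
matches = pattern.findall(all_tags)
print(matches)
```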

Mike B · Post by **Mike B** » Mon Feb 21, 2011 8:41 am

Thanks for the helpful and quick replies.

I want to keep file paths separate in their original files, so I can use the second solution first, then remove the 'filename' tags as suggested.

Thanks very much.

Regards

Mike B

Mike B · Post by **Mike B** » Mon Feb 21, 2011 3:43 pm

Hello again

I am actually having trouble getting this to work. Don't know if I am doing something wrong with TextCrawler or what?

Have been plugging in the suggested reg ex in the reg ex field. ie.

Find:
\r\n[^<].*

When I hit the Find button I get no matches! None at all!

Actually every line of every file begins with the character <, so, the suggested expression should find something...

What I guess I really need is to delete every line that doesn't begin with <filename>.

Would be grateful for any further help.

Regards

Mike B

Fool4UAnyway · Post by **Fool4UAnyway** » Mon Feb 21, 2011 8:42 pm

First this: [^<] means accept any character that is NOT <, so the earlier expression will work only on lines that do NOT begin with <. Since all your lines begin with <, it finds nothing.

What are we gonna do now?
We are just gonna change the order.

1. Remove all filename tags around the filepaths
2. Remove all remaining lines with tags

1. Remove all filename tags around the filepaths

This will make the lines different from the other lines starting with a tag.

Find:
<filename>([^>]+)</filename>\r\n

Replace:
$1

$1 references the first ( ) group, re-placing what was found using the regular expression between parentheses, i.e. the filepath.

2. Remove all lines with tags

The lines that have not been processed still start with a tag and you do want to remove all of them now.

Find:
<.*\r\n

Replace: (leave empty)
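The reordered idea (strip the filename tags first, then delete whatever still starts with a tag) can be sketched in Python as below. This is only an illustration on made-up sample text, not TextCrawler itself. Python writes the group reference as `\1` where the Replace field above uses `$1`, and the sketch leaves the `\r\n` out of the first Find so that each path keeps its own line ending.

```python
import re

# Hypothetical sample content with Windows (CR LF) line endings.
sample = (
    "<header>playlist</header>\r\n"
    "<filename>C:\\music\\one.mp3</filename>\r\n"
    "<length>200</length>\r\n"
    "<filename>C:\\music\\two.mp3</filename>\r\n"
    "<footer>end</footer>\r\n"
)

# Step 1: replace each tagged path by the captured path only.
step1 = re.sub(r"<filename>([^>]+)</filename>", r"\1", sample)

# Step 2: remove every line that still starts with a tag,
# including its line ending.
step2 = re.sub(r"<.*\r\n", "", step1)

print(repr(step2))
```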

Mike B · Post by **Mike B** » Tue Feb 22, 2011 9:16 am

Thank you for your patience and further help.

That's a great explanation for me - I am starting to get the logic now and have success with what you suggest using the TextCrawler Regular Expression Tester.

For some reason, tho, I am unable to get this to work with TextCrawler itself. Am hoping it's simply an incorrect setting but can't see it at the moment.

Will let you know if I sort it out.

Regards

Mike B

Fool4UAnyway · Post by **Fool4UAnyway** » Tue Feb 22, 2011 9:19 pm

You may post your actual results, or try to load a single file into the Regular Expression Tester and see if that will work.

Perhaps your files use a different newline than the Windows \r\n (CR LF = Carriage Return, Line Feed). Other systems (Unix, Mac) use only one of the two.
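One way to check which newline a file uses is to count the byte sequences directly. The Python sketch below is an assumption about how one might do that; the helper name `count_line_endings` and the file name in the comment are hypothetical.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CR LF pairs, lone LF and lone CR in raw file bytes."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf   # LF not preceded by CR
    cr = data.count(b"\r") - crlf   # CR not followed by LF
    return {"CRLF": crlf, "LF": lf, "CR": cr}


# Small in-memory examples instead of real files.
windows_style = count_line_endings(b"<a>\r\n<b>\r\n")
unix_style = count_line_endings(b"<a>\n<b>\n")
print(windows_style, unix_style)

# For a real file, read it in binary mode so nothing is translated:
# with open("playlist.txt", "rb") as handle:   # hypothetical file name
#     print(count_line_endings(handle.read()))
```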

Mike B · Post by **Mike B** » Fri Feb 25, 2011 2:17 pm

OK, am back on it now.

Have found a 2 pass solution that almost works - just leaves an empty line at beginning and end of each text file:

Pass 1:

Find:
<filename>([^>]+)</filename>

Replace by:
$1

Pass 2:

Find:
\s*<.+

Replace by: (leave empty)

Can work with this, tho it would be nice to have a solution that removed the blank lines as well.

As suggested, did try:

Find:<filename>([^>]+)</filename>\r\n --> no matches

Find:<filename>([^>]+)</filename>\r --> no matches

Find:<filename>([^>]+)</filename>\n --> matches found OK but stripped line endings to give multiple file paths all in one very long line.

Also, did find your two pass solution worked OK in the Reg Expression Tester but not in TextCrawler:

Find:<filename>([^>]+)</filename>\r\n

Replace:$1

then

Find:<.*\r\n

Replace:(leave empty)

Don't know what's going on there...
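The results above (no matches with `\r\n`, matches with `\n` but all paths run together) are consistent with files that use LF-only line endings. The Python sketch below reproduces that behaviour on made-up text; it assumes Python's `re` is close enough to TextCrawler's engine for this purpose, and writes the group reference as `\1` instead of `$1`.

```python
import re

# Hypothetical sample with LF-only line endings.
sample = "<filename>a.txt</filename>\n<filename>b.txt</filename>\n"

# With \r\n in the Find there is no CR to match, so nothing is found.
crlf_hits = re.findall(r"<filename>([^>]+)</filename>\r\n", sample)

# With \n in the Find, the newline is part of the match, and replacing
# by the group alone throws it away: the paths run together.
joined = re.sub(r"<filename>([^>]+)</filename>\n", r"\1", sample)

# Without any newline in the Find, the line endings stay in place.
kept = re.sub(r"<filename>([^>]+)</filename>", r"\1", sample)

print(crlf_hits, repr(joined), repr(kept))
```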

Fool4UAnyway · Post by **Fool4UAnyway** » Fri Feb 25, 2011 9:48 pm

Find:
<filename>([^>]+)</filename>

Replace:
$1

You may drop the newline character(s) from the Find expression for the filepath lines completely. You just want to remove the tags from those lines and keep each path on its own line.

<.*\r\n searches for < and then anything else on that line, up to and including the newline characters. Replacing it by nothing means the line (if < is at the start) will be completely removed.
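As an alternative outside TextCrawler, the whole job can also be done line by line, which sidesteps both the newline question and the leftover blank lines. This is a rough Python sketch, not the method used in this thread; the function name `keep_file_paths` and the sample text are hypothetical.

```python
import re


def keep_file_paths(text: str, newline: str = "\n") -> str:
    """Keep only <filename>...</filename> lines, with the tags removed."""
    paths = []
    # splitlines() handles CR LF, LF and CR endings alike.
    for line in text.splitlines():
        match = re.fullmatch(r"\s*<filename>(.*?)</filename>\s*", line)
        if match:
            paths.append(match.group(1))
    return newline.join(paths)


# Hypothetical sample content.
sample = (
    "<header>playlist</header>\n"
    "<filename>C:\\a.txt</filename>\n"
    "<length>200</length>\n"
    "<filename>D:\\b.txt</filename>\n"
    "<footer>end</footer>\n"
)
result = keep_file_paths(sample)
print(result)
```

To process real files, one would read each file, pass its text to the function and write the result back; try it on copies first.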