Extracting Text from files

Shawn · Post by **Shawn** » Wed Feb 24, 2010 5:10 pm

How can I set a starting point and an ending point and grab everything in between? Example: <noscript> Wanted Information </noscript> ...

Thanks again for making such a great program.

Shawn · Post by **Shawn** » Fri Feb 26, 2010 2:30 am

Well, I'm guessing this is something that can't be done with Text Crawler. I'll try to find a different solution / program, since I have over 1,000,000 files to deal with... I'll try with a macro recorder...

Good Day to you all...

dv2 · Post by **dv2** » Fri Feb 26, 2010 9:09 am

You can do this with TC - sorry for the slow response!

In 'Regular expressions' mode try
<noscript>[\s\S]*</noscript>

Then 'extract'. This will grab everything between (and including) the tags. You can always remove the tags afterwards with another search/replace op.
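
For readers who want to try the pattern outside TC, here is a minimal Python sketch. The sample text is made up for illustration. Note that `[\s\S]*` is greedy: when a file holds more than one block, it runs from the first opening tag to the last closing tag. The lazy `*?` form stops at the nearest closing tag; whether TC's regex engine accepts `*?` is an assumption you would need to check.

```python
import re

# Hypothetical sample file content with two <noscript> blocks.
sample = (
    "<html>\n"
    "<noscript>First\nrecipe</noscript>\n"
    "<p>junk</p>\n"
    "<noscript>Second recipe</noscript>\n"
    "</html>\n"
)

# Pattern from the post: greedy, so one match spans from the first
# opening tag to the last closing tag, junk in between included.
greedy = re.findall(r"<noscript>[\s\S]*</noscript>", sample)

# Lazy variant: each match ends at the nearest closing tag, and the
# capture group returns only the text between the tags.
lazy = re.findall(r"<noscript>([\s\S]*?)</noscript>", sample)

print(len(greedy))
print(lazy)
```

With the capture group, the tags are already stripped, so the second search/replace pass is not needed in this sketch.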

Shawn · Post by **Shawn** » Fri Feb 26, 2010 10:46 am

My apologies for my impatience. I just wanted to get the files done as soon as possible because the data were recipes going to be used for a community kitchen program (to get people to eat different meals without busting a budget).

Thanks again.

Fool4UAnyway · Post by **Fool4UAnyway** » Fri Feb 26, 2010 11:26 pm

A non-regex solution (for the macro) could be like this:

1. Search for the opening tag and put a linebreak character _before_ it.
2. Search for the closing tag.
3. Remove any linebreak characters between the two tags by performing a replace on the selection that runs from the closing tag back to the opening tag.
4. Add a linebreak character _after_ the closing tag.

Now, all wanted information should be on individual lines, starting with the opening tag and ending with the closing tag.

5. Sort the lines (hey, there are a million...)
6. Keep the lines starting with the opening tag. After the sort they are grouped together.

If you do not want the lines to be sorted, you could perform a regex replace emptying all lines that do not contain the noscript tag.
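
The same idea can be sketched in Python without any regex, using plain string searches. This is only an illustration of the steps above, not a macro for any particular editor: the function name `extract_tagged_lines` is hypothetical, and replacing inner line breaks with a space (rather than deleting them outright) is my own assumption to keep words from running together. Blocks come out in file order, so no sorting step is needed.

```python
OPEN_TAG = "<noscript>"
CLOSE_TAG = "</noscript>"


def extract_tagged_lines(text, open_tag=OPEN_TAG, close_tag=CLOSE_TAG):
    """Return each tagged block as a single line, in file order."""
    lines = []
    pos = 0
    while True:
        # Steps 1 and 2: find the next opening tag, then its closing tag.
        start = text.find(open_tag, pos)
        if start == -1:
            break
        end = text.find(close_tag, start + len(open_tag))
        if end == -1:
            break  # unmatched opening tag: stop rather than guess
        end += len(close_tag)
        # Steps 3 and 4: collapse line breaks inside the block so the
        # whole block sits on one line of its own.
        block = text[start:end].replace("\r", "").replace("\n", " ")
        lines.append(block)
        pos = end
    return lines


sample = "junk\n<noscript>x\ny</noscript>more junk\n<noscript>z</noscript>\n"
for line in extract_tagged_lines(sample):
    print(line)
```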

Shawn · Post by **Shawn** » Sun Feb 28, 2010 12:19 am

@Fool4UAnyway

Thanks for the macro tip. However, placing all the items on one line wouldn't work with this table, as the items are all in order. I just had WAY too many junk lines in the files due to crappy exportation in the first place.

P.S.: The database is done. Thank you all for your time.

Ray · Post by **Ray** » Sun Apr 17, 2011 12:39 am

I'm using TC 2.0. I've got a bunch of ANSI text files that all begin with "FROM: [username]," where username varies. For example, one is an email whose first line is "FROM: Joe." I want to extract just the username. When I use "FROM: [\s\S]*\n" in Regular Expression mode and then click Find, it highlights the entire contents of the file in yellow. Isn't \n the right way to indicate that I want everything up to the next newline? Same thing if I use \r instead. ... Looks like a great program, my first time using ... wish me luck!
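
A likely explanation, sketched in Python since TC's engine isn't available here: `[\s\S]` matches any character, newlines included, and `*` is greedy, so the match runs on to the last newline in the file instead of stopping at the first one. Excluding line-break characters from the class keeps the match on one line. The sample text is made up, and the assumption is that TC's regex engine behaves like Python's `re` module on this point.

```python
import re

# Hypothetical file content with Windows-style line endings.
sample = "FROM: Joe\r\nSubject: lunch\r\nSee you at noon.\r\n"

# Pattern from the post: [\s\S]* also consumes \r and \n, and being
# greedy it only gives back enough to end on the LAST \n in the file.
whole_file = re.search(r"FROM: [\s\S]*\n", sample)

# Alternative: [^\r\n]* cannot cross a line break, so the match stops
# at the end of the first line; the group captures just the username.
first_line = re.search(r"FROM: ([^\r\n]*)", sample)

print(whole_file.group(0) == sample)
print(first_line.group(1))
```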