Extracting Text from files

Posted: Wed Feb 24, 2010 5:10 pm
by Shawn
How can I set a starting point and an ending point and grab everything in between? Example: <noscript> Wanted Information </noscript> ...

Thanks again for making such a great program.

Posted: Fri Feb 26, 2010 2:30 am
by Shawn
Well, I'm guessing this is something that can't be done with Text Crawler. I'll try to find a different solution or program, since I have over 1,000,000 files to deal with. I'll try a macro recorder...

Good Day to you all...

Posted: Fri Feb 26, 2010 9:09 am
by dv2
You can do this with TC - sorry for the slow response!

In 'Regular expressions' mode try
<noscript>[\s\S]*</noscript>

Then 'extract'. This will grab everything between (and including) the tags. You can always remove the tags afterwards with another search/replace op.
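
One thing to watch for: `[\s\S]*` is greedy, so if a file holds more than one <noscript> block, the pattern above will run from the first opening tag to the last closing tag and pull in everything between the blocks too. Below is a minimal Python sketch of the difference, not Text Crawler itself. The sample text is made up, and whether Text Crawler's regex engine accepts the lazy `*?` form is an assumption that is not confirmed in this thread.

```python
import re

# Hypothetical sample: two <noscript> blocks in one file, with junk around them.
sample = (
    "junk\n<noscript>\nfirst recipe\n</noscript>\n"
    "more junk\n<noscript>second recipe</noscript>\ntrailer\n"
)

# Greedy, as posted: one match from the first opening tag to the LAST closing tag.
greedy = re.findall(r"<noscript>[\s\S]*</noscript>", sample)

# Lazy variant with a capture group: one match per block, tags left out.
lazy = re.findall(r"<noscript>([\s\S]*?)</noscript>", sample)
blocks = [b.strip() for b in lazy]

print(len(greedy))  # number of greedy matches
print(blocks)       # contents of each block
```

If each file only ever contains a single <noscript> block, the greedy and lazy forms give the same result.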

Posted: Fri Feb 26, 2010 10:46 am
by Shawn
My apologies for my impatience. I just wanted to get the files done as soon as possible, because the data were recipes that are going to be used for a community kitchen program (to get people to eat different meals without busting a budget).

Thanks again.


Posted: Fri Feb 26, 2010 11:26 pm
by Fool4UAnyway
A non-regex solution (for the macro) could be like this:

1. Search for the opening tag and put a linebreak character _before_ it.
2. Search for the closing tag.
3. Remove any linebreak characters between the two tags. Perform a replace on a selection from the closing tag to the opening tag.
4. Add a linebreak character _after_ the closing tag.

Now, all wanted information should be on individual lines, starting with the opening tag and ending with the closing tag.

5. Sort the lines (hey, there are a million...)
6. Keep the lines starting with the opening tag. They are sorted then.

If you do not want the lines to be sorted, you could perform a regex replace emptying all lines that do not contain the noscript tag.
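
For anyone scripting this instead of recording a macro, the steps above could be sketched in Python roughly as follows. This is only an illustration using plain string searches (no regex). The function name and sample text are made up, and sorting is left optional since step 5 may not suit every data set.

```python
OPEN, CLOSE = "<noscript>", "</noscript>"

def blocks_on_single_lines(text):
    """Return one line per <noscript> block, found with plain string searches."""
    lines = []
    pos = 0
    while True:
        start = text.find(OPEN, pos)               # steps 1-2: find the opening tag...
        if start == -1:
            break
        end = text.find(CLOSE, start + len(OPEN))  # ...and its closing tag
        if end == -1:
            break                                  # unclosed block: stop here
        end += len(CLOSE)
        block = text[start:end]
        # Step 3: remove line breaks between the two tags.
        block = block.replace("\r", "").replace("\n", "")
        lines.append(block)                        # steps 1 and 4: block gets its own line
        pos = end
    return lines

# Hypothetical sample with junk lines around two blocks.
sample = "junk\n<noscript>\nSoup\n</noscript>\nmore junk\n<noscript>Bread</noscript>\n"

in_order = blocks_on_single_lines(sample)  # original order kept
in_sorted_order = sorted(in_order)         # step 5, only if sorting is wanted
print(in_order)
```

Collecting only the tagged blocks makes step 6 unnecessary here, because the junk lines are never copied in the first place.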

Posted: Sun Feb 28, 2010 12:19 am
by Shawn
@Fool4UAnyway

Thanks for the macro tip. However, placing all the items on one line wouldn't work with this table, as the items are all in order. I just had WAY too many junk lines in the files due to crappy exportation in the first place.

P.S.: The database is done. Thank you all for your time.

Posted: Sun Apr 17, 2011 12:39 am
by Ray
I'm using TC 2.0. I've got a bunch of ANSI text files that all begin with "FROM: [username]," where username varies. For example, one is an email whose first line is "FROM: Joe." I want to extract just the username. When I use "FROM: [\s\S]*\n" in Regular Expression mode and then click Find, it highlights the entire contents of the file in yellow. Isn't \n the right way to indicate that I want everything up to the next newline? Same thing if I use \r instead. ... Looks like a great program, my first time using ... wish me luck!
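
A likely explanation, going only by how regular expressions generally behave: `[\s\S]` matches every character, line breaks included, and `*` is greedy, so `[\s\S]*\n` runs on to the last newline in the file rather than the first. A character class that excludes line breaks stops at the end of the first line. Here is a minimal Python sketch of both patterns. The sample email is made up, and Text Crawler's own engine may differ in details.

```python
import re

# Hypothetical ANSI text file contents with Windows (CRLF) line endings.
sample = "FROM: Joe\r\nTO: Ann\r\nSubject: lunch\r\n\r\nBody text\r\n"

# As posted: [\s\S] also matches \r and \n, so the match only ends at the
# LAST newline in the file.
greedy = re.search(r"FROM: [\s\S]*\n", sample).group(0)

# [^\r\n]* matches anything except line-break characters, so it stops at the
# end of the first line. The parentheses capture just the username.
match = re.search(r"FROM: ([^\r\n]*)", sample)
username = match.group(1)

print(greedy == sample)  # did the posted pattern swallow the whole file?
print(username)
```

Excluding both `\r` and `\n` keeps a stray carriage return out of the captured name when the files use CRLF line endings.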