Regex troubles in large HTML files

A place to try and solve your RegEx problems.
Post by CornBread » Fri Dec 21, 2012 12:10 am

Hello, I've been going through tons of HTML files from an old website of mine, trying to strip out the data I need and get it onto my new website. Thanks to this amazing free program, I've gotten at least two months of work done in just a couple of days. Sadly, I've run into a bit of a roadblock, and I hope someone can help. I'm probably just not understanding something properly.

Basically, what I have is a massive file full of HTML code with some sections that I need to delete. These sections always start with the same thing and always end with the same thing. The problem is that I have about 600 of these sections in one file and need to get rid of only what is between <a> and <b>. <a> and <b> appear many times in the file, and I can't seem to get the program to pick up anything other than the first occurrence of <a> through the last occurrence of <b>, with everything in between. Sometimes the program will just lock up indefinitely, probably due to the massive file size.

Here's what I've tried:
<a>(.*)<b>{1}
<a>(.*)<b>?
<a>(.*)<b>{0,1}
<a>(.*)<b>({1})
<a>(.*)(<b>?)
<a>(.*)(<b>{0,1})
(<a>)(.*)(?<!.*?\b<b>\b.*?)
and some more.

Can anyone tell me what it is that I am missing?
Thanks in advance!
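
For anyone trying to reproduce this, here is a minimal sketch of the behaviour described above. It assumes Python's `re` module, which may not match the regex flavor of the program used in this post, and the sample string and the `strip_sections` helper are made up for illustration. A greedy `.*` takes as much as it can, which would explain a match running from the first <a> to the last <b>; a lazy `.*?` takes as little as it can and stops at the nearest <b>.

```python
import re

# Hypothetical sample: <a> and <b> stand in for the real start/end markers.
sample = "keep1<a>drop1<b>keep2<a>drop2\nmore<b>keep3"

# Greedy: .* consumes as much as possible, so one match spans
# from the first <a> to the last <b>.
greedy = re.compile(r"<a>(.*)<b>", re.DOTALL)

# Lazy: .*? consumes as little as possible, so each match stops
# at the nearest <b>. DOTALL lets "." cross line breaks.
lazy = re.compile(r"<a>(.*?)<b>", re.DOTALL)


def strip_sections(text, pattern):
    """Delete what lies between the markers, keeping the markers."""
    return pattern.sub("<a><b>", text)


greedy_result = strip_sections(sample, greedy)
lazy_result = strip_sections(sample, lazy)
lazy_matches = lazy.findall(sample)

print(greedy_result)  # keep1<a><b>keep3
print(lazy_result)    # keep1<a><b>keep2<a><b>keep3
print(lazy_matches)   # ['drop1', 'drop2\nmore']
```

In this sketch the only difference between the two patterns is the `?` placed directly after `.*`, inside the group. Whether that also helps with the lock-ups on a very large file is not something this small example can show.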