Old 09-19-2010, 08:35 PM   #2
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
You need to be careful about deleting everything between <p> and </p> tags. In that particular example book, doing so would delete actual book text in addition to the headers.

While it's generally a good idea to remove both the opening and closing tags, the only format where I think that's critical is epub. Calibre will force files into the xhtml spec if it discovers they're out of spec. (I think for epub it assumes they're already in spec, so you could really screw up an epub.)

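Generally .*? is better than .*, and will usually do what users actually want: it stops at the first point where the rest of the expression can match, rather than the last. To make something optional, I'd use ? on its own rather than *?.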
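
To make the difference concrete, here is a small sketch using Python's re module (Calibre's regex flavor is Python's, as far as I know). The sample markup and the colou?r pattern are made up for illustration and are not from the book being discussed.

```python
import re

# Hypothetical sample: a header paragraph followed by real book text.
html = "<p>Header</p><p>Actual book text</p>"

# Greedy: .* runs to the LAST </p>, so the book text is deleted too.
greedy = re.sub(r"<p>.*</p>", "", html)

# Lazy: .*? stops at the FIRST </p>; count=1 removes only the first match.
lazy = re.sub(r"<p>.*?</p>", "", html, count=1)

# A plain ? makes the preceding item optional (here, the letter u).
optional = [bool(re.fullmatch(r"colou?r", word)) for word in ("color", "colour")]

print(repr(greedy))
print(lazy)
print(optional)
```

The greedy version wipes out the whole string, while the lazy one leaves the second paragraph alone.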

You can think of brackets [] as groupings of single characters, but for groupings of whole strings use parentheses and |:
(one|two|three|four)

A few other useful expressions:
Matching p tags with any styles/ids:
<p[^>]*>
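
For example, assuming some made-up markup with different attributes on each tag, the expression picks up all of the opening tags:

```python
import re

# Hypothetical markup; the class, id and style values are invented.
html = '<p class="calibre1">First</p><p id="x" style="margin:0">Second</p><p>Third</p>'

# [^>]* means "any run of characters that are not >", so it covers
# whatever attributes sit inside the tag and stops at the closing >.
open_tags = re.findall(r"<p[^>]*>", html)

print(open_tags)
```

One thing to watch for: because [^>]* accepts any character, this expression would also match other tags that start with p, such as <pre>.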

Never specify actual spaces in your regular expression. Use \s, which matches any single whitespace character. Better yet, use \s+ or \s*, which match one or more and zero or more whitespace characters respectively. I make liberal use of \s* in my expressions because you never know when a stray space will hurt you. \s* also has the benefit of matching any whitespace, including tabs and carriage returns. So when you really do need to match everything between <p> and </p> but the opening and closing tags are on different lines, \s* will get you there.
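
Here is a rough sketch of that in Python's re module. The "Page 12" header and the surrounding markup are hypothetical, just a stand-in for the kind of header you might want to strip.

```python
import re

# Hypothetical header paragraph whose tags and text are split across lines,
# with a tab, trailing spaces and a carriage return mixed in.
html = "<p>\n\tPage 12  \r\n</p>\n<p>Real text</p>"

# Literal spaces only: does not match because of the stray whitespace.
literal = re.search(r"<p>Page 12</p>", html)

# \s* and \s+ absorb spaces, tabs, carriage returns and newlines.
pattern = r"<p>\s*Page\s+\d+\s*</p>"
cleaned = re.sub(pattern, "", html)

print(literal)
print(repr(cleaned))
```

The literal version finds nothing, while the \s version removes the header paragraph and leaves the real text untouched.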

Last edited by ldolse; 09-23-2010 at 03:12 PM.