MobileRead Forums - View Single Post - Regular expressions, Calibre and you- an introduction (Archived)

ldolse · 09-19-2010, 08:35 PM

You need to be careful about deleting everything between <p> and </p> tags. In that particular example book, doing so would delete actual book text in addition to the headers.

While it's generally a good idea to remove both the opening and closing tags, I think the only format where that's critical is epub. Calibre will force files into xhtml spec if it discovers they're out of spec. (I think for epub it assumes they're already in spec, so you could really screw up an epub.)

Generally .*? is better than .*, and will usually do what users actually want: .*? matches as little as possible, while .* matches as much as possible. I'd use ? instead of *? to make something optional.
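To make the difference concrete, here is a small sketch using Python's re module (calibre is written in Python, so the syntax should carry over, but the sample HTML and variable names below are made up for illustration):

```python
import re

# Made-up sample: a header paragraph followed by real book text.
html = "<p>Chapter One</p><p>It was a dark and stormy night.</p>"

# Greedy: .* runs to the LAST </p>, so the book text is deleted too.
greedy = re.sub(r"<p>.*</p>", "", html)

# Lazy: .*? stops at the FIRST </p>. count=1 limits it to the header.
lazy = re.sub(r"<p>.*?</p>", "", html, count=1)

print(repr(greedy))  # ''
print(repr(lazy))    # '<p>It was a dark and stormy night.</p>'

# A bare ? makes the preceding item optional (zero or one occurrence).
optional_s = re.compile(r"Chapters? \d+")
print(bool(optional_s.fullmatch("Chapter 1")))    # True
print(bool(optional_s.fullmatch("Chapters 12")))  # True
```

The greedy version is the one that eats actual book text along with the header.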

You can think of brackets [] as single character groupings, but for string groupings use parentheses and |:
(one|two|three|four)
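
A quick illustration of brackets versus parentheses, again as a Python sketch with a made-up sample string:

```python
import re

# [one] is a character class: it matches ONE character, either o, n or e.
char_class = re.compile(r"[one]")

# (one|two|three|four) matches any one of the whole strings.
group = re.compile(r"(one|two|three|four)")

text = "two notes"
print(char_class.findall(text))  # ['o', 'n', 'o', 'e']
print(group.findall(text))       # ['two']
```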

A few other useful expressions:
Matching p tags with any styles/ids:
<p[^>]*>
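
Here is roughly how that expression behaves, sketched in Python with an invented snippet of HTML. One thing to watch: as written it also matches other tags starting with "p", such as <pre>. The \b variant below is my own suggestion for tightening that up, not something from the example book.

```python
import re

p_open = re.compile(r"<p[^>]*>")
html = '<p class="calibre1" id="x">Text</p><p>More</p>'

# Matches the opening tag with or without attributes.
print(p_open.findall(html))  # ['<p class="calibre1" id="x">', '<p>']

# Caveat: [^>]* happily swallows "re", so <pre> matches too.
print(p_open.findall("<pre>"))  # ['<pre>']

# Suggested stricter variant: \b requires the tag name to end after "p".
p_strict = re.compile(r"<p\b[^>]*>")
print(p_strict.findall("<pre>"))  # []
print(p_strict.findall(html))     # ['<p class="calibre1" id="x">', '<p>']
```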

Never specify actual spaces in your regular expression. Use \s, which matches a single whitespace character. Better yet, use \s+ or \s*, which match one or more and zero or more whitespace characters respectively. I make liberal use of \s* in my expressions because you never know when a stray space will hurt you. \s* also has the benefit of passing through any whitespace, including tabs and carriage returns. So when you really do need to match everything between <p> and </p>, but the opening and closing tags are on different lines, \s* can get you there.
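
As a rough sketch of why this matters (Python again; the "PAGE 12" header and the sample HTML are invented for the example):

```python
import re

# Invented sample: a page-number header whose tags sit on separate lines,
# padded with tabs, spaces and a carriage return.
html = "<p>\n\t  PAGE 12  \r\n</p>\n<p>Real text.</p>"

# Literal spaces only match exactly one space character each.
literal = re.compile(r"<p> PAGE \d+ </p>")

# \s* and \s+ absorb spaces, tabs, newlines and carriage returns.
flexible = re.compile(r"<p>\s*PAGE\s+\d+\s*</p>")

print(literal.search(html))          # None
print(repr(flexible.sub("", html)))  # '\n<p>Real text.</p>'
```

Note that \s* only bridges whitespace. It carries the match across the line breaks around the header text, but it won't skip over non-whitespace content.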

Last edited by ldolse; 09-23-2010 at 03:12 PM.