Clearing trash while converting: finding it with regular expressions
I spent a few hours trying to find out how to get rid of the scanned page headers in a LIT file (without doing any manual work) when converting it to ePub with Calibre.
The guides I have read focus on regular expressions and how to use them, but that wasn't my big problem...
First I looked in the preview and saw a pattern. All header rows had a page number either at the beginning or at the end, and (almost) always had blank lines before and after...
All my rows begin with <p> and end with </p>.
Step 1. Get the rows that end with a number:
<p.+\d</p>
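For anyone who wants to try the Step 1 pattern outside Calibre, here is a rough sketch in Python. It assumes that Python's re module behaves like the regex engine used by the conversion search, and the sample rows (including the class name) are made up for illustration.

```python
import re

# Step 1 pattern from the post: a <p> row whose text ends with a digit.
ENDS_WITH_NUMBER = re.compile(r"<p.+\d</p>")

# Made-up sample rows, not taken from a real book.
rows = [
    '<p class="calibre1">THE BOOK TITLE 17</p>',  # header, page number last
    '<p class="calibre1">18 CHAPTER NAME</p>',    # header, page number first
    '<p class="calibre1">He walked home.</p>',    # ordinary text
]

# Keep only the rows that the pattern finds.
step1_hits = [row for row in rows if ENDS_WITH_NUMBER.search(row)]
print(step1_hits)
```

Only the first row should be reported, since it is the only one with a digit directly before </p>.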
Step 2. Get the rows that begin with a number:
<p[^>]*>\d.+</p>
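The Step 2 pattern can be checked the same way. Again this is only a sketch: it assumes Python's re module is a fair stand-in for Calibre's engine, and the sample rows are invented.

```python
import re

# Step 2 pattern from the post: a <p> row whose text begins with a digit.
# [^>]* skips any attributes inside the opening tag.
BEGINS_WITH_NUMBER = re.compile(r"<p[^>]*>\d.+</p>")

# Made-up sample rows, not taken from a real book.
rows = [
    '<p class="calibre1">THE BOOK TITLE 17</p>',  # header, page number last
    '<p class="calibre1">18 CHAPTER NAME</p>',    # header, page number first
    '<p class="calibre1">He walked home.</p>',    # ordinary text
]

step2_hits = [row for row in rows if BEGINS_WITH_NUMBER.search(row)]
print(step2_hits)
```

Here only the second row should come back. The digit inside the class name does not count, because the pattern wants the digit right after the closing > of the tag.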
OK, I'm lazy, so I want to combine them before weeding out the false positives (numbers that show up in the book's ordinary text).
Step 3. Combine the two patterns with |:
(<p[^>]*>\d.+</p>)|(<p.+\d</p>)
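A quick sketch of what the combined pattern catches, including one false positive. It assumes Python's re module matches the way Calibre does; the rows are invented examples.

```python
import re

# Step 3 pattern from the post: begins with a digit OR ends with a digit.
HEADER = re.compile(r"(<p[^>]*>\d.+</p>)|(<p.+\d</p>)")

# Made-up sample rows, not taken from a real book.
rows = [
    "<p>THE BOOK TITLE 17</p>",    # header, page number last
    "<p>18 CHAPTER NAME</p>",      # header, page number first
    "<p>He walked home.</p>",      # ordinary text
    "<p>He was born in 1902</p>",  # ordinary text that ends with a number
]

# One True/False per row, in the same order as rows.
step3_hits = [bool(HEADER.search(row)) for row in rows]
print(step3_hits)
```

The last row is ordinary text but still matches, which is the kind of false positive the following steps try to weed out.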
Step 4. Now find the empty rows:
<p[^>]*> </p>
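One thing worth checking is what this pattern counts as "empty". The sketch below assumes Python's re module as a stand-in for Calibre's engine, and the three candidate rows are guesses at how a blank row might be written, not something taken from the actual book.

```python
import re

# Step 4 pattern from the post: a <p> row containing exactly one space.
EMPTY_ROW = re.compile(r"<p[^>]*> </p>")

# Hypothetical ways a blank row could appear in the HTML.
candidates = {
    "single space": '<p class="calibre1"> </p>',
    "no content": '<p class="calibre1"></p>',
    "nbsp entity": '<p class="calibre1">&nbsp;</p>',
}

empty_hits = {name: bool(EMPTY_ROW.search(html)) for name, html in candidates.items()}
print(empty_hits)
```

Only the row with a literal single space matches, so if the blank rows in the real document look different, the pattern would need adjusting.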
Step 5. I only want the header rows that have an "empty" row before and after:
<p[^>]*> </p>\s+((<p[^>]*>\d.+</p>)|(<p.+\d</p>))\s+<p[^>]*> </p>
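To see how the blank-row requirement removes the false positive, here is a small sketch. It assumes Python's re module behaves like Calibre's engine, and the page text is invented.

```python
import re

# Step 5 pattern from the post: a header row with an "empty" row on each side.
GUARDED_HEADER = re.compile(
    r"<p[^>]*> </p>\s+((<p[^>]*>\d.+</p>)|(<p.+\d</p>))\s+<p[^>]*> </p>"
)

# Made-up page fragment: one real header and one row that merely ends in a number.
page = "\n".join([
    "<p>He was born in 1902</p>",
    "<p>and never left the village,</p>",
    "<p> </p>",
    "<p>THE BOOK TITLE 17</p>",
    "<p> </p>",
    "<p>not even once.</p>",
])

# group(1) is the header row itself, without the surrounding empty rows.
step5_hits = [m.group(1) for m in GUARDED_HEADER.finditer(page)]
print(step5_hits)
```

The row ending in 1902 is no longer picked up, because it has no empty rows around it.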
Step 6. That got rid of my false positives, but finally I also want to remove the line breaks before and after, to make the text flow a bit nicer:
</p>\s+<p[^>]*> </p>\s+((<p[^>]*>\d.+</p>)|(<p.+\d</p>))\s+<p[^>]*> </p>\s+<p[^>]*>
(Step 7 - FAILED)
I wanted to use that expression and replace each match with a single space character... Unfortunately, that is where I failed.
My book was not converted with the wasted lines deleted, because for some reason the search and replace was never applied...
Could it be that search and replace is done after conversion, so that I have to guess what the converted document looks like when I search and replace in Calibre's conversion module?
But this may still be useful for those who actually manage to get the replacement to work :/
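For reference, this is roughly what the intended replacement would do if it were applied. It is only a sketch run in plain Python with re.sub, not inside Calibre, so it says nothing about why the conversion did not apply it; the page text is invented.

```python
import re

# Step 6 pattern from the post: the header, its two empty rows, plus the
# closing </p> of the row before and the opening <p> of the row after.
STEP6 = re.compile(
    r"</p>\s+<p[^>]*> </p>\s+((<p[^>]*>\d.+</p>)|(<p.+\d</p>))\s+<p[^>]*> </p>\s+<p[^>]*>"
)

# Made-up page fragment: a sentence split in two by a scanned page header.
page = "\n".join([
    "<p>and never left the village,</p>",
    "<p> </p>",
    "<p>THE BOOK TITLE 17</p>",
    "<p> </p>",
    "<p>not even once.</p>",
])

# Replace the whole match with a single space, joining the two paragraphs.
cleaned = STEP6.sub(" ", page)
print(cleaned)
```

With this sample the header and both empty rows disappear and the two halves of the sentence end up in one paragraph.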
Last edited by Corbett; 11-26-2011 at 06:51 PM.