MobileRead Forums - View Single Post - Clearing trash while converting.. finding with regular expressions

Corbett · 11-26-2011, 06:47 PM

I spent a few hours trying to find out how to get rid of my scanned page headers (without doing any manual work) in a LIT file when converting to ePub using Calibre.

The guides I have read focus on regular expressions and how to use them, but that wasn't my big problem...

First I looked in the preview and saw a pattern. All header rows had a page number either at the beginning or at the end, and (almost) always had blank lines before and after...

All my rows begin with <p> and end with </p>

Step 1: get the rows that end with a number.
<p.+\d</p>
Step 2: get the rows that begin with a number.
<p[^>]*>\d.+</p>
OK, I'm lazy, so I want to combine them and then weed out false positives (numbers that appear in the ordinary text of the book).
Step 3: combine the above with |
(<p[^>]*>\d.+</p>)|(<p.+\d</p>)
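To see what the combined expression from Step 3 catches, here is a minimal Python sketch (Calibre's search uses Python regular expressions). The patterns are written with the closing </p> tag included, and the sample markup, including the class name, is made up for illustration; a real converted book may look different.

```python
import re

# Hypothetical sample, one <p> element per line. Real markup may differ.
sample = "\n".join([
    '<p class="calibre1">12 THE OLD MAN AND THE SEA</p>',
    '<p class="calibre1">He was an old man who fished alone.</p>',
    '<p class="calibre1">CHAPTER ONE 13</p>',
    '<p class="calibre1">In 1951 he caught nothing.</p>',
])

ends_with_number = r"<p.+\d</p>"          # Step 1
starts_with_number = r"<p[^>]*>\d.+</p>"  # Step 2

# Step 3: either alternative may match.
combined = re.compile(f"({starts_with_number})|({ends_with_number})")

# "." does not match a newline by default, so each match stays on one line.
header_candidates = [m.group(0) for m in combined.finditer(sample)]
for row in header_candidates:
    print(row)
```

Only the first and third rows are picked up here. A body paragraph that happens to start or end with a digit would be picked up too, which is what the blank-row check in the later steps is for.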

Step 4: now find the empty rows.
<p[^>]*> </p>
Step 5: I only want the header rows that have an "empty" row before and after.
<p[^>]*> </p>\s+((<p[^>]*>\d.+</p>)|(<p.+\d</p>))\s+<p[^>]*> </p>
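A quick way to check that the blank-row requirement from Step 5 really drops the false positives is to run it against a small made-up sample. This is only a sketch: the patterns include the closing </p> tag, the sample text and class name are invented, and the empty rows are assumed to contain one ordinary space (if they hold a non-breaking space or an entity instead, the blank-row pattern needs adjusting).

```python
import re

# Patterns from the steps, with the closing </p> written out.
HEADER = r"((<p[^>]*>\d.+</p>)|(<p.+\d</p>))"
BLANK = r"<p[^>]*> </p>"  # assumes the empty row holds one plain space
framed_header = re.compile(BLANK + r"\s+" + HEADER + r"\s+" + BLANK)

# Hypothetical sample, one <p> element per line.
sample = "\n".join([
    '<p class="c1">1951 was a bad year for fishing.</p>',  # starts with a digit, no blank rows
    '<p class="c1">He kept going out anyway.</p>',
    '<p class="c1"> </p>',
    '<p class="c1">SOME BOOK TITLE 13</p>',               # real page header
    '<p class="c1"> </p>',
    '<p class="c1">The next page starts here.</p>',
])

# group(1) is the header row itself, without the surrounding blank rows.
matches = [m.group(1) for m in framed_header.finditer(sample)]
print(matches)
```

The paragraph beginning with "1951" is left alone because it has no empty rows around it.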

Step 6: that got rid of my false positives, but finally I also want to get rid of the line break before and after, so the text flows a bit more nicely.

</p>\s+<p[^>]*> </p>\s+((<p[^>]*>\d.+</p>)|(<p.+\d</p>))\s+<p[^>]*> </p>\s+<p[^>]*>
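For anyone who wants to try the Step 6 expression outside Calibre first, the following Python sketch applies it with a single space as the replacement. It says nothing about how or when Calibre itself runs search-and-replace. The patterns include the closing </p> tag, the sample markup is invented, and the empty rows are assumed to hold one ordinary space.

```python
import re

# Step 6 expression: closing tag of the previous paragraph, blank row,
# header row, blank row, opening tag of the next paragraph.
pattern = re.compile(
    r"</p>\s+<p[^>]*> </p>\s+"
    r"((<p[^>]*>\d.+</p>)|(<p.+\d</p>))"
    r"\s+<p[^>]*> </p>\s+<p[^>]*>"
)

# Hypothetical sample: a sentence interrupted by a scanned page header.
sample = "\n".join([
    '<p class="c1">the sentence runs on across</p>',
    '<p class="c1"> </p>',
    '<p class="c1">12 SOME BOOK TITLE</p>',
    '<p class="c1"> </p>',
    '<p class="c1">the page break and ends here.</p>',
    '<p class="c1">1951 was a bad year.</p>',
])

# Replace every match with one space, joining the two halves of the paragraph.
cleaned = pattern.sub(" ", sample)
print(cleaned)
```

Because the match swallows the closing </p> before the header and the opening <p ...> after it, the two text fragments end up inside one paragraph, while the row starting with "1951" stays as it is.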

(Step 7 - FAILED)
So I want to use that expression and replace each match with a single space character... Unfortunately, I failed there.

My book was not converted with the wasted lines deleted, because for some reason the search-and-replace wasn't activated.

Could it be that search-and-replace is done after conversion, so that I have to guess what the converted document looks like when I search and replace in Calibre's conversion module?

But this may still be useful for those who are able to actually replace things :/

Last edited by Corbett; 11-26-2011 at 06:51 PM.