Clearing trash while converting.. finding with regular expressions

Corbett · 11-26-2011, 06:47 PM

I spent a few hours trying to find out how to get rid of the scanned page headers in a LIT file (without doing any manual work) when converting to ePub with Calibre.

The guides I have read focus on regular expressions and how to use them, but that wasn't my big problem...

First I looked in the preview and saw a pattern. All header rows had a page number either at the beginning or at the end, and (almost) always had blank lines before and after...

All my rows begin with <p> and end with </p>

Step 1. Get the rows that end with a number:
<p.+\d</p>
Step 2. Get the rows that begin with a number:
<p[^>]*>\d.+</p>
OK, I'm lazy, so I want to combine them to weed out false positives (where I have numbers in the ordinary text of the book).
Step 3. Combine the above with |
(<p[^>]*>\d.+</p>)|(<p.+\d</p>)
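A minimal Python sketch of steps 1 to 3, assuming every row is wrapped in <p>...</p> as described above. The sample rows are invented for illustration and are not from the actual book. Note that the last row, ordinary text ending in a year, also matches, which is the kind of false positive the following steps try to exclude.

```python
import re

# Steps 1 and 2, with the closing </p> written out.
ENDS_WITH_NUMBER = r"<p.+\d</p>"
STARTS_WITH_NUMBER = r"<p[^>]*>\d.+</p>"

# Step 3: combine both alternatives with |
HEADER_ROW = re.compile(f"({STARTS_WITH_NUMBER})|({ENDS_WITH_NUMBER})")

# Hypothetical rows, made up for this example.
rows = [
    "<p>12 CHAPTER ONE</p>",                      # page number first
    '<p class="calibre1">THE BOOK TITLE 13</p>',  # page number last
    "<p>He walked slowly down the road.</p>",     # plain text, no number
    "<p>She was born in 1984</p>",                # plain text ending in a number
]

# True where the whole row looks like a header according to the pattern.
matches = [bool(HEADER_ROW.fullmatch(row)) for row in rows]
print(matches)
```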

Step 4. Now to find empty rows:
<p[^>]*> </p>
Step 5. I only want the header rows that have an "empty" row before and after:
<p[^>]*> </p>\s+((<p[^>]*>\d.+</p>)|(<p.+\d</p>))\s+<p[^>]*> </p>
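The effect of step 5 can be sketched in Python like this. It assumes an "empty" row is a paragraph holding a single space and that rows are separated by newlines; both sample strings are invented for illustration.

```python
import re

BLANK = r"<p[^>]*> </p>"                         # step 4: an "empty" row
HEADER = r"((<p[^>]*>\d.+</p>)|(<p.+\d</p>))"    # step 3: a numbered row

# Step 5: a numbered row only counts when blank rows surround it.
FRAMED_HEADER = re.compile(BLANK + r"\s+" + HEADER + r"\s+" + BLANK)

# Hypothetical samples, made up for this example.
page_header = "<p> </p>\n<p>THE BOOK TITLE 13</p>\n<p> </p>"
body_text = "<p>It rained.</p>\n<p>She was born in 1984</p>\n<p>It snowed.</p>"

header_found = FRAMED_HEADER.search(page_header) is not None
body_found = FRAMED_HEADER.search(body_text) is not None
print(header_found, body_found)
```

Here the numbered row inside ordinary text is no longer picked up, because it has no blank rows around it.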

Step 6. That got rid of my false positives, but finally I also want to remove the line breaks before and after, so the text flows a bit nicer:

</p>\s+<p[^>]*> </p>\s+((<p[^>]*>\d.+</p>)|(<p.+\d</p>))\s+<p[^>]*> </p>\s+<p[^>]*>

(Step 7 - FAILED)
So I want to use that expression and replace each match with a single space character... Unfortunately I failed there.
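For reference, this is roughly what the intended replacement would do if run in plain Python instead of Calibre's conversion dialog. It is only a sketch: the sample markup is invented, and it assumes the markup really looks like this at the moment the expression is applied, which is exactly the open question here.

```python
import re

# Step 6 expression, with the closing </p> tags written out.
STEP6 = re.compile(
    r"</p>\s+<p[^>]*> </p>\s+"
    r"((<p[^>]*>\d.+</p>)|(<p.+\d</p>))"
    r"\s+<p[^>]*> </p>\s+<p[^>]*>"
)

# Hypothetical markup: a sentence split in two by a scanned page header.
html = (
    "<p>The sentence was cut off by a</p>\n"
    "<p> </p>\n"
    "<p>THE BOOK TITLE 13</p>\n"
    "<p> </p>\n"
    "<p>page break in the scan.</p>\n"
)

# Step 7: replace the whole block with a single space, which joins
# the paragraph before the header with the paragraph after it.
cleaned = STEP6.sub(" ", html)
print(cleaned)
```

Because the match starts at the closing tag of the previous paragraph and ends at the opening tag of the next one, the two paragraphs become one.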

My book wasn't converted with the wasted lines deleted; for some reason the search-and-replace wasn't applied...

Could it be that Search & Replace is done after conversion, so that I have to guess what the converted document looks like when I search and replace in Calibre's conversion module?

But this may still be useful for those who are able to actually replace things :/


Serpentine · 11-26-2011, 07:16 PM

It's generally easier to do this stuff outside of Calibre: do the conversion to EPUB/HTMLZ and try to get things reasonable, then take the markup and process it by hand in something easier. RegexBuddy works well; I'm sure there are free alternatives, but I never found one with a good spread of features. Sigil is nice for EPUB editing, however the current release has some problems with regex (tho it's fixed for the next release! - no real preview however).

It's also pretty tricky to help without a sample to work from. If you provide one, I'm sure I can work out a better way (there are a few problems with the regex there that will most likely miss things).
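One possible way to do the "outside of Calibre" processing is a small Python script, sketched below under several assumptions: the book has already been converted to EPUB, its text files are UTF-8 and end in .html, .xhtml or .htm, and the step 6 expression from the first post fits the converted markup. The function name strip_headers is made up for this example, and this is not a substitute for checking the result in an editor.

```python
import io
import re
import zipfile

# Step 6 expression from the first post, with the closing </p> tags written out.
HEADER_BLOCK = re.compile(
    r"</p>\s+<p[^>]*> </p>\s+"
    r"((<p[^>]*>\d.+</p>)|(<p.+\d</p>))"
    r"\s+<p[^>]*> </p>\s+<p[^>]*>"
)


def strip_headers(epub_bytes: bytes) -> bytes:
    """Return a copy of the EPUB with each header block replaced by one space."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as src, \
            zipfile.ZipFile(out, "w") as dst:
        # infolist() keeps the archive order, so "mimetype" stays the first entry.
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename.lower().endswith((".html", ".xhtml", ".htm")):
                text = data.decode("utf-8")  # assumption: UTF-8 text files
                data = HEADER_BLOCK.sub(" ", text).encode("utf-8")
            # Passing the ZipInfo reuses each entry's name and compression setting.
            dst.writestr(info, data)
    return out.getvalue()
```

A call would look like `open("fixed.epub", "wb").write(strip_headers(open("book.epub", "rb").read()))`, where both file names are placeholders. Working on a copy keeps the original conversion untouched if the expression turns out to match too much or too little.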

theducks · 11-26-2011, 08:06 PM

Moderator Notice
Please do not fragment threads on the same topic. Use New Reply or Quote in the original thread.


