Old 11-26-2011, 06:47 PM   #1
Corbett
Junior Member
Corbett began at the beginning.
 
Posts: 2
Karma: 10
Join Date: Nov 2011
Device: Android phone
Clearing trash while converting: finding it with regular expressions

I spent a few hours trying to find out how to get rid of my scanned page headers (without doing any manual work) in a LIT file when converting to ePub with Calibre.

The guides I have read focus on regular expressions and how to use them, but that wasn't my big problem...

First I looked in the preview and saw a pattern. All header rows had a page number either at the beginning or at the end, and (almost) always had blank lines before and after...

All my rows begin with <p> and end with </p>.

Step 1. Get the rows that end with a number:
<p.+\d</p>
Step 2. Get the rows that begin with a number:
<p[^>]*>\d.+</p>
OK, I'm lazy, so I want to combine them, and then weed out the false positives (numbers that show up in the ordinary text of the book).
Step 3. Combine the above with |:
(<p[^>]*>\d.+</p>)|(<p.+\d</p>)
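
To see what the combined expression from step 3 actually catches, it can be run over a few rows in Python. This is only a sketch: the sample rows and the name find_candidates are invented for illustration, and it assumes Python's re module behaves like the regex engine behind Calibre's search and replace for patterns this simple.

```python
import re

# Combined pattern copied verbatim from step 3.
COMBINED = re.compile(r"(<p[^>]*>\d.+</p>)|(<p.+\d</p>)")

# Hypothetical sample rows, invented for illustration.
SAMPLE_ROWS = [
    "<p>12 SOME BOOK TITLE</p>",                  # header, number first
    '<p class="calibre1">CHAPTER TITLE 13</p>',   # header, number last
    "<p>He walked down the road.</p>",            # ordinary text
    "<p>The war ended in 1945</p>",               # ordinary text ending in a number
]


def find_candidates(rows):
    """Return the rows that the combined pattern matches."""
    return [row for row in rows if COMBINED.search(row)]


if __name__ == "__main__":
    for row in find_candidates(SAMPLE_ROWS):
        print(row)
```

On these rows the pattern picks up both headers, but also the "1945" row. That is the kind of false positive the blank-row check in the following steps is meant to filter out.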

Step 4. Now find the empty rows:
<p[^>]*> </p>
Step 5. I only want the header rows that have an "empty" row before and after:
<p[^>]*> </p>\s+((<p[^>]*>\d.+</p>)|(<p.+\d</p>))\s+<p[^>]*> </p>
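
The effect of requiring a blank row on both sides can be checked the same way. The sketch below assumes the "empty" rows contain a single ordinary space, as the pattern expects; if the converted markup uses a non-breaking space or an entity instead, the pattern would need adjusting. The sample markup and the name find_headers are made up for illustration.

```python
import re

# Pattern copied verbatim from step 5.
HEADER_BETWEEN_BLANKS = re.compile(
    r"<p[^>]*> </p>\s+((<p[^>]*>\d.+</p>)|(<p.+\d</p>))\s+<p[^>]*> </p>"
)

# Hypothetical markup, invented for illustration.
SAMPLE = "\n".join([
    "<p>The war ended in 1945</p>",       # ends in a number, but no blank rows around it
    "<p>and he went home, where he</p>",
    "<p> </p>",
    "<p>12 SOME BOOK TITLE</p>",          # scanned page header
    "<p> </p>",
    "<p>found the house empty.</p>",
])


def find_headers(markup):
    """Return the header rows (group 1) that sit between two blank rows."""
    return [m.group(1) for m in HEADER_BETWEEN_BLANKS.finditer(markup)]


if __name__ == "__main__":
    print(find_headers(SAMPLE))
```

Here only the header row is reported; the "1945" row is left alone because it has no blank rows around it.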

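Step 6. That got rid of my false positives. Finally, I want to get rid of the line breaks before and after as well, so the text flows a bit nicer. To do that, the match also has to include the </p> that closes the paragraph before and the <p> that opens the paragraph after: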

</p>\s+<p[^>]*> </p>\s+((<p[^>]*>\d.+</p>)|(<p.+\d</p>))\s+<p[^>]*> </p>\s+<p[^>]*>

(Step 7 - FAILED)
I want to use that expression and replace each match with a single space character... Unfortunately, this is where I failed.
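
For reference, this is what the intended replacement does when applied directly to markup with Python's re.sub. It is a sketch under assumptions: the sample markup and the name join_across_page_break are invented, and it says nothing about why the replacement was not applied inside Calibre.

```python
import re

# Pattern copied verbatim from step 6.
PAGE_BREAK = re.compile(
    r"</p>\s+<p[^>]*> </p>\s+((<p[^>]*>\d.+</p>)|(<p.+\d</p>))"
    r"\s+<p[^>]*> </p>\s+<p[^>]*>"
)

# Hypothetical markup, invented for illustration: one paragraph split by a page header.
SAMPLE = "\n".join([
    "<p>and he went home, where he</p>",
    "<p> </p>",
    "<p>12 SOME BOOK TITLE</p>",
    "<p> </p>",
    "<p>found the house empty.</p>",
])


def join_across_page_break(markup):
    """Replace blank row + header + blank row (and the tags around them) with one space."""
    return PAGE_BREAK.sub(" ", markup)


if __name__ == "__main__":
    print(join_across_page_break(SAMPLE))
    # prints: <p>and he went home, where he found the house empty.</p>
```

Because the match starts at the closing </p> of the paragraph before and ends at the opening <p> of the paragraph after, the two halves end up in a single paragraph.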

My book did not come out of the conversion with the wasted lines deleted; for some reason the search and replace wasn't applied...

Could it be that Search & Replace is done after conversion, so that I have to guess what the converted document looks like when I search and replace in Calibre's conversion module?

But this may still be useful for those who are able to actually replace things :/

Last edited by Corbett; 11-26-2011 at 06:51 PM.
Old 11-26-2011, 07:16 PM   #3
Serpentine
Evangelist
Serpentine ought to be getting tired of karma fortunes by now.
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
It's generally easier to do this stuff outside of Calibre. Do the conversion to EPUB/HTMLZ first and try to get things into reasonable shape; once that's done, take the markup and process it by hand in something easier. RegexBuddy works well; I'm sure there are free alternatives, though I never found one with a good spread of features. Sigil is nice for EPUB editing, but the current release has some problems with regex (it's fixed for the next release, though there's no real preview).

It's also pretty tricky to help without a sample to work from. If you provide one, I'm sure I can work out a better way (there are a few problems with the regex there that will most likely miss things).
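
The tools named above are RegexBuddy and Sigil; as a swapped-in alternative, a short script can also check a pattern against an already converted EPUB outside of Calibre. The sketch below only reads the file and counts matches, it does not rewrite the book. The function name count_header_matches and the choice of file extensions are assumptions for illustration; the pattern is the one from the first post.

```python
import re
import zipfile

# "Header between blank rows" pattern from the first post.
HEADER_BETWEEN_BLANKS = re.compile(
    r"<p[^>]*> </p>\s+((<p[^>]*>\d.+</p>)|(<p.+\d</p>))\s+<p[^>]*> </p>"
)

# Assumed extensions for the markup files inside the EPUB container.
MARKUP_EXTENSIONS = (".html", ".xhtml", ".htm")


def count_header_matches(epub):
    """Return {file name: match count} for the markup files in an EPUB.

    `epub` may be a path or a binary file-like object; an EPUB is a zip archive.
    """
    counts = {}
    with zipfile.ZipFile(epub) as archive:
        for name in archive.namelist():
            if name.lower().endswith(MARKUP_EXTENSIONS):
                markup = archive.read(name).decode("utf-8", errors="replace")
                counts[name] = sum(1 for _ in HEADER_BETWEEN_BLANKS.finditer(markup))
    return counts
```

Calling count_header_matches("book.epub") (a hypothetical file name) gives a quick per-file overview of how often the pattern hits in the converted markup, before any replacement is attempted.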
Old 11-26-2011, 08:06 PM   #4
theducks
Well trained by Cats
theducks ought to be getting tired of karma fortunes by now.
Posts: 29,804
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Moderator Notice
Please do not fragment threads on the same topic. Use a new reply or a quote in the original thread.