11-26-2011, 06:47 PM | #1 |
Junior Member
Posts: 2
Karma: 10
Join Date: Nov 2011
Device: Android phone
|
Clearing trash while converting.. finding with regular expressions
I spent a few hours trying to find out how to get rid of my scanned page headers (without doing any manual work) in a LIT file when converting to ePub with Calibre.
The guides I have read focus on regular expressions and how to use them, but that wasn't my big problem...
First I looked in the preview and saw a pattern. All header rows had a page number either at the beginning or at the end, and (almost) always had blank lines before and after... All my rows begin with <p> and end with </p>.
Step 1. Get the rows that end with a number:
<p.+\d</p>
Step 2. Get the rows that begin with a number:
<p[^>]*>\d.+</p>
OK, I'm lazy, so I want to combine them and then weed out the false positives (when I get numbers in the ordinary text of the book).
Step 3. Combine the two expressions above with |:
(<p[^>]*>\d.+</p>)|(<p.+\d</p>)
Step 4. Now find the empty rows:
<p[^>]*> </p>
Step 5. I only want the header rows that have an "empty" row before and after:
<p[^>]*> </p>\s+((<p[^>]*>\d.+</p>)|(<p.+\d</p>))\s+<p[^>]*> </p>
Step 6. That got rid of my false positives, but finally I also want to get rid of the line break before and after, to make the text flow a bit nicer:
</p>\s+<p[^>]*> </p>\s+((<p[^>]*>\d.+</p>)|(<p.+\d</p>))\s+<p[^>]*> </p>\s+<p[^>]*>
Step 7 (FAILED). I want to use that expression and replace each match with a single space character... Unfortunately I failed there. The converted book still contained the wasted lines; for some reason the search and replace wasn't applied. Could it be that search and replace is done after conversion, so that I have to guess what the converted document looks like when I search and replace in Calibre's conversion module?
But this may still be useful for those who are able to actually replace things :/
Last edited by Corbett; 11-26-2011 at 06:51 PM. |
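For anyone who wants to try the expressions outside the conversion dialog, here is a minimal Python sketch of steps 1 to 7. The three patterns are copied from the post unchanged; the sample markup and the variable names are invented for illustration, and a real book will probably look different (for example, the "empty" rows may contain a non-breaking space rather than a plain space).

```python
import re

# Building blocks, copied from the post as written.
NUM_AT_START = r"<p[^>]*>\d.+</p>"   # step 2: row begins with a number
NUM_AT_END = r"<p.+\d</p>"           # step 1: row ends with a number
EMPTY = r"<p[^>]*> </p>"             # step 4: "empty" row (a single space)

# Step 6: header row with an empty row on each side, plus the closing
# </p> before it and the opening <p> after it.
HEADER_BLOCK = re.compile(
    r"</p>\s+" + EMPTY + r"\s+"
    + "((" + NUM_AT_START + ")|(" + NUM_AT_END + "))"
    + r"\s+" + EMPTY + r"\s+<p[^>]*>"
)

# Invented sample: two scanned page headers interrupting one sentence,
# plus a digit in ordinary text that must survive.
sample = (
    "<p>The sentence runs on to the</p>\n"
    "<p> </p>\n"
    "<p>12 SOME BOOK TITLE</p>\n"
    "<p> </p>\n"
    "<p>next page, where 3 cats wait until the</p>\n"
    "<p> </p>\n"
    "<p>CHAPTER ONE 13</p>\n"
    "<p> </p>\n"
    "<p>story ends.</p>\n"
)

# Step 7: replace every match with a single space.
cleaned = HEADER_BLOCK.sub(" ", sample)
print(cleaned)
```

On this sample both headers disappear and the three text fragments are joined into one paragraph, while the "3" inside the sentence is left alone because that row neither starts nor ends with a digit.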
11-26-2011, 07:16 PM | #3 |
Evangelist
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
|
It's generally easier to do this stuff outside of Calibre: do the conversion to EPUB/HTMLZ and try to get things reasonable, and once that's done, take the markup and process it by hand in something easier. RegexBuddy works well, though I'm sure there are free alternatives; I never found one with a good spread of features. Sigil is nice for EPUB editing, however the current release has some problems with regex (though it's fixed for the next release! - no real preview, however).
It's also pretty tricky to help without a sample to work from. If you provide one, I'm sure I can work out a better way (there are a few problems with the regex there that will most likely miss things). |
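As one rough illustration of processing the markup outside Calibre, the Python sketch below runs the combined expression from the first post over a folder of already-extracted HTML files. It assumes the book has been unpacked into a directory of .html/.xhtml files; the function names, the dry-run flag and the sample file are all hypothetical, not part of any tool mentioned in this thread. The dry run only lists what would be removed, as a crude stand-in for a preview.

```python
import re
import tempfile
from pathlib import Path

# Combined expression from the first post, unchanged.
HEADER_BLOCK = re.compile(
    r"</p>\s+<p[^>]*> </p>\s+"
    r"((<p[^>]*>\d.+</p>)|(<p.+\d</p>))"
    r"\s+<p[^>]*> </p>\s+<p[^>]*>"
)


def clean_file(path, dry_run=True):
    """Return the header rows the pattern matches; rewrite the file unless dry_run."""
    text = path.read_text(encoding="utf-8")
    hits = [m.group(1) for m in HEADER_BLOCK.finditer(text)]
    if hits and not dry_run:
        path.write_text(HEADER_BLOCK.sub(" ", text), encoding="utf-8")
    return hits


def clean_tree(root, dry_run=True):
    """Apply clean_file to every HTML-like file under root; map file name to hits."""
    report = {}
    for path in sorted(root.rglob("*")):
        if path.suffix.lower() in {".html", ".htm", ".xhtml"}:
            report[path.name] = clean_file(path, dry_run)
    return report


# Demo on an invented one-file "book" in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "index.html"
    page.write_text(
        "<p>end of one page</p>\n<p> </p>\n<p>BOOK TITLE 7</p>\n"
        "<p> </p>\n<p>start of the next</p>\n",
        encoding="utf-8",
    )
    preview = clean_tree(Path(tmp))            # dry run: nothing is written
    clean_tree(Path(tmp), dry_run=False)       # second pass rewrites the file
    result = page.read_text(encoding="utf-8")

print(preview)
print(result)
```

Checking the dry-run report before rewriting makes it easier to spot false positives, or headers the expression misses, before anything is changed on disk.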
11-26-2011, 08:06 PM | #4 |
Well trained by Cats
Posts: 29,804
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Moderator Notice
Please do not fragment threads on the same topic. Use New Reply or Quote in the original thread. |