10-10-2010, 09:59 AM | #1 |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
Is it possible to remove (from an epub) leftover headers / footers with these alternating formats:
nn title
chapter-title nn
where nn is a 2-digit number, which maybe becomes 3 digits later on? I.e. the original book would have had page number + book title on alternate pages, with chapter title + page number on the other pages.
It's not practical to do this via RTF & Word because of the ever-changing page number. Can it be done via regex, & do I need to go into & back out of some other format en route? If so, what's a suitable syntax please?
It seems to be quite a common leftover in books that have been through other conversions before I found them. Just removing the title + nn lines would be a start, if the varying chapter names are too difficult to automate.
I want to end up with mobi; I only have an epub source. The default structure detection / header removal does not seem to shift this stuff.
2 samples follow, where Mexico is the book title, "The Spaniard" is a chapter title, & 32, 33 are page numbers. NB these will often appear mid-sentence, depending on where the original page break occurred:
.......will be most happy to accept,' the girl's mother quickly replied, 32 Mexico having no intention of leaving her daughter alone with any man ...
.....fight of the second matador. The Spaniard 33 Dofia Raquel slapped her son's hand sharply and said, 'No .... |
10-10-2010, 10:42 AM | #2 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Yes, read up here:
http://calibre-ebook.com/user_manual/regexp.html |
10-10-2010, 10:50 AM | #3 | |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
Quote:
A little play with the wizard indicates that this will work for the title:
<p class="calibre1">[0-9]+ Mexico</p>
but I'm not sure how to generalise it so that it also takes out the variable phrases which form chapter names. I need something that takes out stuff like
<p class="calibre1">The Cactus and the Maguey 11</p>
where the text can be any phrase fragment followed by a number. Still, it's a start, thanks.
Inspecting the .mobi output, the above regex has done its job, but I also now see that the text is littered with OCR scan errors, so various corruptions of "Mexico" still sneak through. I read elsewhere that there are no good sources for this & for several other books by James Michener, & no Kindle versions on sale either.
Last edited by cybmole; 10-10-2010 at 11:10 AM. |
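One possible way to generalise this, sketched in Python rather than in the calibre wizard: treat any calibre1 paragraph that holds nothing but a short phrase with a 1-3 digit number at its start or end as a leftover header. The 40-character cap and the sample markup (including the "Mexlco" OCR variant) are assumptions made up for illustration, not taken from the actual book. A short genuine paragraph that ends in a number would also match, so the sketch lists the candidates for review before deleting anything.

```python
import re

# Hypothetical generalisation: a whole paragraph that is just a short phrase
# with a 1-3 digit page number at the start or at the end.
# The 40-character cap is an arbitrary guess; tune it against the real book.
HEADER = re.compile(
    r'<p class="calibre1">\s*'
    r'(?:\d{1,3}\s+[^<\d]{1,40}|[^<\d]{1,40}\s+\d{1,3})'
    r'\s*</p>'
)

# Made-up sample markup, including an OCR-corrupted title.
html = (
    '<p class="calibre1">32 Mexico</p>'
    '<p class="calibre1">33 Mexlco</p>'
    '<p class="calibre1">The Cactus and the Maguey 11</p>'
    '<p class="calibre1">having no intention of leaving her daughter '
    'alone with any man</p>'
)

# Dry run: collect the candidates so they can be checked by eye first.
candidates = HEADER.findall(html)
for c in candidates:
    print(c)

# Only then strip them.
cleaned = HEADER.sub('', html)
```

Because the pattern does not name the title at all, OCR corruptions of "Mexico" are caught too, at the cost of a higher risk of false positives.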
|
10-10-2010, 10:56 AM | #4 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
|
10-10-2010, 01:12 PM | #5 | |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Quote:
Code:
<p\sclass="calibre1">([0-9]+\s*Mexico\s*|[a-zA-Z\s]*?\d+\s*)</p>
That regex is risky to use for removal, since it will match any paragraph made up of letters followed by a number, though it could help you find all the chapter names for the safer approach. A MUCH safer regex would be to look for all the chapter names and just put the beginning of each of them in your pattern:
Code:
<p\sclass="calibre1">([0-9]+\s*Mexico\s*|(The\sCactus|Start\s*of\s*Two|Start\s*of\s*Three).*?\d+\s*)</p> |
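To sanity-check the safer pattern outside calibre, here is a minimal Python sketch. It uses the second pattern from this post with the alternation group closed before `</p>` so that it compiles, and with "The Spaniard" swapped in for the placeholder chapter names. The sample markup is made up from the snippets quoted earlier in the thread, and it assumes each header sits in its own calibre1 paragraph.

```python
import re

# Safer pattern: the book title with a leading page number, or a known
# chapter-name prefix followed by anything up to a trailing page number.
# The chapter prefixes listed here are examples only; extend as needed.
HEADER = re.compile(
    r'<p\sclass="calibre1">'
    r'([0-9]+\s*Mexico\s*'
    r'|(The\sCactus|The\sSpaniard).*?\d+\s*)'
    r'</p>'
)

# Made-up sample: headers interrupting running text, one paragraph each.
sample = (
    '<p class="calibre1">the girl\'s mother quickly replied,</p>'
    '<p class="calibre1">32 Mexico</p>'
    '<p class="calibre1">having no intention of leaving</p>'
    '<p class="calibre1">The Cactus and the Maguey 11</p>'
    '<p class="calibre1">The Spaniard 33</p>'
    '<p class="calibre1">slapped her son\'s hand</p>'
)

cleaned = HEADER.sub('', sample)
print(cleaned)
```

One caveat: `.*?` is not limited to a single paragraph, so a body paragraph that happens to start with one of the chapter prefixes could be swallowed up to the next number before a `</p>`. Replacing `.*?` with `[^<]*?` would keep each match inside one paragraph.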
|
10-10-2010, 01:54 PM | #6 |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
That has been most helpful - thanks to all. I have learned a little regex & also learned that my source is not that great.
Here's what someone on another forum has to say about the Michener sources (maybe I'll have to go get paper copies for some of these!): 'Centennial' and 'Chesapeake' are very good and seem to have been professional lits. 'The Novel' is all one paragraph... 'Hawaii' seems to have been written completely in italics and is littered with page numbers. There seems to be something wrong with the lit file as its html is missing important elements. Calibre is unable to read the pdb version. 'The Bridges at Toko-Ri' is reasonably good. 'Space' is the result of an automated conversion that didn't really work. 'Recessional' has the odd error and has lost its structure, but is otherwise readable. 'Poland' is another automated conversion that hasn't been cleaned up. 'The Covenant', 'Legacy', 'The Source' and 'Mexico' appear to be the raw output of OCR scans and are full of errors. Someone with the original text to hand would need to do a lot of work on these before they were remotely readable..... |
10-10-2010, 07:05 PM | #7 |
US Navy, Retired
Posts: 9,864
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Nexus 7
|
We're always glad to help anyone out, but there is no need to discuss on MobileRead how bad pirated source files may or may not be.
|
10-11-2010, 05:01 AM | #8 | |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
Quote:
I was just pointing out A) that there are NO legal sources of these ebooks, and B) what sources there are, are crap. If that offends your sensibilities then have a mod remove the post, or even the whole thread. But please, this smacks of hypocrisy. Here's a free definition: http://en.wikipedia.org/wiki/Hypocrisy
If everyone here did nothing but read & store their DRM'd purchases, there would be no need for calibre's conversion tools & no need for any how-to discussions. Now feel free to have the last word: point me at the forum rules, whatever, & I'll shut up. |
|
10-11-2010, 06:28 AM | #9 | |
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
|
Quote:
And this has nothing to do with hypocrisy. I have several legal PDF files that, were I to convert them, would require heavy use of regular expressions to remove junk. |
|
10-11-2010, 11:13 AM | #10 |
Well trained by Cats
Posts: 29,817
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
There are plenty of legal reasons to use Calibre conversions.
Your reader R bit the dust and they no longer make Brand R, which used a proprietary format. You get Brand A, which uses another (almost) proprietary format. Enter Calibre, which allows you to convert books YOU BOUGHT (not this license BS they have today). |
10-11-2010, 11:35 AM | #11 |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
Well, technically that's not legal either.
Your example is akin to converting your purchased music from CD to i-gizmo or vice versa. I don't have an issue with it, but corporate lawyers see the world differently. They'll argue that you don't buy (i.e. own) an e-book, ever; you just own a licence to view it on a particular gadget in a particular format, until such time as they change their corporate minds (like in the 1984 Kindle fiasco). |
10-11-2010, 11:42 AM | #12 | |
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
|
Quote:
|
|
10-11-2010, 11:46 AM | #13 |
creator of calibre
Posts: 43,866
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Jeez it's not like calibre is used only for purchased books. I use it to convert my personal documents all the time.
|
10-11-2010, 11:57 AM | #14 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
|
|
|