is it possible to remove ( from epub)...

cybmole · 10-10-2010, 09:59 AM

leftover headers / footers with these alternating formats;

nn title
chapter-title nn

where nn is a 2-digit number, which may become 3 digits later on.

i.e. original book would have had page number + book title on alternate pages, with chapter title +page number on the other pages.

it's not practical to do this via RTF & Word because the actual page number keeps changing.

can it be done via regex, & do I need to go into & back out of some other format en route? if so, what's a suitable syntax please?

it seems to be quite a common left-over in books that have been through other conversions before I found them.

just removing "nn title" would be a start, if the varying chapter names are too difficult to automate.

I want to end up with mobi, I only have epub source. the default structure detection / header removal does not seem to shift this stuff ?

2 samples follow, where "Mexico" is the book title, "The Spaniard" is a chapter title, & 32, 33 are page numbers.

nb these will often appear mid-sentence, depending on where the original page break occurred:
.......will be most happy to accept,' the girl's mother quickly replied,
32 Mexico
having no intention of leaving her daughter alone with any man ...

.....fight of the second matador.
The Spaniard 33
Dofia Raquel slapped her son's hand sharply and said, 'No ....

ldolse · 10-10-2010, 10:42 AM

Yes, read up here:
http://calibre-ebook.com/user_manual/regexp.html

cybmole · 10-10-2010, 10:50 AM

Quote:

Originally Posted by ldolse

Yes, read up here:
http://calibre-ebook.com/user_manual/regexp.html

ok -

a little play with the wizard indicates that this will work for title
[0-9]+ Mexico

but I'm not sure how to generalise it so that it also takes out the variable phrases which form chapter names ?
I need something that takes out stuff like
The Cactus and the Maguey 11 where the text can be any phrase fragment which is followed by a number ?

still, it's a start, thanks.
inspecting the .mobi output, the above regex has done its job, but I also now see that the text is littered with OCR scan errors, so various corruptions of Mexico still sneak through.
I read elsewhere that there are no good sources for this & for several other books by James Michener, & no kindle versions on sale either.

Starson17 · 10-10-2010, 10:56 AM

Quote:

Originally Posted by cybmole

ok - but if I just go epub to mobi, are the regex thingies applied during the conversion ?

Yes, if the Remove Header/Footer option is turned on (and the correct regex is used). See Structure Detection and the Remove Header/Footer options.

ldolse · 10-10-2010, 01:12 PM

Quote:

Originally Posted by cybmole

ok -

a little play with the wizard indicates that this will work for title
[0-9]+ Mexico

but I'm not sure how to generalise it so that it also takes out the variable phrases which form chapter names ?
I need something that takes out stuff like
The Cactus and the Maguey 11 where the text can be any phrase fragment which is followed by a number ?

still, it's a start, thanks.
inspecting the .mobi output, the above regex has done its job, but I also now see that the text is littered with OCR scan errors, so various corruptions of Mexico still sneak through.
I read elsewhere that there are no good sources for this & for several other books by James Michener, & no kindle versions on sale either.

The tutorial does cover your scenario, but since you went to the effort of reading it, here's your answer. You probably don't want to match 'any phrase' followed by a number; I wouldn't risk it (too big a chance of removing real content). If you did, the regex would look like this:

Code:

<p\sclass="calibre1">([0-9]+\s*Mexico\s*|[a-zA-Z\s]*?\d+\s*)</p>

Be sure to check every single match in the wizard if you do that to make sure it doesn't overmatch. With a Michener book you might be checking for a while....
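To make the overmatch risk concrete, here is a minimal Python sketch that lists what the broad pattern would remove, so each hit can be eyeballed before converting. It assumes the pattern with its group closed before `</p>`, and that every leftover header sits in its own `<p class="calibre1">` paragraph. The `SAMPLE` markup and the `list_matches` helper are made up for illustration; the real file's markup may well differ.

```python
import re

# Broad pattern: "nn Mexico" OR any run of letters/spaces followed by a number.
# Assumes the group is closed before </p>.
BROAD = re.compile(
    r'<p\sclass="calibre1">([0-9]+\s*Mexico\s*|[a-zA-Z\s]*?\d+\s*)</p>'
)

# Hypothetical markup modelled on the samples in the first post.
# The last paragraph is invented to show real content that ends in a number.
SAMPLE = (
    '<p class="calibre1">the girl\'s mother quickly replied,</p>\n'
    '<p class="calibre1">32 Mexico</p>\n'
    '<p class="calibre1">having no intention of leaving her daughter</p>\n'
    '<p class="calibre1">fight of the second matador.</p>\n'
    '<p class="calibre1">The Spaniard 33</p>\n'
    '<p class="calibre1">He was born in 1910</p>\n'
)

def list_matches(html):
    """Return the inner text of every paragraph the broad pattern would remove."""
    return [m.group(1).strip() for m in BROAD.finditer(html)]

for text in list_matches(SAMPLE):
    print(text)
```

On this sample it prints "32 Mexico", "The Spaniard 33" and also "He was born in 1910", which is exactly the kind of real content the broad pattern can eat.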

That regex could still help you find all the chapter names for the safer approach.

A MUCH safer regex would look for the known chapter names, with just the beginning of each one in your pattern:

Code:

<p\sclass="calibre1">([0-9]+\s*Mexico\s*|(The\sCactus|Start\s*of\s*Two|Start\s*of\s*Three).*?\d+\s*)</p>

You can see there that the starting words of each chapter are separated by | and grouped in parentheses.
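If the chapter list gets long, the safer pattern can be assembled from a plain list of chapter-title beginnings. The sketch below is only an illustration under assumptions: `build_header_pattern` and `strip_headers` are hypothetical helper names, the sample markup is invented, and `[^<]*?` is swapped in for `.*?` so that a match cannot run past the end of its own paragraph.

```python
import re

def build_header_pattern(book_title, chapter_starts):
    """Compile a pattern for 'nn Title' and 'Chapter start ... nn' paragraphs."""
    if not chapter_starts:
        # An empty list would degrade into the risky 'any phrase + number' match.
        raise ValueError("need at least one chapter-title beginning")
    title = re.escape(book_title)
    # Allow any run of whitespace between the words of each chapter start.
    chapters = "|".join(
        r"\s+".join(re.escape(word) for word in start.split())
        for start in chapter_starts
    )
    pattern = (
        r'<p\sclass="calibre1">\s*(?:[0-9]+\s*' + title + r'\s*'
        + r'|(?:' + chapters + r')[^<]*?\d+\s*)</p>\n?'
    )
    return re.compile(pattern)

def strip_headers(html, pattern):
    """Remove every paragraph that matches the header pattern."""
    return pattern.sub("", html)

# Hypothetical markup; the chapter names are the ones mentioned in this thread.
SAMPLE = (
    '<p class="calibre1">fight of the second matador.</p>\n'
    '<p class="calibre1">The Spaniard 33</p>\n'
    '<p class="calibre1">He was born in 1910</p>\n'
    '<p class="calibre1">32 Mexico</p>\n'
    '<p class="calibre1">having no intention</p>\n'
)

PATTERN = build_header_pattern("Mexico", ["The Cactus", "The Spaniard"])
cleaned = strip_headers(SAMPLE, PATTERN)
print(cleaned)
```

Here the two header paragraphs go, while "He was born in 1910" survives because it does not start with a listed chapter name. OCR-mangled spellings of a title will still slip through unless they are added to the list by hand.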

cybmole · 10-10-2010, 01:54 PM

that has been most helpful - thanks to all. I have learned a little regex & also learned that my source is not that great.
here's what someone on another forum has to say about the Michener sources: ( maybe I'll have to go get paper copies for some of these! )

'Centennial' and 'Chesapeake' are very good and seem to have been professional lits.
'The Novel' is all one paragraph...
'Hawaii' seems to have been written completely in italics and is littered with page numbers. There seems to be something wrong with the lit file as its html is missing important elements. Calibre is unable to read the pdb version.
'The Bridges at Toko-Ri' is reasonably good.
'Space' is the result of an automated conversion that didn't really work.
'Recessional' has the odd error and has lost its structure, but is otherwise readable.
'Poland' is another automated conversion that hasn't been cleaned-up.
'The Covenant', 'Legacy', 'The Source' and 'Mexico' appear to be the raw output of OCR scans and are full of errors. Someone with the original text to hand would need to do a lot of work on these before they were remotely readable.....

DoctorOhh · 10-10-2010, 07:05 PM

Quote:

Originally Posted by cybmole

here's what someone on another forum has to say about the Michener sources: ( maybe I'll have to go get paper copies for some of these! )

We're always glad to help anyone out, but there is no need to discuss on MobileRead how bad pirated source files may or may not be.

cybmole · 10-11-2010, 05:01 AM

Quote:

Originally Posted by dwanthny

We're always glad to help anyone out, but there is no need to discuss on MobileRead how bad pirated source files may or may not be.

techy -

I was just pointing out
A) that there are NO legal sources of these ebooks
B ) what sources there are, are crap

if that offends your sensibilities then have a mod remove the post, or even the whole thread.

but please - this smacks of hypocrisy - here's a free definition. http://en.wikipedia.org/wiki/Hypocrisy

- if everyone here did nothing but read & store their DRM'd purchases, there would be no need for calibre's conversion tools & no need for any how to discussions

now feel free to have the last word - point me at the forum rules , whatever & I'll shut up.

Manichean · 10-11-2010, 06:28 AM

Quote:

Originally Posted by cybmole

techy -

I was just pointing out
A) that there are NO legal sources of these ebooks
B ) what sources there are, are crap

if that offends your sensibilities then have a mod remove the post, or even the whole thread.

but please - this smacks of hypocrisy - here's a free definition. http://en.wikipedia.org/wiki/Hypocrisy

- if everyone here did nothing but read & store their DRM'd purchases, there would be no need for calibre's conversion tools & no need for any how to discussions

now feel free to have the last word - point me at the forum rules , whatever & I'll shut up.

Shutting up might be a good idea. Discussing DRM removal and/or illegal (or gray area) book sources is generally not looked kindly upon. If I remember correctly, it is, in fact, in the forum rules.
And this has nothing to do with hypocrisy- I have several legal PDF files, that, were I to convert them, would require heavy use of regular expressions to remove junk.

theducks · 10-11-2010, 11:13 AM

There are plenty of legal reasons to use Calibre Conversions.
Your reader R bit the dust and they no longer make Brand R which used a proprietary format. You get brand A which uses another (almost) proprietary format.
Enter Calibre, which allows you to convert books YOU BOUGHT (not this license BS they have today).

cybmole · 10-11-2010, 11:35 AM

well technically that's not legal either,
your example is akin to converting your purchased music from CD to I-gizmo or vice versa. I don't have an issue with it but corporate lawyers see the world differently.
they'll argue that you don't buy ( i.e. own) an e-book ever - you just own a licence to view it on a particular gadget in a particular format, until such time as they change their corporate minds ( like in the "1984" kindle fiasco)

Manichean · 10-11-2010, 11:42 AM

Quote:

Originally Posted by cybmole

well technically that's not legal either,
your example is akin to converting your purchased music from CD to I-gizmo or vice versa. I don't have an issue with it but corporate lawyers see the world differently.
they'll argue that you don't buy ( i.e. own) an e-book ever - you just own a licence to view it on a particular gadget in a particular format, until such time as they change their corporate minds ( like in the "1984" kindle fiasco)

That depends on where you live. For example, converting anything is legal here in Germany, as long as you don't circumvent any "effective technical deterrents". (At least, AFAICR. And your value of "effective" may vary...)

kovidgoyal · 10-11-2010, 11:46 AM

Jeez it's not like calibre is used only for purchased books. I use it to convert my personal documents all the time.

Starson17 · 10-11-2010, 11:57 AM

Quote:

Originally Posted by cybmole

well technically that's not legal either,
your example is akin to converting your purchased music from CD to I-gizmo or vice versa. I don't have an issue with it but corporate lawyers see the world differently.
they'll argue that you don't buy ( i.e. own) an e-book ever - you just own a licence to view it on a particular gadget in a particular format, until such time as they change their corporate minds ( like in the "1984" kindle fiasco)

And then the EFF will step in and argue that format shifting is just as legal for ebooks as time shifting was for the Sony Betamax VCR and both sides will spend gobs of money going to the Supreme Court (for U.S. citizens) or to wherever (in other countries) and arguing about fair use and other legal niceties. In the end, each user has to decide for him/herself whether what he/she's doing is legal/moral, and then accept the consequences of that decision.
