Old 10-10-2010, 09:59 AM   #1
cybmole
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
Is it possible to remove (from epub)...

leftover headers / footers with these alternating formats:

nn title
chapter-title nn

where nn is a 2-digit number, which presumably becomes 3 digits later on.

i.e. the original book would have had page number + book title on alternate pages, with chapter title + page number on the other pages.

It's not practical to do this via RTF & Word because the actual page number keeps changing.

Can it be done via regex, and do I need to go into & back out of some other format en route? If so, what's a suitable syntax, please?

It seems to be quite a common leftover in books that have been through other conversions before I found them.

Just removing "nn title" would be a start, if the varying chapter names are too difficult to automate.

I want to end up with mobi; I only have an epub source. The default structure detection / header removal does not seem to shift this stuff.

2 samples follow, where Mexico is the book title, "The Spaniard" is a chapter title, and 32 and 33 are page numbers.

NB: these will often appear mid-sentence, depending on where the original page break occurred:
.......will be most happy to accept,' the girl's mother quickly replied,
32 Mexico
having no intention of leaving her daughter alone with any man ...

.....fight of the second matador.
The Spaniard 33
Dofia Raquel slapped her son's hand sharply and said, 'No ....
Old 10-10-2010, 10:42 AM   #2
ldolse
Wizard
Posts: 1,337
Karma: 123457
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Yes, read up here:
http://calibre-ebook.com/user_manual/regexp.html
Old 10-10-2010, 10:50 AM   #3
cybmole
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
Quote:
Originally Posted by ldolse View Post
ok -

a little play with the wizard indicates that this will work for title
<p class="calibre1">[0-9]+ Mexico</p>

but I'm not sure how to generalise it so that it also takes out the variable phrases which form chapter names ?
I need something that takes out stuff like
<p class="calibre1">The Cactus and the Maguey 11</p> where the text can be any phrase fragment which is followed by a number ?

still, it's a start, thanks.
Inspecting the .mobi output, the above regex has done its job, but I also now see that the text is littered with OCR scan errors, so various corruptions of "Mexico" still sneak through.
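For anyone following along, the behaviour described above can be reproduced outside calibre with Python's `re` module, since calibre's regex support is based on Python's. This is only a minimal sketch: the markup and the misread "Mexlco" line are made-up samples, not lines from the actual book.

```python
import re

# Pattern from the post above: page number followed by the book title.
TITLE_HEADER = re.compile(r'<p class="calibre1">[0-9]+ Mexico</p>')

# Hypothetical sample markup; the last line imitates an OCR misread.
html = (
    '<p class="calibre1">the girl\'s mother quickly replied,</p>\n'
    '<p class="calibre1">32 Mexico</p>\n'
    '<p class="calibre1">having no intention of leaving her daughter</p>\n'
    '<p class="calibre1">34 Mexlco</p>\n'
)

# Remove every exact match; anything the OCR mangled is left behind.
cleaned = TITLE_HEADER.sub('', html)
print(cleaned)
```

The clean header is removed, while the corrupted one survives because the pattern only matches the exact spelling.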
I read elsewhere that there are no good sources for this & for several other books by James Michener, & no Kindle versions on sale either.

Last edited by cybmole; 10-10-2010 at 11:10 AM.
Old 10-10-2010, 10:56 AM   #4
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by cybmole View Post
ok - but if I just go epub to mobi, are the regex thingies applied during the conversion ?
Yes, if the Remove Header/Footer option is turned on (and the correct regex is used). See Structure Detection and the Remove Header/Footer options.
Old 10-10-2010, 01:12 PM   #5
ldolse
Wizard
Posts: 1,337
Karma: 123457
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Quote:
Originally Posted by cybmole View Post
ok -

a little play with the wizard indicates that this will work for title
<p class="calibre1">[0-9]+ Mexico</p>

but I'm not sure how to generalise it so that it also takes out the variable phrases which form chapter names ?
I need something that takes out stuff like
<p class="calibre1">The Cactus and the Maguey 11</p> where the text can be any phrase fragment which is followed by a number ?

still, it's a start, thanks.
Inspecting the .mobi output, the above regex has done its job, but I also now see that the text is littered with OCR scan errors, so various corruptions of "Mexico" still sneak through.
I read elsewhere that there are no good sources for this & for several other books by James Michener, & no Kindle versions on sale either.
The tutorial does cover your scenario, but since you went to the effort of reading it, here's your answer. You probably don't want to match 'any phrase' followed by a number; I wouldn't risk it (too big a chance of removing real content). But if you did, the regex would look like this:
Code:
<p\sclass="calibre1"?>([0-9]+\s*Mexico\s*|[a-zA-Z\s]*?\d+\s*)</p>
Be sure to check every single match in the wizard if you do that to make sure it doesn't overmatch. With a Michener book you might be checking for a while....
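To see why checking every match matters, here is a rough Python sketch that lists what the broad pattern would remove. It uses a version of the pattern with the parentheses balanced and a plain closing quote. The sample markup, including the "He was born in 1910" paragraph, is invented for illustration.

```python
import re

# Broad pattern: "<number> Mexico", or any run of letters/spaces ending in a number.
BROAD = re.compile(
    r'<p\sclass="calibre1">([0-9]+\s*Mexico\s*|[a-zA-Z\s]*?\d+\s*)</p>'
)

# Hypothetical sample markup: two real headers and one genuine short paragraph.
html = (
    '<p class="calibre1">replied,</p>\n'
    '<p class="calibre1">32 Mexico</p>\n'
    '<p class="calibre1">The Spaniard 33</p>\n'
    '<p class="calibre1">He was born in 1910</p>\n'
)

# Collect the text of each match so it can be reviewed before deleting anything.
matches = [m.group(1) for m in BROAD.finditer(html)]
for text in matches:
    print(repr(text))
```

The third match is real content, not a header. That is exactly the overmatch risk described above.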

That regex could at least help you find all the chapter names for a safer approach.

A MUCH safer regex would list the chapter names and put just the beginning of each in your pattern:
Code:
<p\sclass="calibre1">([0-9]+\s*Mexico\s*|(The\sCactus|Start\s*of\s*Two|Start\s*of\s*Three).*?\d+\s*)</p>
You can see there that the starting words of each chapter title are separated by | and surrounded by parentheses.
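As a rough sketch of the safer approach, the alternation can be built from a list of chapter-title openings rather than typed by hand. The helper name `build_header_pattern` is hypothetical, and the sample markup is invented. One deliberate difference from the pattern above: it uses `[^<]*?` in place of `.*?`, so a match cannot run past the paragraph's closing tag.

```python
import re

def build_header_pattern(book_title, chapter_starts):
    """Compile a pattern for '<page> <book title>' and '<chapter start...> <page>' headers."""
    def words(phrase):
        # Escape each word and allow any run of whitespace between words.
        return r'\s+'.join(re.escape(w) for w in phrase.split())

    chapters = '|'.join(words(s) for s in chapter_starts)
    return re.compile(
        r'<p\sclass="calibre1">'
        r'(?:[0-9]+\s*' + words(book_title) + r'\s*'
        r'|(?:' + chapters + r')[^<]*?\d+\s*)'
        r'</p>'
    )

# Hypothetical sample markup; the last paragraph is genuine content.
html = (
    '<p class="calibre1">32 Mexico</p>\n'
    '<p class="calibre1">The Spaniard 33</p>\n'
    '<p class="calibre1">The Cactus and the Maguey 11</p>\n'
    '<p class="calibre1">He was born in 1910</p>\n'
)

pattern = build_header_pattern('Mexico', ['The Cactus', 'The Spaniard'])
cleaned = pattern.sub('', html)
print(cleaned)
```

Only paragraphs that start with a listed chapter opening (or match the page number + title form) are removed, so the genuine paragraph ending in a year is left alone.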
Old 10-10-2010, 01:54 PM   #6
cybmole
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
That has been most helpful, thanks to all. I have learned a little regex & also learned that my source is not that great.
Here's what someone on another forum has to say about the Michener sources (maybe I'll have to go get paper copies for some of these!):

'Centennial' and 'Chesapeake' are very good and seem to have been professional lits.
'The Novel' is all one paragraph...
'Hawaii' seems to have been written completely in italics and is littered with page numbers. There seems to be something wrong with the lit file as its html is missing important elements. Calibre is unable to read the pdb version.
'The Bridges at Toko-Ri' is reasonably good.
'Space' is the result of an automated conversion that didn't really work.
'Recessional' has the odd error and has lost its structure, but is otherwise readable.
'Poland' is another automated conversion that hasn't been cleaned-up.
'The Covenant', 'Legacy', 'The Source' and 'Mexico' appear to be the raw output of OCR scans and are full of errors. Someone with the original text to hand would need to do a lot of work on these before they were remotely readable.....
Old 10-10-2010, 07:05 PM   #7
DoctorOhh
US Navy, Retired
Posts: 9,897
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Kindle PaperWhite SE 11th Gen
Quote:
Originally Posted by cybmole View Post
here's what someone on another forum has to say about the michener sources: ( maybe I'll have to go get paper copies for some of these! )
We're always glad to help anyone out, but there is no need to discuss on MobileRead how bad pirated source files may or may not be.
Old 10-11-2010, 05:01 AM   #8
cybmole
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
Quote:
Originally Posted by dwanthny View Post
We're always glad to help anyone out, but there is no need to discuss on MobileRead how bad pirated source files may or may not be.
techy -

I was just pointing out
A) that there are NO legal sources of these ebooks
B ) what sources there are, are crap

if that offends your sensibilities then have a mod remove the post, or even the whole thread.

but please - this smacks of hypocrisy - here's a free definition. http://en.wikipedia.org/wiki/Hypocrisy

- if everyone here did nothing but read & store their DRM'd purchases, there would be no need for calibre's conversion tools & no need for any how to discussions

now feel free to have the last word - point me at the forum rules , whatever & I'll shut up.
Old 10-11-2010, 06:28 AM   #9
Manichean
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
Quote:
Originally Posted by cybmole View Post
techy -

I was just pointing out
A) that there are NO legal sources of these ebooks
B ) what sources there are, are crap

if that offends your sensibilities then have a mod remove the post, or even the whole thread.

but please - this smacks of hypocrisy - here's a free definition. http://en.wikipedia.org/wiki/Hypocrisy

- if everyone here did nothing but read & store their DRM'd purchases, there would be no need for calibre's conversion tools & no need for any how to discussions

now feel free to have the last word - point me at the forum rules , whatever & I'll shut up.
Shutting up might be a good idea. Discussing DRM removal and/or illegal (or gray area) book sources is generally not looked kindly upon. If I remember correctly, it is, in fact, in the forum rules.
And this has nothing to do with hypocrisy- I have several legal PDF files, that, were I to convert them, would require heavy use of regular expressions to remove junk.
Old 10-11-2010, 11:13 AM   #10
theducks
Well trained by Cats
Posts: 31,241
Karma: 61360164
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
There are plenty of legal reasons to use Calibre Conversions.
Your reader R bit the dust and they no longer make Brand R which used a proprietary format. You get brand A which uses another (almost) proprietary format.
Enter Calibre, which allows you to convert books YOU BOUGHT (not this license BS they have today).
Old 10-11-2010, 11:35 AM   #11
cybmole
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
well technically that's not legal either,
your example is akin to converting your purchased music from CD to I-gizmo or vice versa. I don't have an issue with it but corporate lawyers see the world differently.
they'll argue that you don't buy (i.e. own) an e-book ever; you just own a licence to view it on a particular gadget in a particular format, until such time as they change their corporate minds (like in the 1984 Kindle fiasco)
Old 10-11-2010, 11:42 AM   #12
Manichean
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
Quote:
Originally Posted by cybmole View Post
well technically that's not legal either,
your example is akin to converting your purchased music from CD to I-gizmo or vice versa. I don't have an issue with it but corporate lawyers see the world differently.
they'll argue that you don't buy (i.e. own) an e-book ever; you just own a licence to view it on a particular gadget in a particular format, until such time as they change their corporate minds (like in the 1984 Kindle fiasco)
That depends on where you live. For example, converting anything is legal here in Germany, as long as you don't circumvent any "effective technical deterrents". (At least, AFAICR. And your value of "effective" may vary...)
Old 10-11-2010, 11:46 AM   #13
kovidgoyal
creator of calibre
Posts: 45,598
Karma: 28548962
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Jeez it's not like calibre is used only for purchased books. I use it to convert my personal documents all the time.
Old 10-11-2010, 11:57 AM   #14
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by cybmole View Post
well technically that's not legal either,
your example is akin to converting your purchased music from CD to I-gizmo or vice versa. I don't have an issue with it but corporate lawyers see the world differently.
they'll argue that you don't buy (i.e. own) an e-book ever; you just own a licence to view it on a particular gadget in a particular format, until such time as they change their corporate minds (like in the 1984 Kindle fiasco)
And then the EFF will step in and argue that format shifting is just as legal for ebooks as time shifting was for the Sony Betamax VCR and both sides will spend gobs of money going to the Supreme Court (for U.S. citizens) or to wherever (in other countries) and arguing about fair use and other legal niceties. In the end, each user has to decide for him/herself whether what he/she's doing is legal/moral, and then accept the consequences of that decision.
All times are GMT -4.