Old 03-30-2019, 06:30 PM   #1
Terrymonkey
Junior Member
Posts: 6
Karma: 10
Join Date: Mar 2019
Device: kindle
can calibre remove all page breaks in document?

Hi, I have three documents converted from a PDF with ABBYY: ODT, TXT and RTF. Can calibre convert any of them to AZW or MOBI and remove the page breaks from the original document? I think I did it once, but can't remember how. Heuristic processing may have done it.
Otherwise I may abandon my project, or just create an ebook with gaps all over the place. The original PDF is 400 pages long and I don't want to do it by hand!

Any help appreciated.
Old 03-31-2019, 01:37 AM   #2
CRussel
(he/him/his)
Posts: 12,160
Karma: 79742714
Join Date: Jul 2010
Location: Sunshine Coast, BC
Device: Oasis (Gen3),Paperwhite (Gen10), Voyage, Paperwhite(orig), Fire HD 8
First, don't attempt to do this as part of the conversion. I suppose it's possible, but troubleshooting the regex that way is a pain. Instead, convert to EPUB, then edit the ebook and use the regex Search and Replace there. (For help on this, see the calibre manual and All About Using Regular Expressions in Calibre.)

What is required will vary somewhat, depending on how your PDF converter renders new pages and what page marks are in the book. But I started out with something like this, and then tweaked it quite a bit for the Modesty Blaise books I was converting.
PHP Code:
</p>\s*<div\s+class="newpage"\s+id="page-[0-9]*"></div>\s*<p
That finds the end-of-paragraph tag followed by any amount of whitespace (possibly none), then <div, at least one whitespace character, then class="newpage", whitespace again, then id="page- and zero or more digits (the page number), a closing quote, the end of the tag and a closing HTML div tag, some more optional whitespace, and finally the start of the paragraph tag for the next paragraph.
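For anyone who wants to poke at the pattern outside the editor, here is a minimal sketch in Python that spells out the same regex piece by piece in verbose mode. The sample HTML and the name PAGE_BREAK are made up for illustration; your converter's output will probably differ.

```python
import re

# Same pattern as in the post, written in verbose mode so each piece
# can carry a comment. Whitespace inside the pattern is ignored here.
PAGE_BREAK = re.compile(
    r"""
    </p>                 # end of the paragraph before the page marker
    \s*                  # optional whitespace (including newlines)
    <div\s+              # opening div, then at least one whitespace character
    class="newpage"\s+   # the class the converter put on its page markers
    id="page-[0-9]*"     # page id; [0-9]* accepts zero or more digits
    ></div>              # end of the opening tag, then the closing div tag
    \s*                  # optional whitespace
    <p                   # start of the next paragraph tag (its '>' is not matched)
    """,
    re.VERBOSE,
)

# Hypothetical converter output: a page marker between two paragraphs.
sample = (
    '<p>The end of page one</p>\n'
    '<div class="newpage" id="page-12"></div>\n'
    '<p class="body">and the start of page two.</p>'
)

match = PAGE_BREAK.search(sample)
print(match.group(0) if match else "no match")
```

Because the pattern stops at <p without the closing >, it still matches when the next paragraph carries attributes. It does not match if the converter writes the id before the class, which is one of the things you may have to adjust.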

I would put </p>\n<p in the replace box, so the paragraph tags that the search consumed are put back without the page marker. Of course, this has the problem of leaving a paragraph break where you might not be at the end of a sentence.
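A rough Python sketch of the substitution, under the assumption that the replacement is </p>\n<p (the search starts on a closing </p> and ends on an opening <p, so both have to go back in). The function name strip_page_markers and the sample text are hypothetical.

```python
import re

# The search pattern from the post, unchanged.
PAGE_BREAK = re.compile(
    r'</p>\s*<div\s+class="newpage"\s+id="page-[0-9]*"></div>\s*<p'
)


def strip_page_markers(html: str) -> str:
    """Drop the page-marker div but keep the paragraph tags on either side."""
    # Assumed replacement: restore the </p> and <p that the match consumed.
    return PAGE_BREAK.sub("</p>\n<p", html)


# Hypothetical input where the page ended in the middle of a sentence.
sample = (
    '<p>It was a dark and</p>\n'
    '<div class="newpage" id="page-7"></div>\n'
    '<p>stormy night.</p>'
)

print(strip_page_markers(sample))
```

The marker is gone, but the sentence is still split across two paragraphs, which is the leftover problem mentioned in the post.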

Overall, this is something you'll have to play with a bit. DO read the regex link above; it will very much help you build the regex that will work for your specific books.

ETA: Note that the above isn't really PHP, but using that tag gives you some minimal syntax highlighting, possibly making it easier to parse. Or, perhaps not. Also, fixed the regex link in the first PP.

Last edited by CRussel; 03-31-2019 at 01:46 PM.
Old 03-31-2019, 11:42 AM   #3
theducks
Well trained by Cats
Posts: 29,804
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Seconded
Do a standard Convert (no tricky stuff), then clean up using the Editor.
Every document seems to do it differently (and some even do it differently from section to section).

Find a problem, fix it. Pay attention to the order you choose: fixing the hardest (most complicated) patterns first is usually best, because they may contain a simpler issue. If you remove the simpler issue first, the complicated one becomes even harder to detect.

Repeat until you find no more things to do.
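A minimal sketch of that find-fix-repeat loop in Python, assuming two made-up cleanup passes listed most specific first. The PASSES list, the clean function and the max_rounds limit are all hypothetical; in practice you would build the list up one problem at a time in the Editor.

```python
import re

# Hypothetical cleanup passes, most specific first.
PASSES = [
    # A page marker sitting between two paragraphs: keep the paragraph tags.
    (
        re.compile(r'</p>\s*<div\s+class="newpage"\s+id="page-[0-9]*"></div>\s*<p'),
        "</p>\n<p",
    ),
    # Any page marker left over elsewhere (assumed safe to drop outright).
    (
        re.compile(r'<div\s+class="newpage"\s+id="page-[0-9]*"></div>\s*'),
        "",
    ),
]


def clean(html: str, max_rounds: int = 10) -> str:
    """Apply every pass in order, repeating until a full round changes nothing."""
    for _ in range(max_rounds):
        before = html
        for pattern, replacement in PASSES:
            html = pattern.sub(replacement, html)
        if html == before:
            break
    return html


# Hypothetical input: one marker between paragraphs, one trailing marker.
sample = (
    '<p>One</p>\n<div class="newpage" id="page-2"></div>\n'
    '<p>Two</p>\n<div class="newpage" id="page-3"></div>\n'
)
print(clean(sample))
```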
Old 03-31-2019, 01:56 PM   #4
CRussel
Another trick I would suggest, based on experience with a number of books I've cleaned up this way: don't try to do it all in one pass. Take an iterative approach. So, in the example I showed above, I would then have done a search for \w*\b</p>\n<p> (a run of word characters ending at a word boundary, then an end-of-paragraph tag, a newline, and a paragraph tag). This would catch where my earlier substitution had left a paragraph break between two words in a sentence, rather than at the end of a sentence.
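Here is a small Python sketch of that second pass. The search pattern is the one from the post; the join step is only an assumption about what you might do with the hits (merge the two paragraphs with a space), and it will also merge real paragraphs that happen to end without punctuation, so review the hits before replacing anything. The function names are hypothetical.

```python
import re

# The follow-up search from the post: a paragraph that ends on a word
# character, with no closing punctuation before </p>.
MID_SENTENCE = re.compile(r"\w*\b</p>\n<p>")


def find_suspect_breaks(html: str) -> list[str]:
    """Return the last word before each paragraph break that looks mid-sentence."""
    return [m.group(0).split("</p>")[0] for m in MID_SENTENCE.finditer(html)]


def join_suspect_breaks(html: str) -> str:
    """Hypothetical fix: merge the two paragraphs with a single space."""
    return re.sub(r"(\w)</p>\n<p>", r"\1 ", html)


sample = "<p>It was a dark and</p>\n<p>stormy night.</p>\n<p>Nothing stirred.</p>"

print(find_suspect_breaks(sample))
print(join_suspect_breaks(sample))
```

Only the break after "and" is flagged; the one after "stormy night." ends in a full stop, so it is left alone.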

Still not perfect, but by now I'm starting to create a readable book instead of one that's too annoying to bother.
Old 04-01-2019, 10:00 AM   #5
Terrymonkey
Guys, I really appreciate your help. It looks like I am going to have to spend several hours revising/learning HTML editing and regex, but I knew I would sometime soon. I'll try and have a go myself, but in all honesty I think it likely I will be posting again in the next few weeks.
Thanks again
Robin
Old 04-01-2019, 11:13 AM   #6
theducks
Quote:
Originally Posted by Terrymonkey
Guys, I really appreciate your help. It looks like I am going to have to spend several hours revising/learning HTML editing and regex, but I knew I would sometime soon. I'll try and have a go myself, but in all honesty I think it likely I will be posting again in the next few weeks.
Thanks again
Robin
No problem with this type of help here. We help, you learn.
Google "cheat sheets" and grab the ones for CSS, regex (understand that not everything on them works with PCRE), and HTML (again, EPUB uses a subset, not the whole banana).
Old 04-02-2019, 01:34 AM   #7
CRussel
Yup, same here. Happy to stick my oar in to help.

Charlie.