03-30-2019, 06:30 PM | #1 |
Junior Member
Posts: 6
Karma: 10
Join Date: Mar 2019
Device: kindle
|
can calibre remove all page breaks in document?
Hi, I have three documents converted from a PDF with ABBYY: ODT, TXT and RTF. Can calibre convert any of them to AZW or MOBI and remove the page breaks from the original document? I think I did it once, but can't remember how. Heuristic processing may have done it.
Otherwise I may abandon my project, or just create an ebook with gaps all over the place. The original PDF is 400 pages long and I don't want to do it by hand! Any help appreciated. |
03-31-2019, 01:37 AM | #2 |
(he/him/his)
Posts: 12,160
Karma: 79742714
Join Date: Jul 2010
Location: Sunshine Coast, BC
Device: Oasis (Gen3),Paperwhite (Gen10), Voyage, Paperwhite(orig), Fire HD 8
|
First, don't attempt to do this as a conversion. I suppose it's possible, but a pain to troubleshoot the regex. Instead, convert to ePub, then edit the ebook and use the Regex Search and Replace there. (For the help on this, see the Calibre manual and All About Using Regular Expressions in Calibre.)
What is required will vary somewhat, depending on how your PDF converter renders new pages and what page marks are in the book. But I started out with something like this, and then tweaked it quite a bit for the Modesty Blaise books I was converting. PHP Code:
I would put <p>\n</p> in the replace box. Of course, this has the problem of inserting a break where you might not be at the end of the sentence. Overall, this is something you'll have to play with a bit. DO read the regex link above; it will very much help you build the regex that will work for your specific books.
ETA: Note that the above isn't really PHP, but using that tag gives you some minimal syntax highlighting, possibly making it easier to parse. Or, perhaps not. Also, fixed the regex link in the first paragraph.
Last edited by CRussel; 03-31-2019 at 01:46 PM. |
03-31-2019, 11:42 AM | #3 |
Well trained by Cats
Posts: 29,804
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Seconded
Do a standard Convert (no tricky stuff), then clean up using the Editor. Every document seems to do it differently (and some even do it differently by section).
Find a problem: fix it! Pay attention to the order you choose. Fixing the hardest (most complicated) problems first is usually best, because they may contain a simpler issue; if that simpler issue has already been removed, detecting the complicated one becomes even harder.
Repeat until you find no more things to do. |
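To make the ordering advice concrete, here is a minimal Python sketch of running clean-up passes in order and repeating until nothing changes. The two patterns, the `PASSES` list, the `clean` function, and the "MY BOOK TITLE" running header are all made up for illustration; none of them come from this thread, and a real book will need its own patterns.

```python
import re

# Hypothetical clean-up passes, most complicated first. Neither pattern comes
# from the thread; they only illustrate why the order matters.
PASSES = [
    # A page number followed by a running header: remove both together.
    (re.compile(r"<p>\d+</p>\n<p>MY BOOK TITLE</p>\n"), ""),
    # A bare page number in its own paragraph.
    (re.compile(r"<p>\d+</p>\n"), ""),
]


def clean(html: str) -> str:
    """Apply every pass in order, repeating until the text stops changing."""
    while True:
        before = html
        for pattern, replacement in PASSES:
            html = pattern.sub(replacement, html)
        if html == before:
            return html


sample = (
    "<p>First page text.</p>\n"
    "<p>12</p>\n"
    "<p>MY BOOK TITLE</p>\n"
    "<p>Second page text.</p>\n"
    "<p>13</p>\n"
    "<p>Third page text.</p>\n"
)
print(clean(sample))
```

If the bare page-number pass ran first in this sketch, the running header would be left behind with no page number next to it to anchor a search, which is the "harder detection" problem described above.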
03-31-2019, 01:56 PM | #4 |
(he/him/his)
Posts: 12,160
Karma: 79742714
Join Date: Jul 2010
Location: Sunshine Coast, BC
Device: Oasis (Gen3),Paperwhite (Gen10), Voyage, Paperwhite(orig), Fire HD 8
|
Another trick I would suggest, based on experience with a number of books I've cleaned up this way: don't try to do it all in one pass. Take an iterative approach. So, in the example I showed above, I would then have done a search for \w*\b</p>\n<p> (a run of word characters, followed by a word boundary, an end-of-paragraph tag, a newline, and an opening paragraph tag). This would catch places where my earlier substitution had created a paragraph break between two words in a sentence, rather than at the end of a sentence.
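For anyone who wants to try that search outside the editor first, here is a rough Python sketch using the same pattern. The capture group, the single-space replacement, and the `join_mid_sentence_breaks` name are assumptions added for illustration; the post only gives the search expression, not what to replace it with.

```python
import re

# Search pattern from the post, with a capture group added (an assumption)
# so the word before the break can be kept in the replacement.
MID_SENTENCE_BREAK = re.compile(r"(\w*\b)</p>\n<p>")


def join_mid_sentence_breaks(html: str) -> str:
    """Rejoin paragraphs that were split between two words of a sentence."""
    # Keep the captured word and swap the tag pair for a single space.
    return MID_SENTENCE_BREAK.sub(r"\1 ", html)


sample = "<p>The quick brown</p>\n<p>fox jumped.</p>\n<p>Next paragraph.</p>"
print(join_mid_sentence_breaks(sample))
```

Paragraphs that end in punctuation are left alone, because the word boundary cannot match between a full stop and the closing tag. In the editor itself you would probably want to step through the matches rather than replace all at once, since headings and short unpunctuated lines match the pattern too.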
Still not perfect, but by now I'm starting to create a readable book instead of one that's too annoying to bother with. |
04-01-2019, 10:00 AM | #5 |
Junior Member
Posts: 6
Karma: 10
Join Date: Mar 2019
Device: kindle
|
Guys, I really appreciate your help. It looks like I am going to have to spend several hours revising/learning HTML editing and regex, but I knew I would sometime soon. I'll try and have a go myself, but in all honesty I think it likely I will be posting again in the next few weeks.
Thanks again, Robin |
04-01-2019, 11:13 AM | #6 | |
Well trained by Cats
Posts: 29,804
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Google "cheat sheets". Grab the ones for CSS, regex (understand that not everything on them works with PCRE), and HTML (again, EPUB is a subset, not the whole banana). |
|
04-02-2019, 01:34 AM | #7 |
(he/him/his)
Posts: 12,160
Karma: 79742714
Join Date: Jul 2010
Location: Sunshine Coast, BC
Device: Oasis (Gen3),Paperwhite (Gen10), Voyage, Paperwhite(orig), Fire HD 8
|
Yup, same here. Happy to stick my oar in to help.
Charlie. |
Tags |
calibre, convert, pagebreak remove |