Old 03-31-2019, 01:37 AM   #2
CRussel
(he/him/his)
First, don't attempt to do this as part of the conversion. I suppose it's possible, but the regex is a pain to troubleshoot that way. Instead, convert to ePub, then edit the ebook and use the regex Search and Replace there. (For help with this, see the Calibre manual and All About Using Regular Expressions in Calibre.)

What is required will vary somewhat, depending on how your PDF converter renders new pages and what page marks are in the book. But I started out with something like this, and then tweaked it quite a bit for the Modesty Blaise books I was converting.
PHP Code:
</p>\s*<div\s+class="newpage"\s+id="page-[0-9]*"></div>\s*<p
That finds the end-of-paragraph tag followed by any amount of whitespace (possibly none), then <div, at least one whitespace character, then class="newpage", whitespace again, then id="page- followed by zero or more digits (the page number) and a closing quote, a closing HTML div tag, some more whitespace, and finally the start of the paragraph tag for the next paragraph.

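I would put </p>\n<p in the replace box, since the match swallows the closing </p> of one paragraph and the opening <p of the next, and both have to go back in to keep the markup balanced. Of course, this has the problem of leaving a paragraph break where you might not be at the end of a sentence.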
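If you want to try the expression outside the editor first, here is a minimal Python sketch of the same search and replace. The sample markup and the class="calibre1" attribute are made up for illustration, and your converter's output may well look different. The replacement used here is </p>\n<p, which puts back the two pieces of markup the pattern consumes.

```python
import re

# Pattern from the post: closing </p>, a page-marker div, then the next "<p".
PAGE_BREAK = re.compile(
    r'</p>\s*<div\s+class="newpage"\s+id="page-[0-9]*"></div>\s*<p'
)

# Hypothetical converter output; real markup will vary by book and converter.
sample = (
    '<p class="calibre1">End of page one.</p>\n'
    '<div class="newpage" id="page-2"></div>\n'
    '<p class="calibre1">Start of page two.</p>'
)

# The match stops at "<p" (before any attributes and the closing ">"),
# so whatever attributes the next paragraph carries are left untouched.
cleaned = PAGE_BREAK.sub('</p>\n<p', sample)
print(cleaned)
```

Because the pattern ends at <p rather than <p>, it also works when the next paragraph tag carries a class or other attributes.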

Overall, this is something you'll have to play with a bit. DO read the regex link above; it will very much help you build a regex that works for your specific books.

ETA: Note that the above isn't really PHP, but using that tag gives you some minimal syntax highlighting, possibly making it easier to parse. Or perhaps not. Also, fixed the regex link in the first paragraph.

Last edited by CRussel; 03-31-2019 at 01:46 PM.