MobileRead Forums - View Single Post - Sigil 2.6.2

Derell Licht · 10-14-2025, 12:19 PM

In case you are curious about how this html file came to be, it is part of my new technique for converting PDF documents/books to epub...

In the past I would run the PDF through an OCR converter, which generated a text file (that usually required a ton of cleanup). The biggest headache with the resulting text file was that every line ended with a hard CR/LF, and those had to be removed if I wanted the text to reflow smoothly as the screen dimensions changed.

But it recently occurred to me that if I just wrap basic html constructs around the text file (html, head, title, body), then the newline issue completely vanishes, because html and its derivatives treat those line breaks as ordinary whitespace!! So all I have to do then is walk through the file, delete all the hard-coded pagination lines, and insert <p> at the end of each paragraph, and I'm done; I just import it into Sigil to generate the epub, and I'm ready to publish...
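For anyone who wants to script that walk-through instead of doing it by hand, here is a rough Python sketch. It is not what I actually ran, and it rests on a few assumptions: paragraphs in the OCR text are separated by blank lines, a pagination line is a line holding nothing but a page number (optionally prefixed with "Page"), and each paragraph gets wrapped in a full <p>...</p> pair rather than a single <p> at the end. The names `text_to_html` and `PAGE_LINE` are made up for illustration.

```python
import html
import re

# Assumption: a pagination line contains only a page number, optionally
# prefixed with "Page". Real OCR output may need different patterns.
PAGE_LINE = re.compile(r"^\s*(?:page\s+)?\d+\s*$", re.IGNORECASE)


def text_to_html(text: str, title: str) -> str:
    """Wrap OCR text in a minimal html skeleton, one <p> per paragraph."""
    # Normalize CR/LF (and bare CR) line endings to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Drop the hard-coded pagination lines.
    lines = [ln for ln in text.split("\n") if not PAGE_LINE.match(ln)]
    # Assumption: one or more blank lines separate paragraphs.
    paragraphs = re.split(r"\n\s*\n", "\n".join(lines))
    body = []
    for para in paragraphs:
        para = para.strip()
        if para:
            # Line breaks inside a paragraph are left alone; a renderer
            # treats them as ordinary whitespace. Escape &, < and >.
            body.append("<p>" + html.escape(para, quote=False) + "</p>")
    return (
        "<html>\n<head>\n<title>"
        + html.escape(title, quote=False)
        + "</title>\n</head>\n<body>\n"
        + "\n".join(body)
        + "\n</body>\n</html>\n"
    )
```

Calling `text_to_html(raw_text, "My Book")` returns a string that can be saved as an .html file and imported into Sigil. A page break that falls in the middle of a paragraph and is surrounded by blank lines will still split that paragraph in two, so a manual pass is still needed.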

My mistake here was that I wanted to retain Internet Archive's signatures, so anyone looking at the code would know where I got it... so I took that header from some other file on the IA page (for this book) and imported it into my document... but I didn't realize until now that there were some traps to look out for!!

I also wasn't aware of the issues with a large html file, which you pointed out to me here... I just went back and added split markers at all the new-chapter points.
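The split markers could also be added by script. A minimal sketch, assuming each chapter opens with a paragraph whose text starts with "Chapter" (my file may not match that, so adjust the pattern), and using the `<hr class="sigil_split_marker" />` element that Sigil's "Split At Markers" command looks for. The function name `add_split_markers` is hypothetical.

```python
import re

# The marker element Sigil recognizes for "Split At Markers".
SPLIT_MARKER = '<hr class="sigil_split_marker" />'

# Assumption: a chapter begins on a line like "<p>Chapter 1 ...".
CHAPTER_START = re.compile(r"^<p>\s*chapter\b", re.IGNORECASE)


def add_split_markers(html_text: str) -> str:
    """Insert a split marker line before every chapter-opening paragraph."""
    out = []
    for line in html_text.split("\n"):
        if CHAPTER_START.match(line):
            out.append(SPLIT_MARKER)
        out.append(line)
    return "\n".join(out)
```

A marker placed before the first chapter splits any front matter into its own file. After running this, the split itself still has to be done inside Sigil.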

#15 · Derell Licht · Member · Posts: 21 · Karma: 10 · Join Date: Jul 2016 · Location: Fremont, CA · Device: Kindle Paperwhite Signature Edition
Last edited by Derell Licht; 10-14-2025 at 04:00 PM.