In case you are curious about how this html file came to be, it is part of my new technique for converting PDF documents/books to epub...
In the past I would run the PDF through an OCR converter, which generated a text file (that usually required a ton of cleanup). The biggest headache with the resulting text file was that every line ended with a hard CR/LF, which had to be removed if I wanted the pages to flow smoothly with changing screen dimensions.
But it recently occurred to me that if I just wrap basic html constructs around the text file (html, head, title, body), the newline issue completely vanishes, because html and its derivatives treat those line breaks as ordinary whitespace!! So all I have to do then is walk through the file, delete all the hard-coded pagination lines, and insert <p> at the end of each paragraph, and I'm done; just import into Sigil to generate the epub, and I'm ready to publish...
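For anyone who wants to script that walk-through instead of doing it by hand, here is a rough Python sketch of the idea, not my actual workflow. It rests on assumptions that may not match a given OCR output: paragraphs are separated by blank lines, a pagination line holds nothing but digits, and the function name text_to_html is made up for this example. It also wraps each paragraph in a full <p>...</p> pair, which keeps the markup well-formed for epub.

```python
import html
import re

# Assumption: a pagination line contains only a page number.
PAGE_LINE = re.compile(r"^\s*\d+\s*$")


def text_to_html(text, title="Untitled"):
    """Wrap OCR text in a minimal html skeleton (hypothetical helper)."""
    paragraphs, current = [], []
    for line in text.splitlines():  # splitlines() handles CR/LF endings
        if PAGE_LINE.match(line):
            continue  # drop hard-coded pagination lines
        if line.strip():
            current.append(line.strip())
        elif current:
            # Assumption: a blank line ends the current paragraph.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))

    body = "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)
    return (
        "<html>\n<head>\n"
        f"<title>{html.escape(title)}</title>\n"
        "</head>\n<body>\n"
        f"{body}\n"
        "</body>\n</html>\n"
    )
```

The result still needs a read-through before importing into Sigil, since OCR text rarely has clean blank lines between every paragraph.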
My mistake here was that I wanted to retain Internet Archive's signatures, so anyone looking at the code would know where I got it... so I took that header from another file on the IA page (for this book) and imported it into my document... but I didn't realize until now that there were some traps to look out for!!
I also wasn't aware of the issues with a large html file, which you pointed out to me here... I just went back and added split markers at all the new-chapter points.
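If the chapters are easy to spot in the markup, the split markers could also be added by script. This is only a sketch under assumptions: every chapter starts with an <h1> tag (which may not be true of your file), the marker is the <hr class="sigil_split_marker" /> line that I understand Sigil uses for its split-at-markers feature (worth checking against your Sigil version), and add_split_markers is a made-up name.

```python
import re

# Assumption: this is the marker Sigil recognizes when splitting at markers.
SPLIT_MARKER = '<hr class="sigil_split_marker" />'


def add_split_markers(html_text):
    """Insert a split marker before every <h1> except the first (hypothetical helper)."""
    seen = 0

    def mark(match):
        nonlocal seen
        seen += 1
        if seen == 1:
            return match.group(0)  # no split needed before the first chapter
        return SPLIT_MARKER + "\n" + match.group(0)

    # Assumption: each new chapter begins with an <h1> heading.
    return re.sub(r"<h1\b", mark, html_text)
```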
Last edited by Derell Licht; 10-14-2025 at 04:00 PM.