I've converted some books that are available as PDF + TXT from archive.org. I use SumatraPDF for opening the PDFs since it knows about the invisible/hidden text layer; maybe all PDF viewers do.
I use Sigil and don't need to manually remove the mid-line carriage returns. You can use search and replace to replace the blank lines between paragraphs with an end-of-paragraph tag followed by a beginning-of-paragraph tag: </p><p>. Then jump to the top of the book and add the missing <p>, then to the bottom of the book and add the missing </p>. Then use Sigil's Mend and Prettify to make it look good in Sigil. The hyphens that were at the ends of lines can be found by searching for a hyphen followed by a space; you can't just remove them all because sometimes it was a word that's normally hyphenated.
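If you'd rather script those steps than do them by hand, here is a rough Python sketch of the same idea. It is not what Sigil does internally; the function names `text_to_paragraphs` and `hyphen_candidates` are made up for illustration, and it assumes the TXT file uses blank lines between paragraphs.

```python
import html
import re


def text_to_paragraphs(text):
    """Wrap each blank-line-separated block in <p>...</p>, joining wrapped lines."""
    # Normalise Windows/old-Mac line endings so the splitting below is uniform.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # One or more blank (or whitespace-only) lines separate paragraphs.
    blocks = re.split(r"\n\s*\n", text.strip())
    paragraphs = []
    for block in blocks:
        # Join the mid-paragraph line breaks with single spaces.
        joined = " ".join(line.strip() for line in block.split("\n") if line.strip())
        if joined:
            # Escape &, < and > so the result is valid markup.
            paragraphs.append("<p>" + html.escape(joined, quote=False) + "</p>")
    return "\n".join(paragraphs)


def hyphen_candidates(markup):
    """List 'word- word' pairs left behind by end-of-line hyphens, for manual review."""
    return re.findall(r"\w+- \w+", markup)


sample = "The con-\nverted book\nreads well.\n\nA well-\nknown author."
markup = text_to_paragraphs(sample)
print(markup)
print(hyphen_candidates(markup))
```

The second function only lists the candidates; "con- verted" should be joined but "well- known" should keep its hyphen, so each one still needs a human decision.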
Whatever OCR archive.org uses often sees an exclamation mark (!) or a lowercase ell (l) as a 1, so search for digits; there are threads here about this and other common OCR errors, with regexps for searching.
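As one example of that kind of search, here is a small Python sketch that flags words containing both letters and digits, which is where a misread "l" usually shows up. The pattern and the helper name `suspicious_digits` are my own assumptions, not taken from those threads, and it won't catch a "!" misread as a standalone 1.

```python
import re

# A "word" that contains at least one letter and at least one digit.
MIXED_WORD = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*\d)\w+\b")


def suspicious_digits(text):
    """Return (line_number, word) pairs for words mixing letters and digits."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for word in MIXED_WORD.findall(line):
            hits.append((number, word))
    return hits


sample = "He wi1l arrive in 1850,\nI te1l you1"
for number, word in suspicious_digits(sample):
    print(number, word)
```

Plain numbers such as 1850 are skipped on purpose, so real dates and page references don't flood the list.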
Last edited by hobnail; 07-08-2020 at 05:15 PM.