MobileRead Forums - View Single Post

michaelbr · 08-12-2023, 02:28 AM

Quote:

Originally Posted by AlanHK

You need to be aware that there are several different kinds of PDFs. One kind is made by scanning printed pages, so it is just a set of images of text.
Many apps, like Adobe Acrobat, will try to OCR this and embed an invisible text layer, which allows you to search for text and copy chunks of it.

The other kind is created by a DTP app, like InDesign. The text is actually text, not images of text; if you zoom in to the maximum, it will still be smooth.

With either of these, you can use the text cursor to select all, copy, and paste into Word (or equivalent). From there, save as "web page, filtered" and you get HTML, which you can import into Sigil or Calibre.
You will then have a lot more work to do to clean it up.
If it was OCR text, there will be lots of errors. Spellcheck will help find many. You also need to review the page breaks, which insert spurious paragraph breaks. Then rationalise the CSS.
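
As a rough illustration of the page-break cleanup, here is a minimal Python sketch. The heuristic is my own assumption, not something any of the tools above do: if a paragraph does not end in sentence punctuation and the next one starts with a lowercase letter, the break was probably caused by a page break, so the two get joined. The name merge_split_paragraphs is hypothetical, and the result still needs checking by eye.

```python
# Characters that may trail the final punctuation mark (closing quotes, brackets).
CLOSERS = "\"')]"
# Punctuation treated as a plausible end of paragraph (an assumption).
ENDERS = (".", "!", "?", ":", ";")


def ends_sentence(text):
    """Return True if text ends with sentence punctuation, ignoring closers."""
    return text.rstrip(CLOSERS).endswith(ENDERS)


def merge_split_paragraphs(paragraphs):
    """Join paragraphs that look like they were split by a page break.

    Heuristic only: the previous paragraph has no closing punctuation
    and the current one starts with a lowercase letter.
    """
    merged = []
    for para in paragraphs:
        text = para.strip()
        if not text:
            continue  # drop empty paragraphs left over from the export
        if merged and not ends_sentence(merged[-1]) and text[0].islower():
            merged[-1] = merged[-1] + " " + text
        else:
            merged.append(text)
    return merged


if __name__ == "__main__":
    sample = ["The quick brown fox", "jumped over the dog.", "A new paragraph."]
    for p in merge_split_paragraphs(sample):
        print(p)
```

It will miss breaks that happen to fall at the end of a sentence, and it can wrongly join a heading to a paragraph that starts with a lowercase word, so treat it as a first pass only.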

If it's a scanned document, you can also try a full OCR app like ABBYY, which can export directly to HTML; some versions can also produce epub.

And there is a command-line app, "pdftohtml", which does what its name says. See https://poppler.freedesktop.org/
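
If you want to script it, a minimal sketch of calling pdftohtml from Python could look like this. It assumes the Poppler build of pdftohtml is on your PATH and uses two of its options as I understand them: -noframes (write a single HTML file instead of a frameset) and -i (ignore images). Check "pdftohtml -h" on your install. The function names and the file names book.pdf / book are just placeholders.

```python
import shutil
import subprocess


def build_pdftohtml_command(pdf_path, out_base, single_file=True, skip_images=False):
    """Build the argument list for Poppler's pdftohtml (hypothetical helper)."""
    cmd = ["pdftohtml"]
    if single_file:
        cmd.append("-noframes")  # one HTML file rather than a frameset
    if skip_images:
        cmd.append("-i")  # ignore images, keep text only
    cmd += [pdf_path, out_base]
    return cmd


def convert(pdf_path, out_base):
    """Run pdftohtml if it is installed; raise a clear error otherwise."""
    if shutil.which("pdftohtml") is None:
        raise RuntimeError("pdftohtml not found on PATH (install Poppler utilities)")
    subprocess.run(build_pdftohtml_command(pdf_path, out_base), check=True)


if __name__ == "__main__":
    # Only prints the command; call convert("book.pdf", "book") to actually run it.
    print(" ".join(build_pdftohtml_command("book.pdf", "book")))
```

The output will still need the same cleanup as any other route.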

No matter how you do it, you need to invest hours at least to clean up and check.
Otherwise you will get garbage; see the awful epubs automatically created by OCR at Internet Archive.

Thanks for this detailed explanation and tips, will check them out.