11-11-2023, 02:43 PM   #11
jackie_w
Grand Sorcerer
Posts: 6,252
Karma: 16544692
Join Date: Sep 2009
Location: UK
Device: ClaraHD, Forma, Libra2, Clara2E, LibraCol, PBTouchHD3
Quote:
Originally Posted by Tex2002ans
No. If you take a closer look, it's extremely likely to be:
  • Missing all formatting information.
    • All the italics, bold, etc. So you'll only get the raw plaintext itself.
    • + Many paragraph breaks, especially if they cross pages.
  • An OCR that happened automatically, full of typos/errors.
I'm not sure I agree with these generalisations. Admittedly it is a long time (10+ years) since I made a concerted effort to "convert" my own (relatively few) PDFs to good HTML, but in the end I was able to do it without losing italics/bold, headings and scenebreaks. Maybe I was just lucky, but the text layer in my PDFs was excellent, i.e. there was no indication that the text was the result of auto-OCR.

However, the effort involved was huge and the best partial solution I could come up with was to create a series of self-programmed interactive "assistant" utilities to semi-automate the process. Most PDFs I converted introduced some kind of new challenge that I hadn't seen previously.

Having experienced the challenges involved first-hand, my conclusion was that it isn't possible to create a magic one-click solution that would work for all PDFs. I didn't waste my time even trying to convert multi-column or non-fiction documents with many tables or footnotes. Too hard! Fiction novels only.

I won't bore everyone with great detail but I found that the best method for Step 1 of the whole process was to find a utility which would extract the PDF text as an XML tree. This had the benefit of retaining a lot more info about each text snippet extracted, e.g. font used and (x,y) position on page. The drawback was that I had to create my own logic for rearranging the text snippets into correct reading order and identifying paragraph starts/ends. The font used can help identify chapter headings, italic/bold, dropcaps and small-caps. The (x,y) position can help identify correct reading order, paragraph starts, scenebreaks and those unwanted PDF headers/footers.
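
To give a rough idea of the kind of logic involved, here is a minimal Python sketch. It is not the actual utilities mentioned above, just an illustration under stated assumptions: the XML element and attribute names (page, fontspec, text, top, left, font) are invented for the example and real extraction tools each have their own schema, and the thresholds (body font size, left margin, indent width, header/footer band) are hard-coded guesses that would need tuning for every PDF.

```python
import xml.etree.ElementTree as ET

# Hypothetical extractor output for one page. The schema is an assumption
# for illustration only; snippets are deliberately out of reading order.
SAMPLE_XML = """
<doc>
  <page number="1" height="800" width="600">
    <fontspec id="0" size="12" family="Serif"/>
    <fontspec id="1" size="12" family="Serif-Italic"/>
    <fontspec id="2" size="20" family="Serif-Bold"/>
    <text top="770" left="295" font="0">1</text>
    <text top="160" left="130" font="1">very</text>
    <text top="20" left="250" font="0">My Book Title</text>
    <text top="192" left="70" font="0">Then a door opened.</text>
    <text top="160" left="70" font="0">It was a </text>
    <text top="100" left="200" font="2">Chapter One</text>
    <text top="176" left="50" font="0">Nothing stirred.</text>
    <text top="160" left="165" font="0"> dark night.</text>
  </page>
</doc>
"""

def load_snippets(page):
    """Attach font info and (x,y) position to every text snippet."""
    fonts = {f.get("id"): f for f in page.iter("fontspec")}
    snippets = []
    for t in page.iter("text"):
        font = fonts[t.get("font")]
        snippets.append({
            "top": int(t.get("top")),
            "left": int(t.get("left")),
            "size": int(font.get("size")),
            "family": font.get("family"),
            "text": "".join(t.itertext()),
        })
    return snippets

def strip_margins(snippets, page_height, margin=0.08):
    """Drop running headers/footers: anything in the top or bottom band."""
    lo, hi = page_height * margin, page_height * (1 - margin)
    return [s for s in snippets if lo <= s["top"] <= hi]

def group_lines(snippets, tolerance=3):
    """Restore reading order: sort by y then x, merge snippets sharing a baseline."""
    lines = []
    for s in sorted(snippets, key=lambda s: (s["top"], s["left"])):
        if lines and abs(s["top"] - lines[-1][0]["top"]) <= tolerance:
            lines[-1].append(s)
        else:
            lines.append([s])
    for line in lines:
        line.sort(key=lambda s: s["left"])
    return lines

def render_line(line):
    """Join a line's snippets, wrapping italic-font snippets in <i>."""
    parts = []
    for s in line:
        text = s["text"]
        if "Italic" in s["family"]:
            text = f"<i>{text}</i>"
        parts.append(text)
    return "".join(parts)

def to_blocks(lines, body_size=12, body_left=50, indent=10):
    """Larger font means heading; an indented first snippet starts a paragraph."""
    blocks = []
    for line in lines:
        first = line[0]
        text = render_line(line).strip()
        if first["size"] > body_size:
            blocks.append(["h2", text])
        elif (first["left"] >= body_left + indent
              or not blocks or blocks[-1][0] != "p"):
            blocks.append(["p", text])
        else:
            blocks[-1][1] += " " + text  # continuation of the current paragraph
    return [f"<{tag}>{text}</{tag}>" for tag, text in blocks]

def convert_page(xml_text):
    page = ET.fromstring(xml_text).find("page")
    snippets = strip_margins(load_snippets(page), int(page.get("height")))
    return to_blocks(group_lines(snippets))

html = convert_page(SAMPLE_XML.strip())
print("\n".join(html))
```

On the sample page this drops the running title and the page number, emits the large-font line as a heading, and merges the unindented line into the preceding paragraph while keeping the italic word tagged. Anything beyond this toy case (dropcaps, small-caps, scenebreaks, paragraphs that cross pages, hyphenation at line ends) needs extra rules, which is where the per-PDF effort goes.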

P.S. There's no way I would offer this as a service-for-hire unless I could pick and choose the PDFs I was prepared to tackle. It was an interesting (and bloody-minded) personal project but not something I'd want to do on a regular basis.