Quote:
Originally Posted by Tex2002ans
No. If you take a closer look, it's extremely likely to be:
- Missing all formatting information.
  - All the italics, bold, etc. So you'll only get the raw plaintext itself.
  - Plus many paragraph breaks, especially if they cross pages.
- An OCR that happened automatically, full of typos/errors.
I'm not sure I agree with these generalisations. Admittedly, it is a long time (10+ years) since I made a concerted effort to "convert" my own (relatively few) PDFs to good HTML, but I was eventually able to do it without losing italics/bold, headings and scenebreaks. Maybe I was just lucky, but the text layer in my PDFs was excellent, i.e. there was no indication that the text was the result of auto-OCR.
However, the effort involved was huge and the best partial solution I could come up with was to create a series of self-programmed interactive "assistant" utilities to semi-automate the process. Most PDFs I converted introduced some kind of new challenge that I hadn't seen previously.
Having experienced the challenges involved first-hand, my conclusion is that it isn't possible to create a magic one-click solution that would work for all PDFs. I didn't waste my time even trying to convert multi-column or non-fiction documents with many tables or footnotes. Too hard! Fiction novels only.
I won't bore everyone with great detail, but I found that the best method for Step 1 of the whole process was to find a utility which would extract the PDF text as an XML tree. This had the benefit of retaining a lot more info about each text snippet extracted, e.g. font used and (x,y) position on page. The drawback was that I had to create my own logic for rearranging the text snippets into correct reading order and identifying paragraph starts/ends. The font used can help identify chapter headings, italic/bold, dropcaps and small-caps. The (x,y) position can help identify correct reading order, paragraph starts, scenebreaks and those unwanted PDF headers/footers.
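To give a rough idea of the kind of logic involved, here is a minimal Python sketch. It is only an illustration under stated assumptions, not the utilities described above: the XML layout (a `page` element holding `text` elements with `top`, `left` and `font` attributes), the font names and all thresholds are hypothetical, and real PDFs will need far more special-casing.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical extractor output: one <text> element per snippet, deliberately out of order.
SAMPLE = """<page height="800" width="600">
  <text top="770" left="290" font="body">12</text>
  <text top="115" left="50" font="body">The rain fell.</text>
  <text top="100" left="250" font="body">night.</text>
  <text top="30" left="250" font="body">My Novel Title</text>
  <text top="100" left="70" font="body">It was a dark and</text>
  <text top="130" left="70" font="body">A new paragraph.</text>
  <text top="100" left="200" font="italic">stormy</text>
</page>"""

ITALIC_FONTS = {"italic"}  # assumed font ids that mean "italic"


def load_snippets(page):
    """Turn each <text> element into a dict with its position, font and text."""
    return [
        {"top": float(t.get("top")), "left": float(t.get("left")),
         "font": t.get("font"), "text": (t.text or "").strip()}
        for t in page.iter("text")
    ]


def group_lines(snippets, tol=3.0):
    """Sort snippets top-to-bottom and merge those with nearly equal 'top' into lines."""
    lines = []
    for s in sorted(snippets, key=lambda s: (s["top"], s["left"])):
        if lines and abs(s["top"] - lines[-1][0]["top"]) <= tol:
            lines[-1].append(s)
        else:
            lines.append([s])
    for line in lines:
        line.sort(key=lambda s: s["left"])  # left-to-right within a line
    return lines


def render(snippet):
    """Escape the text and wrap it in <i> when the font is an italic one."""
    text = escape(snippet["text"])
    return f"<i>{text}</i>" if snippet["font"] in ITALIC_FONTS else text


def to_paragraphs(xml_text, margin=50.0, indent=10.0):
    """Drop header/footer bands, rebuild reading order, split on indented lines."""
    page = ET.fromstring(xml_text)
    height = float(page.get("height"))
    # Anything in the top or bottom margin band is treated as a running header/footer.
    body = [s for s in load_snippets(page)
            if s["text"] and margin <= s["top"] <= height - margin]
    lines = group_lines(body)
    if not lines:
        return []
    body_left = min(line[0]["left"] for line in lines)
    paragraphs = []
    for line in lines:
        text = " ".join(render(s) for s in line)
        # An indented first snippet is taken as the start of a new paragraph.
        if not paragraphs or line[0]["left"] - body_left >= indent:
            paragraphs.append(text)
        else:
            paragraphs[-1] += " " + text
    return [f"<p>{p}</p>" for p in paragraphs]


if __name__ == "__main__":
    for p in to_paragraphs(SAMPLE):
        print(p)
```

The same position data could be extended to spot scenebreaks (an unusually large vertical gap between lines) or chapter headings (a larger font), but the right thresholds vary from PDF to PDF, which is exactly why a one-click solution is so hard.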
P.S. There's no way I would offer this as a service-for-hire unless I could pick and choose the PDFs I was prepared to tackle. It was an interesting (and bloody-minded) personal project but not something I'd want to do on a regular basis.