Has anyone tried more or less every program for exporting PDF to Word/HTML and compared the results? Corrections are always a lot of work, of course, but I am looking to minimize a few specific problems:
1. Diacritics. Some publishers use combining marks, and so far I've found that only Acrobat handles them correctly. Even the best of the commercial PDF -> Word converters I've tried inserts spaces in such words.
2. Hyphenation. PDF2Office is the only app I know of that can attempt to remove hyphenation using a dictionary.
3. Paragraph spacing. Say a paragraph style has some space above it, such as a full line or half a line. In every export I can remember, the resulting margin values vary: a paragraph with half a line of space above might get a top margin of 5pt, 6pt, or anything in between or close to it. I can end up with a zillion unique paragraph styles, which makes this difficult to fix. Even Acrobat's HTML and Word exports give different results from each other; the HTML export uses whole numbers in its inline CSS, which makes it somewhat easier to correct.
I'm struggling with this now: I have a reference work I'm trying to export, formatted like a dictionary, with a set amount of space above each entry. But many other paragraphs have the same space above them, so a regex keyed on something like "a new entry begins with bold-italic" isn't reliable. Acrobat, too, sometimes makes errors and gives such paragraphs no top margin at all.
4. Columns. I haven't tried this in a while, but some apps might not reliably separate the left and right columns, mixing them together line by line from top to bottom, left to right. In the PDF I'm currently trying to convert, each page has a top header with the entry name and page number, which I'd like to turn into EPUB 3 page numbers with a regex. Yet sometimes the left part of the header is inserted correctly at the top of the page while the right part, with the page number, is inserted at the top of the right column, making it useless.
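For the diacritics problem in point 1, a post-processing pass can sometimes repair the converter output. Here is a rough Python sketch, assuming the stray space lands immediately before the combining mark (I have not checked that every converter splits words exactly this way). The name `fix_combining` is made up for illustration. It drops that space and then composes to NFC where a precomposed character exists.

```python
import re
import unicodedata

# One or more spaces (plain or no-break) directly before a combining mark
# from the Combining Diacritical Marks block (U+0300 to U+036F).
# Assumption: this is where the converter inserts the stray space.
STRAY_SPACE_RE = re.compile(r"[ \u00a0]+(?=[\u0300-\u036f])")


def fix_combining(text):
    """Remove spaces that split a base letter from its combining mark,
    then compose to NFC where a precomposed character exists."""
    joined = STRAY_SPACE_RE.sub("", text)
    return unicodedata.normalize("NFC", joined)


broken = "cafe \u0301"        # "cafe", space, COMBINING ACUTE ACCENT
print(fix_combining(broken))  # café
```

If the space ends up after the mark instead, it looks the same as an ordinary word boundary, and this approach won't catch it without a dictionary check.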
I haven't recently tried the various other commercial PDF apps such as Nitro, Phantom, etc.; those might have other issues. I suppose that with each PDF one must try them all and see which works best in that case.
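For the spacing problem in point 3, one way to collapse the near-duplicate values after export is to snap every inline `margin-top` to the nearest value from a short list of intended spacings. Below is a minimal Python sketch, assuming the export writes inline CSS like `margin-top:5.4pt`. The target values (0, 6 and 12 pt) and the name `snap_margins` are my own placeholders, not something any converter provides.

```python
import re

# Hypothetical target spacings in points: none, half a line, a full line.
TARGETS = [0.0, 6.0, 12.0]

# Matches inline declarations such as "margin-top:5.4pt" or "margin-top: 6pt".
MARGIN_RE = re.compile(r"margin-top\s*:\s*(\d+(?:\.\d+)?)pt")


def snap_margins(html, targets=TARGETS):
    """Replace each margin-top value with the nearest target value."""
    def repl(match):
        value = float(match.group(1))
        nearest = min(targets, key=lambda t: abs(t - value))
        return f"margin-top:{nearest:g}pt"
    return MARGIN_RE.sub(repl, html)


sample = '<p style="margin-top:5.4pt">a</p><p style="margin-top: 6pt">b</p>'
print(snap_margins(sample))
# <p style="margin-top:6pt">a</p><p style="margin-top:6pt">b</p>
```

This only reduces the number of unique styles. It does nothing for paragraphs where the converter dropped the top margin entirely, so those still have to be found by hand.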