12-09-2017, 12:53 PM   #11
Hitch
Bookmaker & Cat Slave
Posts: 11,503
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
Quote:
Originally Posted by RobertDDL
Yes, you are right about all formatting being lost. For novels, in most cases that would be headings and italics. It doesn't take much time to manually restore headings, in a plain text file (word-wrap disabled) you can usually spot them quite easily, even if they don't start with numbers or "chapter" etc. The loss of italics can be quite unfortunate, if the writer relies on them - not all of them do, though.

Unless I've missed something, even with a good OCR software, though, it's not trivial to retain italics, while getting rid of all the excess formatting that the OCR'd output file is burdened with (even Abbyy is wont to hallucinate dozens of different page and paragraph formats, font sizes, etc.). I do have a fairly recent version of Abbyy, but, for instance, when I had to convert a series of 26 PDF books for a friend (being blind, she couldn't read the PDFs), I got it done much more quickly with the pdftotext tool. (There were no italics in those PDFs, which we had bought, but which seem to have been produced from carelessly OCR'd print editions. No idea if the paper originals had them, but I've often come across public-domain books on the Internet without italics, when the originals have them.) (Which is not meant to say it's ok to lose them.)
Honestly, I've never noticed any Sturm und Drang in retaining italics with Abbyy. Using Abbyy is, as you rightly say, a PITA overall, but in my opinion, compared to all the other methods of getting a PDF into an editable form, it's the best way to go. Maybe I'm just lucky that way, because we do get a lot of books that are laden with bold and italics.

I'm sure that everyone has their own preferred way of working. I have a ton of expertise in Word, so frankly, it's super-easy for me to do the cleanup on Abbyy output in a Word file, or, of course, to regex it to the nth degree in HTML. If there ARE italics and bold in large numbers, I'll use Toxaris' superb "ePUB Tools" Word plug-in first--as that makes marking/retaining both of those character markups simplicity itself--and then I'll "clean" the styles from the rest and restyle them. I find that the fastest route.
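
For anyone curious what the regex-in-HTML route could look like, here is a rough Python sketch. It is only an illustration, not the actual workflow described above, and it assumes the OCR export marks inline formatting with non-nested <span style="..."> elements plus stray <font> tags (real Abbyy output may well differ). The helper name clean_ocr_html and the sample markup are made up for the example.

```python
import re

# Assumption: inline formatting arrives as non-nested <span style="..."> elements.
SPAN = re.compile(r'<span\b([^>]*)>(.*?)</span>', re.IGNORECASE | re.DOTALL)
ITALIC = re.compile(r'font-style\s*:\s*italic', re.IGNORECASE)
BOLD = re.compile(r'font-weight\s*:\s*(bold|[6-9]00)', re.IGNORECASE)
FONT_TAG = re.compile(r'</?font\b[^>]*>', re.IGNORECASE)
ATTRS = re.compile(r'\s+(?:style|class)\s*=\s*"[^"]*"', re.IGNORECASE)


def _rewrite_span(match):
    """Turn a styled span into <i>/<b> if it carries them, else keep only its text."""
    attrs, text = match.group(1), match.group(2)
    if ITALIC.search(attrs):
        text = f'<i>{text}</i>'
    if BOLD.search(attrs):
        text = f'<b>{text}</b>'
    return text


def clean_ocr_html(html):
    """Hypothetical helper: keep italics and bold, drop the other inline styling."""
    html = SPAN.sub(_rewrite_span, html)   # spans become <i>/<b> or plain text
    html = FONT_TAG.sub('', html)          # remove <font ...> and </font>
    html = ATTRS.sub('', html)             # remove style="..." and class="..."
    return html


sample = (
    '<p class="p12" style="margin:0">'
    '<span style="font-size:11pt">She said </span>'
    '<span style="font-style:italic; font-size:11pt">never</span>'
    '<font face="X">.</font></p>'
)
print(clean_ocr_html(sample))
```

The order matters: the italic and bold spans are converted before the blanket attribute strip, so the character markup survives while the page, paragraph, and font-size clutter is thrown away. Nested spans would need a real HTML parser rather than regexes.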

Offered solely FWIW.

Hitch