MobileRead Forums - View Single Post

hoichi · 03-17-2014, 03:40 AM

PDF is a tough format to parse. Hard line breaks are everywhere, headers and footers are part of the text, and so on. I haven't seen anything good come of messing with it, either from Calibre conversion or from KOReader reflow (yes, I know KOReader is still deep in alpha).
Besides, there are reasons why some things are in PDFs. Some are just pictures, scanned but not OCRed. Others, like screenplays, have very specific formatting that is lost in conversion (I guess it's technically possible to make an ePub that looks like a proper screenplay, so the hardest part would probably be, once again, parsing the PDF).
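
To make the hard-break and header/footer problem concrete, here is a rough sketch of the kind of cleanup a converter has to attempt. It assumes the text has already been pulled out of the PDF page by page with some other tool, and the function names (`strip_repeated_edges`, `unwrap_hard_breaks`) are made up for illustration. It is not how Calibre or KOReader actually do it, just a naive heuristic that shows why the results are often poor.

```python
import re
from collections import Counter


def _normalize(line):
    # Collapse digits so "Page 3" and "Page 12" count as the same line.
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_edges(pages, min_fraction=0.6):
    """Drop first/last lines that recur on most pages (likely headers/footers)."""
    counts = Counter()
    for page in pages:
        lines = [l for l in page.splitlines() if l.strip()]
        if lines:
            # A set, so a one-line page is not counted twice.
            counts.update({_normalize(lines[0]), _normalize(lines[-1])})
    threshold = max(2, int(len(pages) * min_fraction))
    repeated = {key for key, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        lines = [l for l in page.splitlines() if l.strip()]
        if lines and _normalize(lines[0]) in repeated:
            lines = lines[1:]
        if lines and _normalize(lines[-1]) in repeated:
            lines = lines[:-1]
        cleaned.append("\n".join(lines))
    return cleaned


def unwrap_hard_breaks(text):
    """Join lines that do not end a sentence; rejoin words split by a hyphen."""
    out = ""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if out.endswith("-"):
            # Naive: this also merges genuine hyphens at line ends.
            out = out[:-1] + line
        elif out and not out.endswith((".", "!", "?", ":")):
            out += " " + line
        elif out:
            out += "\n" + line
        else:
            out = line
    return out


pages = [
    "My Book\nThis is a para-\ngraph that wraps\nacross lines.\nPage 1",
    "My Book\nSecond page text\ncontinues here.\nPage 2",
    "My Book\nThird page.\nPage 3",
]
cleaned = strip_repeated_edges(pages)
print(unwrap_hard_breaks(cleaned[0]))
```

Even this toy version has to guess: a heading without a trailing period gets glued to the next paragraph, and a screenplay's deliberate short lines would be flattened, which is exactly the formatting loss described above.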
