MobileRead Forums - View Single Post

BensonBear · 05-25-2012, 04:01 PM

Quote:

Originally Posted by Mattypants

Since pdfs seem to be a bit of a fail on the KT, can they be reliably converted to epubs? (I'm guilty of not Googling this before asking, but wondering if any of you have had any success with it). Will report back if I have any myself.

It is far more reasonable to attempt to convert pdfs on one's own rather than expect Kobo to do it. There is no clear specification of what "reflowing" a pdf even means and different people will be differently satisfied by different ways of doing this. What Kobo SHOULD do is read pdfs in their native format by zooming automatically in portrait mode to the maximum possible (subject to user modification when it guesses wrong) and have nice commands to advance in the book that do not require panning. This will probably satisfy a lot more people a lot more easily.

I have not found any program that converts a pdf nicely by "reflowing" it. Calibre suggests using search and replace to match the headers and footers and remove them, but this seems very difficult. Even then, it will not figure out how to join the text across two successive pages.

Rather than using local regular expressions to match headers and footers, I think one needs a heuristic like "the font has changed near the end of the page, and remains mostly changed until the end of the page, perhaps after a space which is larger than usual". Calibre or some other program could add a set of heuristics like this that the user could apply before converting. In the source for pdftohtml, they can be applied at the point where it has gathered up all of the text for the page into a set of line segments.
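To make the idea concrete, here is a rough sketch of that footer heuristic, written in Python for brevity rather than inside pdftohtml itself. Everything in it is an assumption for illustration: the `Line` record, the `split_footer` name, and the thresholds `gap_factor` and `changed_ratio` are hypothetical and are not taken from pdftohtml or Calibre. It assumes the page has already been gathered into line segments sorted top to bottom, each with a position and a font size.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import median


@dataclass
class Line:
    """One line segment on a page (hypothetical record, not pdftohtml's own)."""
    top: float        # distance from the top of the page to the top of the line
    height: float     # height of the line
    font_size: float  # dominant font size of the line
    text: str


def split_footer(lines, gap_factor=1.5, changed_ratio=0.8, require_gap=False):
    """Split a page's lines into (body, footer) using page-global layout cues.

    The body font is taken to be the most common font size on the page.
    A footer starts at the topmost line in the lower half of the page whose
    font differs from the body font, provided that most lines from there to
    the end of the page also differ. If require_gap is True, the line must
    also follow a vertical gap larger than gap_factor times the median gap.
    """
    lines = list(lines)
    if len(lines) < 3:
        return lines, []

    body_size = Counter(l.font_size for l in lines).most_common(1)[0][0]
    gaps = [b.top - (a.top + a.height) for a, b in zip(lines, lines[1:])]
    typical_gap = median(gaps)

    for i in range(max(len(lines) // 2, 1), len(lines)):
        tail = lines[i:]
        changed = sum(1 for l in tail if l.font_size != body_size)
        font_changed = lines[i].font_size != body_size
        mostly_changed = changed / len(tail) >= changed_ratio
        big_gap = gaps[i - 1] > gap_factor * typical_gap
        if font_changed and mostly_changed and (big_gap or not require_gap):
            return lines[:i], tail

    return lines, []
```

Calling `split_footer(page_lines)` on one page at a time returns the lines to keep and the lines judged to be footer material. The thresholds would surely need tuning per book, which is why I think the user should be able to choose and adjust such heuristics before converting.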

If I keep my Kobo I will probably try to add these to pdftohtml, including one to detect typical footnotes, divert them to another section of the epub, and add a link to that section. Then you can go back and forth between text and footnotes, and this can probably be done without the extremely difficult task of finding out how footnotes are represented in the particular text. All one needs to do is detect where the footnote text starts at the bottom of the page, and this can probably be done in many texts without too much difficulty.

But I think the heuristics need to be page-global, based on physical spacing and text size, not local regular expressions.
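
As a sketch of the footnote diversion, assuming each page has already been split into body lines and footnote lines by some page-global heuristic: the code below builds one block of HTML for the main text and one for the notes, with a link from each page's text to its notes and back. The function name `build_sections`, the file names `text.xhtml` and `notes.xhtml`, and the `pageN`/`notesN` anchors are all made up for illustration. Note that it links per page, not per footnote marker, so it never needs to know how footnotes are numbered in the text.

```python
from html import escape


def build_sections(pages, text_file="text.xhtml", notes_file="notes.xhtml"):
    """Build (text_html, notes_html) from per-page (body_lines, footnote_lines).

    Each page's body becomes one paragraph with id "pageN". If the page has
    footnote lines, they go to the notes section as a paragraph with id
    "notesN", and the two paragraphs link to each other.
    """
    text_parts, note_parts = [], []
    for n, (body, notes) in enumerate(pages, start=1):
        para = escape(" ".join(body))
        if notes:
            # Link from the end of this page's text to its diverted notes.
            para += f' <a href="{notes_file}#notes{n}">[notes]</a>'
            note_parts.append(
                f'<p id="notes{n}">{escape(" ".join(notes))} '
                f'<a href="{text_file}#page{n}">[back]</a></p>'
            )
        text_parts.append(f'<p id="page{n}">{para}</p>')
    return "\n".join(text_parts), "\n".join(note_parts)
```

For example, `build_sections([(["Some body", "text."], ["1 A footnote."])])` gives a text paragraph ending in a `[notes]` link and a notes paragraph ending in a `[back]` link. A real version would also have to rejoin paragraphs that run across page boundaries, which this sketch does not attempt.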