MobileRead Forums - View Single Post - PDF to Kindle: The unobtainable Holy Grail of ebooks

tentimes · 10-15-2011, 06:53 AM

Is it certain that PDF books do not contain a set of text boxes with the actual text still decipherable? I doubt the pages are stored as bitmaps of the rendered font. Apologies if I am wrong, but I am going through one now with a hex editor, trying to make sense of it.

I thought that, rather than going the whole hog with OCR, we could extract the series of draw-text commands, with the text sitting in boxes. Taking those together with an intelligent interpretation of the paragraphs, it might be possible to rebuild the book.
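To make the draw-text idea concrete, here is a rough sketch of pulling strings out of a page content stream. It assumes the stream has already been decompressed (many PDFs store it compressed, which is why a hex editor often shows binary) and that the font uses a plain encoding, which is not always true. It looks only for the `Tj` and `TJ` text-showing operators; `extract_text` and `unescape` are names made up for this example, and the escape handling is deliberately simplified.

```python
import re

# A PDF literal string such as (Hello \(world\)), allowing backslash escapes.
# Simplification: nested unescaped parentheses are not handled.
LITERAL = r"\((?:\\.|[^\\()])*\)"

# Either "(text) Tj" or "[(te) -20 (xt)] TJ" (strings mixed with kerning numbers).
SHOW_TEXT = re.compile(
    rf"({LITERAL})\s*Tj|\[((?:\s*(?:{LITERAL}|-?\d*\.?\d+))*)\s*\]\s*TJ"
)


def unescape(literal: str) -> str:
    """Strip the outer parentheses and drop the backslash from escapes.

    Simplified assumption: only \\(, \\) and \\\\ are treated correctly;
    octal codes and escapes like \\n are not decoded.
    """
    return re.sub(r"\\(.)", r"\1", literal[1:-1])


def extract_text(stream: str) -> list[str]:
    """Return the strings shown by Tj/TJ operators, in stream order."""
    shown = []
    for match in SHOW_TEXT.finditer(stream):
        if match.group(1) is not None:
            # Single string shown with Tj.
            shown.append(unescape(match.group(1)))
        else:
            # Array shown with TJ: join the string pieces, ignore the numbers.
            pieces = re.findall(LITERAL, match.group(2))
            shown.append("".join(unescape(p) for p in pieces))
    return shown


sample = "BT /F1 12 Tf 72 700 Td (Hello) Tj 0 -14 Td [(Wor) -20 (ld)] TJ ET"
print(extract_text(sample))
```

Stream order is not guaranteed to be reading order, so the positions set by operators such as `Td` would still need to be tracked to place each string on the page.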

I know I am new to this, but given the size of the files relative to the number of pages in the book, I would be surprised if this wasn't the case.

If it's a matter of a series of text boxes per page, then (assuming most pages don't overlap these box areas and overprint) it's a matter of taking the text boxes in order, getting the relative font sizes, and assuming that text in a large font size of the form "Chapter XX" marks the start of a chapter if there is no internal byte code to give you the end of a chapter (which I bet there is). That would do a *slightly* (and it really is slightly, in terms of logic) better job of interpreting the structure.
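The heuristic above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `TextBox` record and its fields are hypothetical (they stand in for whatever a real extractor would produce), the body font size is guessed as the median size, and the sort assumes a single column with y increasing up the page, as in PDF coordinates.

```python
import re
from dataclasses import dataclass
from statistics import median


@dataclass
class TextBox:
    """Hypothetical record for one extracted text box."""
    page: int
    x: float
    y: float          # PDF-style coordinate: larger y is higher on the page
    font_size: float
    text: str


CHAPTER = re.compile(r"^\s*Chapter\s+\w+", re.IGNORECASE)


def reading_order(boxes):
    """Sort by page, then top to bottom, then left to right (single column)."""
    return sorted(boxes, key=lambda b: (b.page, -b.y, b.x))


def split_chapters(boxes):
    """Group boxes into (heading, [paragraph texts]) pairs.

    A box counts as a chapter heading when its font is larger than the
    median (assumed body) size and its text looks like "Chapter XX".
    Text before the first heading goes under an empty heading.
    """
    ordered = reading_order(boxes)
    if not ordered:
        return []
    body_size = median(b.font_size for b in ordered)
    chapters = []
    for box in ordered:
        is_heading = box.font_size > body_size and bool(CHAPTER.match(box.text))
        if is_heading:
            chapters.append((box.text.strip(), []))
        else:
            if not chapters:
                chapters.append(("", []))
            chapters[-1][1].append(box.text)
    return chapters


boxes = [
    TextBox(2, 72, 650, 11, "Third para."),
    TextBox(1, 72, 620, 11, "Second para."),
    TextBox(1, 72, 700, 24, "Chapter 1"),
    TextBox(2, 72, 700, 24, "Chapter 2"),
    TextBox(1, 72, 650, 11, "First para."),
]
for heading, paragraphs in split_chapters(boxes):
    print(heading, paragraphs)
```

Multi-column pages, overlapping boxes, and headings that don't use the word "Chapter" would all need more than this.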

Can anyone point me in the direction of a good dissection of PDF as a format, please? I think I am going to have a go at this. If I do, then I undertake to make it open source. If I paid for a book once, I'm not paying for it again.