Old 10-15-2011, 06:53 AM   #14
tentimes
Junior Member
tentimes began at the beginning.
 
Posts: 6
Karma: 10
Join Date: Oct 2011
Device: Kindle 4
Is it certain that PDF books don't contain a set of text boxes with the actual text still decipherable? I doubt that each page is just a bitmap of the rendered font. Apologies if I am wrong, but I am going through one now with a hex editor trying to make sense of it.

I thought that, rather than going the whole hog with OCR, we could pull out the series of draw-text commands, with the text sitting in boxes, and combine them with an intelligent interpretation of where the paragraphs fall.

I know I am new to this, but given the size of the files relative to the number of pages in the book, I would be surprised if this wasn't the case.

If it's a series of text boxes per page, then (assuming the boxes on most pages don't overlap and overprint) it's a matter of taking the text boxes in order, getting the relative font sizes, and treating large-font text of the form "Chapter XX" as the start of a chapter if there is no internal byte code to give you the end of a chapter (which I bet there is). That means doing a *slightly* (and it really is slight in terms of logic) better job of interpreting the structure.
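The heuristic above could be sketched roughly as follows, in Python. It assumes each page has already been reduced to a list of `(x, y, font_size, text)` tuples, with `y` growing upwards as in PDF page coordinates. The function name `find_chapter_starts`, the sample data, and the choice of "most common font size by character count" as the body size are illustrative assumptions, not anything defined by the PDF format.

```python
import re
from collections import Counter

# "Chapter" followed by an arabic or roman numeral, e.g. "Chapter 12", "Chapter XX".
CHAPTER_RE = re.compile(r"^\s*chapter\s+(\d+|[ivxlcdm]+)\b", re.IGNORECASE)


def find_chapter_starts(pages):
    """Return (page_index, heading_text) for every likely chapter heading.

    pages: list of pages, each a list of (x, y, font_size, text) tuples.
    """
    # Estimate the body font size as the size that carries the most characters.
    sizes = Counter()
    for page in pages:
        for _, _, size, text in page:
            sizes[size] += len(text)
    if not sizes:
        return []
    body_size = sizes.most_common(1)[0][0]

    starts = []
    for page_index, page in enumerate(pages):
        # Reading order: top of the page first (largest y), then left to right.
        ordered = sorted(page, key=lambda run: (-run[1], run[0]))
        for _, _, size, text in ordered:
            if size > body_size and CHAPTER_RE.match(text):
                starts.append((page_index, text.strip()))
    return starts


# Hypothetical extracted text runs for three pages.
pages = [
    [(72, 720, 24, "Chapter 1"),
     (72, 690, 11, "It was a dark and stormy night."),
     (72, 676, 11, "The rain fell.")],
    [(72, 720, 11, "Chapter 2 of the manual was missing."),
     (72, 706, 11, "Nobody noticed.")],
    [(72, 690, 11, "Morning came."),
     (72, 720, 24, "Chapter II")],
]

print(find_chapter_starts(pages))
```

Requiring both a larger font and the "Chapter XX" pattern keeps body sentences that merely begin with the word "Chapter" from being treated as headings.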

Can anyone point me in the direction of a good dissection of PDF as a format, please? I think I am going to have a go at this. If I do, then I undertake to make it open source. If I paid for a book once, I'm not paying for it again.