10-19-2011, 08:51 AM   #38
avantman42
Wizard
Quote:
Originally Posted by tentimes View Post
If it's a matter of a series of text boxes per page, then it's a matter of (assuming most pages don't overlap these box areas and overprint) taking the text boxes in order, getting the relative font sizes, assuming that large-font text of the form "Chapter XX" marks the start of a chapter
Have you tried Calibre? The heuristic processing option does this sort of thing, but is disabled by default.

Quote:
Originally Posted by tentimes View Post
"Chapter XX" are start of chapter if there is no internal byte code to give you end of chapter (which I bet there is)
Honestly, I'd be prepared to bet there isn't, but I'd like to be wrong.

I've found pdftohtml gives good results with some PDFs. Calibre and pdftohtml are both open source, so if you do decide to try and write something better, it might be worth having a look at how they do things.
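
For what it's worth, the font-size heuristic tentimes describes could be sketched roughly like this. This is only an illustration, not how Calibre or pdftohtml actually do it: the input format (a list of text/font-size pairs already in reading order), the function name find_chapter_starts and the 1.2 size ratio are all assumptions of mine, and getting the boxes out of the PDF in the first place is the hard part that this skips.

```python
import re
from statistics import median

# Matches "Chapter 12" or "Chapter XII" at the start of a text box.
CHAPTER_RE = re.compile(r"^\s*chapter\s+(\d+|[ivxlcdm]+)\b", re.IGNORECASE)


def find_chapter_starts(boxes, ratio=1.2):
    """Return indices of boxes that look like chapter headings.

    boxes: hypothetical list of (text, font_size) pairs in reading order.
    ratio: assumed threshold; a heading must be at least this many times
           larger than the median (body) font size.
    """
    if not boxes:
        return []
    # Treat the median font size as the body text size.
    body_size = median(size for _, size in boxes)
    return [
        i
        for i, (text, size) in enumerate(boxes)
        if size >= body_size * ratio and CHAPTER_RE.match(text)
    ]


# Made-up sample data, purely for illustration.
boxes = [
    ("Chapter 1", 18.0),
    ("It was a dark and stormy night.", 11.0),
    ("He mentioned Chapter 2 in passing.", 11.0),
    ("Chapter 2", 18.0),
    ("The rain had stopped by morning.", 11.0),
]
print(find_chapter_starts(boxes))
```

Requiring both the larger font and the "Chapter XX" pattern stops body text that merely mentions a chapter from being picked up, but it will obviously miss books that title their chapters some other way.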