Quote:
Originally Posted by tentimes
If it's a matter of a series of text boxes per page, then it's a matter of (assuming most pages don't overlap these box areas and overprint) taking the text boxes in order, getting the relative font sizes, assuming the large font sizes with the text form "Chapter XX" are start of chapter
|
Have you tried Calibre? The heuristic processing option does this sort of thing, but is disabled by default.
Quote:
Originally Posted by tentimes
"Chapter XX" are start of chapter if there is no internal byte code to give you end of chapter (which I bet there is)
|
Honestly, I'd be prepared to bet there isn't, but I'd like to be wrong.
I've found pdftohtml gives good results with some PDFs. Calibre and pdftohtml are both open source, so if you do decide to try to write something better, it might be worth having a look at how they do things.
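For what it's worth, the font-size idea from your first quote could be sketched roughly like this. This is only my guess at a starting point, not how Calibre or pdftohtml actually do it. It assumes you've already pulled the text boxes out of the PDF in reading order; the TextBox class, the function names, and the 1.3 size ratio are all made up for illustration.

```python
import re
from dataclasses import dataclass
from statistics import median


@dataclass
class TextBox:
    """Hypothetical container for one extracted text box."""
    page: int
    font_size: float
    text: str


# Matches text that begins with "Chapter" followed by digits or a Roman numeral.
CHAPTER_RE = re.compile(r"^\s*chapter\s+(\d+|[ivxlcdm]+)\b", re.IGNORECASE)


def find_chapter_starts(boxes, ratio=1.3):
    """Return indexes of boxes that look like chapter headings.

    A box counts as a heading when its font is at least `ratio` times the
    median font size (taken as the body text size) AND its text starts
    with "Chapter XX". The ratio is an arbitrary assumption to tune.
    """
    if not boxes:
        return []
    body_size = median(b.font_size for b in boxes)
    return [
        i for i, b in enumerate(boxes)
        if b.font_size >= ratio * body_size and CHAPTER_RE.match(b.text)
    ]


def split_chapters(boxes, ratio=1.3):
    """Split boxes into chapters; each chapter runs until the next heading."""
    starts = find_chapter_starts(boxes, ratio)
    ends = starts[1:] + [len(boxes)]
    return [boxes[s:e] for s, e in zip(starts, ends)]
```

Note that each chapter simply ends where the next heading begins, so this doesn't need any end-of-chapter marker in the file. Anything before the first heading (title page, contents) is dropped, which may or may not be what you want.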