Thread: PDF to text
07-28-2010, 04:08 PM   #17
frabjous
Quote:
Originally Posted by mike_bike_kite
I can't understand why there aren't simple post processors to process the text. Take the text output and join the lines together unless they end in a full stop, a question mark or a double quote.
Of course there are, and some of them use more sophisticated algorithms than this. I think most of the software mentioned in this thread already does it, with varying degrees of success. There are also dedicated tools such as ebook-tidy.
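
For reference, the simple rule described in the quote could be sketched roughly like this in Python. This is only an illustration of that heuristic, not how any of the tools mentioned here actually work; the function name join_wrapped_lines and the choice to treat blank lines as paragraph breaks are my own assumptions.

```python
# Characters that, per the quoted rule, mark the end of a paragraph line:
# a full stop, a question mark, or a double quote.
SENTENCE_ENDINGS = ('.', '?', '"')

def join_wrapped_lines(text):
    """Join hard-wrapped lines unless a line ends a sentence (hypothetical helper)."""
    paragraphs = []
    current = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # Assumption: a blank line always ends the current paragraph.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        current.append(line)
        if line.endswith(SENTENCE_ENDINGS):
            # The line ends in one of the listed characters: close the paragraph.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        # Keep any trailing text that never reached a sentence ending.
        paragraphs.append(" ".join(current))
    return "\n".join(paragraphs)

sample = 'It was a dark\nand stormy night.\n"Who goes\nthere?"\nNo answer'
print(join_wrapped_lines(sample))
```

The weakness is easy to see: any sentence that happens to end exactly at a line break in the middle of a paragraph causes a false split, and words hyphenated across lines are not rejoined. That is why the real tools need something smarter.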

Quote:
My aim was to finally generate HTML and then use the chapter titles to create a TOC. I got halfway there but considering how clever tools like Calibre are, it surprised me that this wasn't done automatically.
Calibre provides a place to enter a regular expression for deleting headers and footers, IIRC. pdfreflow attempts to do this automatically, but it is a difficult thing to get right, and I think such software naturally tends to err on the side of keeping something when in doubt rather than deleting it.
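
To give an idea of what regex-based header and footer removal looks like, here is a minimal Python sketch. It is not Calibre's or pdfreflow's actual code; the function name strip_headers_footers and the example patterns (a bare page number, a running title "My Book Title") are made up for illustration. It is deliberately conservative: a line is dropped only if the whole line matches a pattern.

```python
import re

def strip_headers_footers(text, patterns):
    """Drop lines that match a header/footer pattern in full (hypothetical helper)."""
    compiled = [re.compile(p) for p in patterns]
    kept = []
    for line in text.splitlines():
        # fullmatch keeps this cautious: a line that merely contains a
        # number or the title somewhere inside it is left alone.
        if any(rx.fullmatch(line.strip()) for rx in compiled):
            continue
        kept.append(line)
    return "\n".join(kept)

# Example patterns, assumed for illustration only.
patterns = [r"(Page\s+)?\d+", r"My Book Title"]
sample = "My Book Title\nChapter 1\nIt was 1984.\nPage 12\n42"
print(strip_headers_footers(sample, patterns))
```

Even this small example shows the risk: a chapter heading that consists only of a number would be deleted along with the page numbers, so the patterns have to be tuned per book.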