View Single Post
Old 11-22-2010, 06:05 PM   #4
frabjous
Wizard
frabjous can solve quadratic equations while standing on his or her head reciting poetry in iambic pentameterfrabjous can solve quadratic equations while standing on his or her head reciting poetry in iambic pentameterfrabjous can solve quadratic equations while standing on his or her head reciting poetry in iambic pentameterfrabjous can solve quadratic equations while standing on his or her head reciting poetry in iambic pentameterfrabjous can solve quadratic equations while standing on his or her head reciting poetry in iambic pentameterfrabjous can solve quadratic equations while standing on his or her head reciting poetry in iambic pentameterfrabjous can solve quadratic equations while standing on his or her head reciting poetry in iambic pentameterfrabjous can solve quadratic equations while standing on his or her head reciting poetry in iambic pentameterfrabjous can solve quadratic equations while standing on his or her head reciting poetry in iambic pentameterfrabjous can solve quadratic equations while standing on his or her head reciting poetry in iambic pentameterfrabjous can solve quadratic equations while standing on his or her head reciting poetry in iambic pentameter
 
frabjous's Avatar
 
Posts: 1,213
Karma: 12890
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
Personally, I would just use BRISS to cut each PDF page into readable sized chunks. You could do four, six or even more chunks per page.



I'd just leave the results in PDF format; so long as the chunks are readable, that's what really matters, and there's far less of a chance of losing formatting that way.

If you're really intent on extracting the text, I'd use HTML as your intermediate format. Something like Acrobat might be able to recognize the columns and extract the text appropriately. If you need a free solution, pdftohtml (used, e.g., alongside http://sourceforge.net/projects/pdfreflow/pdfreflow) claims to be able to do (and their webpage gives an example of such), but I've got mixed results from trying this. But in honesty, you'd still probably be best off using BRISS to separate the columns before attempting to extract the text. (Since these aren't usually true crops but just redefining the bounding boxes, though it's still not entirely clear this will work.)
frabjous is offline   Reply With Quote