MobileRead Forums - View Single Post

frabjous · 11-22-2010, 06:05 PM

Personally, I would just use BRISS to cut each PDF page into readable sized chunks. You could do four, six or even more chunks per page.

I'd just leave the results in PDF format; so long as the chunks are readable, that's what really matters, and there's far less of a chance of losing formatting that way.

If you're really intent on extracting the text, I'd use HTML as your intermediate format. Something like Acrobat might be able to recognize the columns and extract the text appropriately. If you need a free solution, pdftohtml (used, e.g., alongside http://sourceforge.net/projects/pdfreflow/pdfreflow) claims to be able to do (and their webpage gives an example of such), but I've got mixed results from trying this. But in honesty, you'd still probably be best off using BRISS to separate the columns before attempting to extract the text. (Since these aren't usually true crops but just redefining the bounding boxes, though it's still not entirely clear this will work.)

11-22-2010, 06:05 PM	#4
frabjous Wizard Posts: 1,213 Karma: 12890 Join Date: Feb 2009 Location: Amherst, Massachusetts, USA Device: Sony PRS-505	Personally, I would just use BRISS to cut each PDF page into readable sized chunks. You could do four, six or even more chunks per page. I'd just leave the results in PDF format; so long as the chunks are readable, that's what really matters, and there's far less of a chance of losing formatting that way. If you're really intent on extracting the text, I'd use HTML as your intermediate format. Something like Acrobat might be able to recognize the columns and extract the text appropriately. If you need a free solution, pdftohtml (used, e.g., alongside http://sourceforge.net/projects/pdfreflow/pdfreflow) claims to be able to do (and their webpage gives an example of such), but I've got mixed results from trying this. But in honesty, you'd still probably be best off using BRISS to separate the columns before attempting to extract the text. (Since these aren't usually true crops but just redefining the bounding boxes, though it's still not entirely clear this will work.)