MobileRead Forums - View Single Post

Elfwreck · 03-15-2012, 11:51 AM

FWIW, extracting text *mostly* works. I'd say 85% or more of text-based PDFs (not scans) convert fairly well to Word or HTML formats... and then need cleanup. Remove the running headers & page numbers, which extract as ordinary body text. Get rid of the forced paragraph breaks at the ends of pages. Find the chapter headers and fix them. (They might be fine. They might have been flattened to plain text, depending on various font issues.) Look for sets of short lines of text--dialogue especially--that were all crammed into one paragraph.

The text itself tends to extract fine (if there weren't columns or magazine layouts to deal with), but the formatting needs a thorough touchup to be useful.
