MobileRead Forums - View Single Post

DSpider · 01-24-2012, 04:34 AM

I'll just copy-paste this because I'm tired of explaining it every single day, in some form or another:

"PDF is the worst possible format to convert FROM. It was designed as an output format. This subject has been beaten to death around here because a lot of PDFs aren't tagged PDFs, meaning that letters (and often small groups of letters) are something like floating objects on a blank page, each with its own coordinates and extra baggage. So it's very difficult to get a 1:1 conversion. A lot of formatting will be lost, some will get interpreted wrong, etc."
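To make the "floating objects" point concrete, here's a rough Python sketch of the guessing a converter has to do. It is not how any particular tool works: the `Fragment` class, the `rebuild_lines` and `join_line` helpers, the tolerances and the sample coordinates are all made up for illustration. It assumes you've already pulled text fragments and their positions out of the page somehow.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One positioned run of letters (hypothetical structure, for illustration)."""
    text: str
    x: float      # left edge, in points
    y: float      # baseline, measured from the top of the page
    width: float  # horizontal extent, in points


def join_line(frags, space_gap=1.5):
    """Order fragments left to right; guess a space wherever the gap is wide."""
    frags = sorted(frags, key=lambda f: f.x)
    out = frags[0].text
    for prev, cur in zip(frags, frags[1:]):
        if cur.x - (prev.x + prev.width) > space_gap:
            out += " "
        out += cur.text
    return out


def rebuild_lines(fragments, y_tolerance=2.0):
    """Group fragments with nearly equal baselines into lines, top to bottom."""
    lines = []  # each entry: (baseline of first fragment, [fragments])
    for frag in sorted(fragments, key=lambda f: f.y):
        if lines and abs(frag.y - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append((frag.y, [frag]))
    return [join_line(frags) for _, frags in lines]


# Fragments in the order they might be stored in the file (invented data).
sample = [
    Fragment("rld", 113, 100.0, 15),
    Fragment("Bye", 72, 114.5, 16),
    Fragment("Hel", 72, 100.0, 15),
    Fragment("wo", 101, 100.4, 12),
    Fragment("lo", 87, 100.0, 10),
]

print(rebuild_lines(sample))
```

Note that spaces, line breaks and reading order are all guessed from the coordinates, not read from the file. Add two columns, a footnote or a header and those guesses start going wrong, which is where the scrambled output comes from.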

Adobe Reader (which is free) can export to .txt, but you'll lose a lot of formatting (italics, bold, etc.) and there's no guarantee you won't end up with paragraphs dumped at the end of the document or shuffled into a different order. It's always better to go back to the original source (the initial .rtf, .doc, .docx, .odt, etc. file) and go from there using OpenOffice/LibreOffice, Atlantis, Word and so on.

Or you could re-OCR the PDF with ABBYY FineReader and go from there.
