MobileRead Forums - View Single Post

frostschutz · 03-03-2016, 05:52 PM

There is no "best" method, as it depends a lot on the PDF in question. Some PDFs can be converted perfectly: those that originated from a word processor, where the original source was cleanly formatted and the PDF does not use any obfuscation or visual elements. Other PDFs cannot be converted at all, so you have to resort to purely image-based OCR. It all depends on the situation.

My favourite way of converting PDF is to use poppler's pdftohtml -xml, which gives the coordinates and font properties of each line of text as attributes in the XML output, and then to write a small script specific to that PDF which turns it into a properly structured document. Doing it this way requires some scripting and regular expression knowledge.
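
To give a rough idea of what such a per-PDF script can look like, here is a minimal Python sketch. It is only an illustration, not a general converter: the sample XML is made up to resemble pdftohtml -xml output, and the function name to_html, the heading_size threshold and the rule "large font means heading, everything else is paragraph text" are assumptions that would need adjusting for each real PDF.

```python
import html
import xml.etree.ElementTree as ET

# Made-up sample that imitates the structure of `pdftohtml -xml` output.
# A real file would come from running the tool on an actual PDF.
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="0" size="24" family="Times" color="#000000"/>
<fontspec id="1" size="12" family="Times" color="#000000"/>
<text top="100" left="90" width="300" height="26" font="0"><b>Chapter One</b></text>
<text top="150" left="90" width="500" height="14" font="1">First line of a paragraph</text>
<text top="166" left="90" width="480" height="14" font="1">that continues here.</text>
</page>
</pdf2xml>"""


def to_html(xml_text, heading_size=18):
    """Turn pdftohtml-style XML into very simple HTML.

    Assumed rule (specific to the made-up sample): text in a font of at
    least `heading_size` is a heading; consecutive body lines are joined
    into one paragraph.
    """
    root = ET.fromstring(xml_text)
    sizes = {}   # font id -> font size, collected from <fontspec> elements
    out = []     # finished HTML lines
    para = []    # body lines of the paragraph currently being collected

    def flush():
        # Emit the collected body lines as a single paragraph.
        if para:
            out.append("<p>" + " ".join(para) + "</p>")
            para.clear()

    for page in root.iter("page"):
        for el in page:
            if el.tag == "fontspec":
                sizes[el.get("id")] = int(el.get("size"))
            elif el.tag == "text":
                # itertext() also picks up text inside inline <b>/<i> tags.
                line = html.escape("".join(el.itertext()).strip())
                if not line:
                    continue
                if sizes.get(el.get("font"), 0) >= heading_size:
                    flush()
                    out.append("<h1>" + line + "</h1>")
                else:
                    para.append(line)
    flush()
    return "\n".join(out)


print(to_html(SAMPLE))
```

For a real document you would read the XML file written by pdftohtml instead of SAMPLE, and most likely also use the top and left coordinates (for example to detect paragraph breaks, indentation or page headers) plus a few regular expressions to clean up the text.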
