03-03-2016, 05:52 PM   #9
frostschutz
Linux User
There is no "best" method, as it depends a lot on the PDF in question. Some PDFs can be converted perfectly: those that originated from a word processor, had a well-formatted source, and do not use any obfuscation or purely visual elements. Other PDFs cannot be converted at all, so you have to resort to purely image-based OCR.

My favourite way of converting a PDF is to use poppler's pdftohtml -xml, which gives the coordinates and font properties of each line of text as attributes in its XML output, and then to write a small script specific to that PDF which turns the output into a properly structured document. Doing it this way requires some scripting and regular expression knowledge.
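
To give a rough idea of what such a script can look like, here is a minimal Python sketch. It assumes the XML was produced with something like `pdftohtml -xml input.pdf output` and that the output uses `<fontspec>` elements (with `id` and `size`) and `<text>` elements (with `top`, `left` and `font`). The function names, the `heading_size` and `para_gap` thresholds and the inline `SAMPLE` data are all made up for illustration; a real PDF will need its own rules.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the shape of pdftohtml -xml output (not from a real PDF).
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="0" size="20" family="Times" color="#000000"/>
<fontspec id="1" size="12" family="Times" color="#000000"/>
<text top="100" left="80" width="300" height="22" font="0"><b>Chapter One</b></text>
<text top="140" left="80" width="700" height="14" font="1">First line of a paragraph</text>
<text top="156" left="80" width="650" height="14" font="1">that continues here.</text>
<text top="200" left="80" width="400" height="14" font="1">A second paragraph.</text>
</page>
</pdf2xml>"""


def parse_lines(xml_text):
    """Return one dict per <text> element, in document order."""
    root = ET.fromstring(xml_text)
    # Font ids are shared across pages, so collect them for the whole file.
    sizes = {f.get("id"): int(f.get("size")) for f in root.iter("fontspec")}
    lines = []
    for page in root.iter("page"):
        for t in page.iter("text"):
            # itertext() also picks up text inside nested <b>/<i> tags.
            text = "".join(t.itertext()).strip()
            if not text:
                continue
            lines.append({
                "page": int(page.get("number")),
                "top": int(t.get("top")),
                "left": int(t.get("left")),
                "size": sizes.get(t.get("font"), 0),
                "text": text,
            })
    return lines


def to_blocks(lines, heading_size, para_gap):
    """Group lines into ("h", text) headings and ("p", text) paragraphs.

    Assumption for this sketch: large fonts are headings, and body lines
    on the same page that follow closely below each other belong together.
    """
    blocks = []
    prev = None
    for line in lines:
        kind = "h" if line["size"] >= heading_size else "p"
        continues = (
            kind == "p"
            and blocks
            and blocks[-1][0] == "p"
            and prev is not None
            and prev["page"] == line["page"]
            and 0 <= line["top"] - prev["top"] <= para_gap
        )
        if continues:
            blocks[-1] = ("p", blocks[-1][1] + " " + line["text"])
        else:
            blocks.append((kind, line["text"]))
        prev = line
    return blocks


if __name__ == "__main__":
    for kind, text in to_blocks(parse_lines(SAMPLE), heading_size=18, para_gap=20):
        print(("# " if kind == "h" else "") + text)
```

The thresholds are the part you tune per PDF: look at the `top` and `size` values in the XML first, then pick numbers that separate headings from body text and paragraph breaks from ordinary line spacing.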