I do old multi-column magazine stories, with the PDF coming from, say, the Internet Archive. Any text already in these PDFs is worthless; it would take forever to correct it by hand.
So my method is to use various Linux tools. Get the images out with pdftopng or pdfimages. If they are really terrible, run them through ScanTailor Advanced. Minor corrections can be done with ImageMagick. Do my own OCR using OCRFeeder, a front-end for Tesseract. OCRFeeder handles the multi-column problem because it can do one column at a time, which also lets me avoid advertisements, handle the "continued on page 99" situation, and so on. Copy the OCR text into LibreOffice, proof it there, bring it into Calibre, and convert to EPUB. Tweak the code in the Calibre editor as needed. Any images, tables and the like can be dealt with as necessary, case by case. I use GIMP to handle any image editing needed.
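For anyone who would rather script the mechanical part of the column step than click through OCRFeeder, here is a rough Python sketch of the same idea: crop one column at a time out of a page image and feed each crop to Tesseract. This is not my actual workflow, only an illustration. The function names and the pixel boxes are made up, you still have to pick the column boxes by eye for every page, and it assumes pdfimages (Poppler), ImageMagick's convert (newer installs may call it magick) and tesseract are on your PATH.

```python
import subprocess
from pathlib import Path


def extract_cmd(pdf, out_prefix):
    """Build a pdfimages call that dumps the embedded page scans as PNG files."""
    return ["pdfimages", "-png", str(pdf), str(out_prefix)]


def crop_cmd(page_png, box, out_png):
    """Build an ImageMagick call that cuts one column out of a page image.

    box is (width, height, x_offset, y_offset) in pixels, chosen by hand.
    """
    w, h, x, y = box
    geometry = f"{w}x{h}+{x}+{y}"
    return ["convert", str(page_png), "-crop", geometry, "+repage", str(out_png)]


def ocr_cmd(column_png, out_base, lang="eng"):
    """Build a tesseract call; --psm 6 treats the crop as a single block of text."""
    return ["tesseract", str(column_png), str(out_base), "-l", lang, "--psm", "6"]


def ocr_columns(page_png, boxes, workdir):
    """Crop each box in the order given, OCR it, and return the joined text."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    chunks = []
    for i, box in enumerate(boxes):
        col_png = workdir / f"col{i}.png"
        out_base = workdir / f"col{i}"
        subprocess.run(crop_cmd(page_png, box, col_png), check=True)
        subprocess.run(ocr_cmd(col_png, out_base), check=True)
        # tesseract appends .txt to the output base name it is given
        chunks.append((workdir / f"col{i}.txt").read_text(encoding="utf-8"))
    return "\n".join(chunks)
```

You would call something like ocr_columns("page-003.png", [(900, 2800, 100, 150), (900, 2800, 1050, 150)], "work"), listing the boxes in reading order and simply leaving out any box that covers an advertisement. The output still needs proofing afterwards.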
Yes, this is labour-intensive. But it works and ends up with a really good EPUB. You will never find a script-kiddie solution to getting good results out of a multi-column PDF, especially when there are all sorts of interruptions to nice clean columns.