MobileRead Forums - View Single Post - Good way to convert pdfs to epubs on the mac?

MarjaE · 04-02-2017, 02:00 AM

... Another few tries.

* If you have a good text layer and want to extract it, most macOS Sierra applications won't handle ligatures: they substitute blank spaces for the ff and fi ligatures, and probably for others. I understand they may not handle superscripts either.

* If you don't have a good text layer, you will need OCR to create one before you can extract it. Tesseract, e.g. via Elucidate, is good for short passages, but do you want to correct errors across an entire OCRed book? ABBYY FineReader might work better.

* Sometimes OCR merges columns in 2-column or 3-column layouts. Sometimes OCR splits apart the columns of a table. The more it avoids one error, the more likely it is to run into the other. Preprocessing before OCR makes text recognition errors more likely, but preprocessing with k2pdfopt might make column recognition errors less likely. I haven't tested this fix.

* If the original format isn't important, and if the ligature bug gets fixed, then extracting the text and manually re-inserting pictures and tables may be a workable fix. I haven't gotten this working yet, though.

* In the case of Internet Archive texts, there are usually EPUB and/or TXT versions as well as the PDF version.
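
On the ligature point above: if an extractor does keep the ligatures as single Unicode characters (U+FB00 to U+FB06) instead of dropping them, they can be expanded back into plain letters afterwards. A minimal Python sketch, assuming that is what your extracted text contains. The function names and the sample sentence are made up for illustration, and this won't recover ligatures that were already replaced with blank spaces.

```python
import unicodedata

# Latin ligature code points (ff, fi, fl, ffi, ffl, long-s-t, st).
LIGATURES = {chr(cp) for cp in range(0xFB00, 0xFB07)}


def find_ligatures(text):
    """Return the distinct ligature characters present in text, sorted."""
    return sorted({ch for ch in text if ch in LIGATURES})


def expand_ligatures(text):
    """Replace each ligature with its plain letters; leave everything else alone."""
    return "".join(
        unicodedata.normalize("NFKC", ch) if ch in LIGATURES else ch
        for ch in text
    )


# Hypothetical extracted text containing the ffi, ff and fi ligatures.
sample = "The o\ufb03ce sta\ufb00 \ufb01led the report."
print(find_ligatures(sample))
print(expand_ligatures(sample))
```

Only the ligature characters are normalized here, so superscripts and other characters in the text are left as they are rather than being flattened by NFKC.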
