04-02-2017, 02:00 AM   #19
MarjaE
... Another few tries.

* If you have a good text layer and want to extract it, note that most macOS Sierra applications don't handle ligatures: they substitute blank spaces for the ff and fi ligatures, and probably others. I understand Sierra may not handle superscripts either.

* If you don't have a good text layer, you will need OCR to create one before you can extract it. Tesseract (e.g. via Elucidate) is good for short passages, but do you want to correct errors across an entire OCRed book? ABBYY FineReader might work better.

* Sometimes OCR merges columns in 2-column or 3-column layouts. Sometimes OCR splits the columns of a table. The more it avoids one error, the more likely it is to run into the other. Preprocessing before OCR makes text recognition errors more likely, but preprocessing with k2pdfopt might make column recognition errors less likely. I haven't tested this fix.

* If the original format isn't important, and if the ligature bug gets fixed, then extracting the text and manually re-inserting pictures and tables may be a workable fix. I haven't gotten this working yet, though.

* Internet Archive texts usually have EPUB and/or TXT versions as well as the PDF version.
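
On the ligature point above: if your extraction tool does keep the ligature characters instead of blanking them (that's an assumption, and it's exactly what the Sierra apps I tried don't do), a few lines of Python can expand them into plain letters afterwards. This is only a rough sketch. The function name expand_ligatures is mine, not from any tool, and it can't recover ligatures that have already been replaced by spaces.

```python
# Sketch: expand Latin ligature characters into plain letters.
# Assumes the extracted text still contains the ligature code points
# (U+FB00..U+FB06); it cannot repair text where they became spaces.

LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t ligature
    "\ufb06": "st",
}

# Translation table mapping each ligature code point to its letters.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Return text with Latin ligatures replaced by their component letters."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(expand_ligatures("e\ufb03cient o\ufb00er, \ufb01rst \ufb02oor"))
```

I used an explicit table rather than unicodedata.normalize("NFKC", ...) because NFKC also turns superscript digits into ordinary digits, which you may not want if you're trying to keep footnote markers.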