Old 12-25-2016, 05:12 PM   #5
MarjaE
Guru
Posts: 924
Karma: 53902736
Join Date: Jun 2015
Device: multiple
Okay.

Some of these have their own imperfect text layers. Splitting or compressing the documents often strips the text layers. (I use PDF Toolkit+.)

Some of these come from the Internet Archive and have OCR'd text versions. The big problems are that the OCR can garble tables, misread figures, and, of course, misread ordinary words. So I've needed either the PDF or the DjVu for comparison. Some don't have text versions at all.

If I can extract the text layer, then a spell-checker could help with the minor errors in English-language docs, such as punctuation substituted for letters. It's not so useful with the major errors. (I would prefer NeoOffice to LibreOffice for this, but neither can find and replace hyphen-breaks or extra line breaks, so I'd probably need Calibre's editing tools too.)
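
For the hyphen-breaks and extra line breaks, a small script run over the extracted text is another option. This is only a rough sketch, assuming the text layer has already been saved as plain text and that blank lines separate paragraphs; the function name `unwrap_ocr_text` is made up for illustration.

```python
import re


def unwrap_ocr_text(text: str) -> str:
    """Rejoin words split across lines and remove line breaks inside paragraphs."""
    # Join a word broken by a hyphen at a line end: "docu-\nment" -> "document".
    # Caveat: a genuine compound broken at its hyphen ("well-\nknown")
    # also loses the hyphen, so the result still needs a manual check.
    text = re.sub(r"(\w)-[ \t]*\n[ \t]*(\w)", r"\1\2", text)

    # Assumption: one or more blank lines mark a paragraph boundary.
    paragraphs = re.split(r"\n\s*\n", text)

    # Inside each paragraph, collapse line breaks and repeated spaces
    # into single spaces; drop paragraphs that are empty.
    cleaned = [" ".join(p.split()) for p in paragraphs if p.strip()]
    return "\n\n".join(cleaned)


sample = "The docu-\nment has extra\nline breaks.\n\nSecond para-\ngraph here."
print(unwrap_ocr_text(sample))
```

This does nothing about misread words or tables, so the PDF or DjVu would still be needed for comparison.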

If I can find, excerpt, and re-compress the relevant tables, I could perhaps use two versions: a PDF with the tables, and an EPUB or MOBI with the text. (I would keep using PDF Toolkit+.)