Old 11-05-2021, 07:52 PM   #46
ownedbycats
Posts: 8,644
Karma: 61234567
Join Date: Oct 2018
Location: Canada
Device: Kobo Libra H2O, formerly Aura HD
Quote:
Originally Posted by retiredbiker
The OCR Internet Archive uses is so bad that it is easier to do it myself than to try to correct any of their text formats, especially for magazines. Even text scraped off a pdf is full of errors. So I start with one of the cbr or cbz files. If only a pdf is available, I will use pdfimages to get the images out and use those.
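For anyone wanting to script the pdfimages step, here is a rough sketch of how it might look in Python. It assumes pdfimages (from poppler-utils) is installed and on the PATH; the function names, the "page" output prefix, and the choice of PNG output are illustrative assumptions, not something taken from the post above.

```python
import shutil
import subprocess
from pathlib import Path


def pdfimages_command(pdf_path, out_prefix):
    """Build the pdfimages command line (hypothetical helper)."""
    # -png asks pdfimages to write PNG files; the last argument is the
    # prefix used to name the outputs (prefix-000.png, prefix-001.png, ...).
    return ["pdfimages", "-png", str(pdf_path), str(out_prefix)]


def extract_images(pdf_path, out_dir):
    """Run pdfimages and return the extracted PNG paths in sorted order."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    cmd = pdfimages_command(pdf_path, out_dir / "page")
    if shutil.which(cmd[0]) is None:
        raise RuntimeError("pdfimages (poppler-utils) is not on PATH")
    subprocess.run(cmd, check=True)
    return sorted(out_dir.glob("page-*.png"))
```

Calling extract_images("issue.pdf", "scans") would leave the page images in a scans folder, ready for OCR. Some scanned pdfs store each page as several image layers, so the output may need sorting out by hand.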

For the OCR, I use tesseract with a GUI front end called OCRFeeder (on Linux). On each page I select the text I want and then recognise it. This lets me do multi-column magazines, avoid advertisements, deal with "continued on page 161", and so on. I copy each column of text into LO Writer. It's pretty fast: for a two-column magazine page I average about 50 seconds for the select-recognise-copy-paste part. At the end I convert the odt file from Writer to epub using Calibre, and touch it up in the editor.
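OCRFeeder is a GUI, but the recognise step on a single column can be approximated by calling the tesseract command line directly. The sketch below is only that: it assumes tesseract is installed, that each column has already been cropped into its own image file, and that English text in a single uniform block (page segmentation mode 6) is a reasonable guess. The function names are made up for illustration.

```python
import subprocess


def tesseract_command(image_path, lang="eng", psm=6):
    """Build a tesseract command line that prints the text to stdout."""
    # "stdout" as the output base makes tesseract print the text instead
    # of writing a .txt file. --psm 6 treats the image as one uniform
    # block of text, an assumption that suits a pre-cropped column.
    return ["tesseract", str(image_path), "stdout", "-l", lang, "--psm", str(psm)]


def ocr_column(image_path, lang="eng"):
    """Run tesseract on one cropped column image and return its text."""
    result = subprocess.run(
        tesseract_command(image_path, lang),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

This skips everything the GUI selection gives you (picking columns, dodging advertisements), so it is closer to a batch fallback than a replacement for the workflow described above.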

OCRFeeder does a great job of finding correct paragraphs, dealing with end-of-line hyphens, and so on, so there is very little detail formatting needed. I've a handful of saved styles in Writer for chapter headings, notes or letters or signs in the text, poetry and so on.
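How OCRFeeder handles this internally isn't covered here, but the general idea of rejoining end-of-line hyphens and rebuilding paragraphs can be sketched in a few lines of Python. This is a naive illustration under the assumption that blank lines separate paragraphs; it will also wrongly merge genuine compounds that happen to break at a hyphen (such as "well-" / "known"), which is one more thing for the proofreading pass to catch.

```python
import re


def join_lines(raw):
    """Rejoin hyphenated line breaks and unwrap lines within paragraphs."""
    # Blank lines are assumed to mark paragraph boundaries.
    paragraphs = re.split(r"\n\s*\n", raw.strip())
    cleaned = []
    for para in paragraphs:
        # "detec-\ntive" -> "detective": drop a hyphen that sits between
        # a word character at line end and one at the next line start.
        text = re.sub(r"(\w)-\n(\w)", r"\1\2", para)
        # Remaining line breaks inside a paragraph become single spaces.
        text = re.sub(r"\s*\n\s*", " ", text)
        cleaned.append(text.strip())
    return "\n\n".join(cleaned)
```

For example, join_lines("The detec-\ntive ran\ndown the hall.") gives "The detective ran down the hall."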

Of course there are scannos--proofreading the result is the most time-consuming part of the work, by far. The clarity of the original print job, and of the image, determines how many errors you get. A really, really clear image of excellent printing might give 1 or 2 errors per page, but if the print is very blurry and there are lots of dirty marks, it could be 100 per page. So with some source files I'll take a look and say "no thanks".

Labour intensive, yes, but it's a hobby. I might spend 4 or 5 days on an issue of something like Dime Detective, with maybe 8 stories, 75,000 words and 12 illustrations.
Also, what is up with Internet Archive's PDF compression? It rarely renders correctly on my Kobo, and instead I just get an image of text smudges. Even on my PC it's slow to render.

It doesn't matter much now because they semi-recently removed the option to ADE-download (and thus sideload) most of their Open Library books, but it still affects the public-domain stuff.

Last edited by ownedbycats; 11-06-2021 at 01:27 AM.