MobileRead Forums - View Single Post - Opinions on Archive.org as free ebook source?

retiredbiker · 06-23-2024, 03:22 PM

How to make a good EPUB from an IA PDF

The Internet Archive PDFs have become most of the raw material for my retirement hobby: getting pulp magazine and similar stories (and whole issues) into good, illustrated epub books. Work yes, but with good tools I can often do a 70,000-word magazine issue or novel in a few days. (The first one I ever did probably took two months!)

General method, if anyone wants to try. I do all this on a Linux box, but most everything or an equivalent is probably available on any platform:

Find the PDF I want, for example Dime Detective, January 1950. (If the book is "borrow for an hour", you might need https://financial-accounting-acg2021.../download.html to help get the PDF. The Calibre DeACSM plugin works for this without Adobe.)

Pull out usable page images with pdftopng (part of Xpdf) or Poppler's equivalent, pdftoppm -png (https://poppler.freedesktop.org)
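For anyone who would rather script this step, here is a minimal sketch that renders PDF pages to PNG files. It assumes Poppler's pdftoppm is installed and on the PATH; the function names, the 300 dpi default and the "page" output prefix are my own choices for illustration, not something from the workflow above.

```python
import subprocess
from pathlib import Path


def build_pdftoppm_command(pdf_path, out_dir, dpi=300, first=None, last=None):
    """Return the argument list for rendering PDF pages to PNG with pdftoppm."""
    cmd = ["pdftoppm", "-png", "-r", str(dpi)]
    if first is not None:
        cmd += ["-f", str(first)]  # first page to render
    if last is not None:
        cmd += ["-l", str(last)]   # last page to render
    # pdftoppm appends the page number and ".png" to this output prefix.
    cmd += [str(pdf_path), str(Path(out_dir) / "page")]
    return cmd


def render_pages(pdf_path, out_dir, **kwargs):
    """Create the output directory and run pdftoppm (assumed to be installed)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_pdftoppm_command(pdf_path, out_dir, **kwargs), check=True)
```

Calling render_pages("issue.pdf", "pages", first=1, last=10) would render only the first ten pages, which is handy for checking quality before doing a whole issue.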

If the page images are really bad, which is rare, clean them up with ImageMagick or ScanTailor Advanced. https://imagemagick.org/script/download.php and/or https://github.com/4lex4/scantailor-advanced
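As a rough idea of what an ImageMagick cleanup could look like when scripted, the sketch below builds a command that converts a page to grayscale, deskews it, removes speckles and stretches the contrast. The particular options and their order are an assumption on my part, not the settings used for the books described here, and the program name is "convert" on ImageMagick 6 but "magick" on ImageMagick 7.

```python
import subprocess


def build_cleanup_command(src, dst, program="convert"):
    """Return an ImageMagick argument list for a basic scan cleanup."""
    return [
        program, str(src),
        "-colorspace", "Gray",  # drop color casts from yellowed paper
        "-deskew", "40%",       # straighten slightly rotated scans
        "-despeckle",           # remove small noise dots
        "-normalize",           # stretch contrast to the full range
        str(dst),
    ]


def clean_page(src, dst, program="convert"):
    """Run the cleanup; assumes ImageMagick is installed and on the PATH."""
    subprocess.run(build_cleanup_command(src, dst, program), check=True)
```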

Run OCR page by page with Tesseract, using the OCRFeeder front end: do each column one at a time, skip the ads, and follow "continued on page nnn" instructions. This sounds horrible but is actually quite quick, several pages a minute with practice. https://wiki.gnome.org/action/show/Apps/OCRFeeder
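OCRFeeder does the column selection by hand in a GUI. Just to illustrate the column-at-a-time idea, here is a sketch that splits a page into equal-width column boxes and builds a Tesseract command line for one cropped column. The equal-width layout, the margin and gutter parameters and the function names are all assumptions; real pulp pages with ads need the manual selection described above.

```python
def column_boxes(page_width, page_height, n_columns=2, margin=0, gutter=0):
    """Split a page into equal-width column boxes as (left, top, right, bottom)."""
    usable = page_width - 2 * margin - gutter * (n_columns - 1)
    col_width = usable // n_columns
    boxes = []
    for i in range(n_columns):
        left = margin + i * (col_width + gutter)
        boxes.append((left, margin, left + col_width, page_height - margin))
    return boxes


def build_tesseract_command(image_path, out_base, lang="eng", psm=6):
    """Return a Tesseract argument list; output goes to out_base + '.txt'.

    Page segmentation mode 6 treats the image as a single uniform block
    of text, which suits an image that holds one cropped column.
    """
    return ["tesseract", str(image_path), str(out_base), "-l", lang, "--psm", str(psm)]
```

The boxes use the (left, top, right, bottom) order that Pillow's Image.crop accepts, so each column can be cropped to its own file before OCR.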

Take the page-by-page text output and copy it into LibreOffice Writer. I have made a template (.ott file) with about 15 custom styles that handle almost all needs for fiction of this type.

I use GIMP to extract and fix up images I want in the book. Scale large images to a maximum of 1200 px in the largest dimension, since the target is e-ink readers. In Writer, anchor images "as characters" and keep it simple.
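The scaling rule is simple arithmetic. This small sketch computes the target size for an image so that its longest side is at most 1200 px while keeping the aspect ratio; images already small enough are left alone. The function name is made up for illustration, and the result is only the numbers you would type into GIMP's Scale Image dialog or pass to any other tool.

```python
def fit_within(width, height, max_side=1200):
    """Return (width, height) scaled so the longest side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # never enlarge small images
    new_width = max(1, round(width * max_side / longest))
    new_height = max(1, round(height * max_side / longest))
    return new_width, new_height
```

For example, a 2400 x 3600 px cover scan comes out at 800 x 1200 px.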

Proofread and correct the Writer document, formatting it with the custom styles. I do this maybe 20 or 30 pages at a time. Each input doc seems to give repeated OCR errors due to typesetting and/or scanning, so this allows find-and-replace corrections in large chunks. (I wish Writer had saved searches like Sigil or Calibre Editor.)
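Since Writer has no saved searches, one workaround is to keep a list of recurring fixes in a small script and run it over the raw OCR text before pasting it into Writer. This is only a sketch: the three rules below are invented examples of typical OCR errors, and every scan will need its own list.

```python
import re

# Hypothetical examples of recurring OCR errors; build a list per document.
SAVED_SEARCHES = [
    # Rejoin words split across a line break. This also joins genuine
    # hyphenated compounds that happen to break there, so proofread after.
    (r"(\w)-\n(\w)", r"\1\2"),
    (r"\btbe\b", "the"),   # common misreading of "the"
    (r"[ \t]{2,}", " "),   # collapse runs of spaces or tabs
]


def apply_saved_searches(text, rules=SAVED_SEARCHES):
    """Apply each (pattern, replacement) pair in order and return the result."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text
```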

Import the .odt file into Sigil using the new version of the ODT Import plugin--it is just terrific. I have a custom css file in the plugin that exactly matches the styles in my Writer template file, but dimensions in em instead of pt or cm. Just delete the default css and the custom one takes over like magic. This setup is in the manual.
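Converting the Writer dimensions to em is just a division by the body font size. The sketch below assumes a 12 pt body font in the Writer template, which is a guess for illustration; substitute whatever the template really uses. The function names are also made up.

```python
def pt_to_em(points, base_font_pt=12.0):
    """Convert a length in points to em, relative to the body font size."""
    return round(points / base_font_pt, 3)


def cm_to_em(cm, base_font_pt=12.0):
    """Convert centimetres to em (1 inch = 2.54 cm = 72 pt)."""
    return pt_to_em(cm / 2.54 * 72, base_font_pt)


def css_length(points, base_font_pt=12.0):
    """Format a point value as a CSS em length, e.g. for margins or indents."""
    return f"{pt_to_em(points, base_font_pt):g}em"
```

With a 12 pt body font, an 18 pt paragraph indent in Writer becomes text-indent: 1.5em in the matching CSS.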

This input plugin works so well that the only work needed at the epub code level is a little fix-up on images, 3 or 4 standard code changes I like, and adding metadata. Depending on the doc, I use Sigil or the Calibre Editor for this--they have different tools, both are very good. I almost never have to touch the actual coding of any book text.

No one-button conversion is ever going to get you a book like this from old, multi-column text full of advertisements!

Remaining problem: what to do with the resulting books? Most of the stuff I do is really old, but I'm not about to try and understand the copyright issues. You can probably find some of this if you search in likely places.
