#1
Connoisseur
Posts: 63
Karma: 43710
Join Date: Jun 2008
Device: zaurus->palm->iPad->Sony PRS-T1,T2,T3->KoboForma&Likebook Ares->Palma2
Optimize PDFs from archive.org for E-Ink devices
The Internet Archive at archive.org has a lot of interesting books for borrowing and downloading. I have downloaded some older books that are difficult to read on E-Ink devices because the scans include the page background, which has yellowed. The contrast is low, the text is unclear, and the files are quite big. Does anybody know a good way to trim such PDFs for ereaders? I would prefer a command-line tool on a Linux-based system, if one is available.
An example of the PDFs I am looking at is this: https://archive.org/details/smtliche...ge/n8/mode/2up (this is the item page; the download link is https://archive.org/download/smtlich...r16goet_bw.pdf).
Any help appreciated,
Ctop
#2
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Scan Tailor Advanced: https://github.com/4lex4/scantailor-advanced
There isn't another tool like it.
If you want commandline, then there's nothing better than ImageMagick, but you'll have to come up with all the tweaks yourself.
There was also "What’s your “image rehab” routine?" from 2013, which discussed some image cleanup ideas, although that mostly focused on cleaning up images within scans.
Side Note: Archive.org's B&W versions are usually okay. In this case, it requires lots of manual intervention. Go back to the color PDF (or, like GrannyGrump mentions in the thread above, use the original JPEG2000 files), and do all your cleaning from there.
This specific file also has a lot of bleed-through between pages, so that may make your job even harder when trying to darken text.
Quote:
Last edited by Tex2002ans; 02-25-2020 at 11:57 PM.
#3
Connoisseur
Posts: 63
Karma: 43710
Join Date: Jun 2008
Device: zaurus->palm->iPad->Sony PRS-T1,T2,T3->KoboForma&Likebook Ares->Palma2
Quote:
All the best, Ctop
#4
Unicycle Daredevil
Posts: 13,944
Karma: 185432100
Join Date: Jan 2011
Location: Planet of the Pudding Brains
Device: Aura HD (R.I.P. After six years the USB socket died.) tolino shine 3
Why not fix the epub and upload it to the MR library? Will be much nicer on your reader, and also a service to the community.
#5
Connoisseur
Posts: 63
Karma: 43710
Join Date: Jun 2008
Device: zaurus->palm->iPad->Sony PRS-T1,T2,T3->KoboForma&Likebook Ares->Palma2
#6
Unicycle Daredevil
Posts: 13,944
Karma: 185432100
Join Date: Jan 2011
Location: Planet of the Pudding Brains
Device: Aura HD (R.I.P. After six years the USB socket died.) tolino shine 3
It is a lot of work, no denying that. But your pdf-fixing efforts sound pretty complicated too, so that's what gave me the idea.
I only now had a look at the book you have in mind. That's huuuge, of course, and seriously a lot of work. BTW, there's a very nice epub edition of Goethe's works in our library, provided by pynch. But I'm not sure if the scientific writings are complete in that one.
#7
Unicycle Daredevil
Posts: 13,944
Karma: 185432100
Join Date: Jan 2011
Location: Planet of the Pudding Brains
Device: Aura HD (R.I.P. After six years the USB socket died.) tolino shine 3
Just had a look at the txt file of the book - a very clean OCR result with surprisingly few errors. Fixing the epub may really be the way to go here.
#8
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
But if you're using it for personal copies, or a pre-processor for more accurate OCR, it's great. The nice thing about it is you can also do page-by-page adjustments, and see how the final output will look. For example, speckle cleanup is fantastic, and you can see the diffs and adjust the strength if necessary.
Quote:
Scan Tailor Advanced combines all the best functionality from all of them, and I believe it's the only one actively maintained.
Quote:
I usually just stick with their:
1. B&W PDF. Usually this is decent. In the case of this specific "yellowed book", it was crap.
2. Color PDF. This matches what they show in their online reader. Helpful if working with color, drawings, or "yellowed books". (You can do your own contrast/color corrections from this, and create a better grayscale/B&W version.)
3. As a last resort, work directly from the JPEG2000 images. These are the highest resolution/quality.
Do not touch their "EPUBs" or any of their other "ebook" formats (they are just automatically run through OCR, no proofing or anything). You're better off working from the source files and recreating your own OCR/ebooks from that. Plus, if you have access to newer tools, you may get even more accurate conversion (according to the metadata, Finereader 8 was used, where Finereader 12+ is probably more accurate).
PS. If you need me to run any images/PDFs (pre-processed or not) through Finereader 12, just let me know.
You can always automate any pre-processing steps with ImageMagick. For example, I was working on a book with scanning artifacts that ran vertically through the text: Detecting/Removing Vertical Scanlines from Scans
So it could be used to clean up the images, then run through further corrections/tools after.
But with ImageMagick... you'll have to spend time figuring out all the commands + recreating fixes that may already exist. For example, Scan Tailor already does a fantastic job of dewarping, detecting and cropping spines+edges-of-pages, [...]. If you go pure commandline ImageMagick... you'll have to figure out all those algorithms on your own. (Plus each book is going to have its own unique challenges.)
Last edited by Tex2002ans; 02-26-2020 at 06:51 PM.
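As a starting point for the kind of ImageMagick pre-processing described above, here is a minimal sketch (not from the thread) that converts a color page to grayscale and stretches the contrast, so that a yellowed background moves toward white and the text toward black. The function names, the DRY_RUN switch, and the 20%,80% levels are assumptions for illustration only; the levels in particular will need tuning for each book.

```shell
#!/bin/sh
# Sketch only: whiten a yellowed page background and darken the text.
# LEVELS is an illustrative starting value, not tuned for any particular scan.
LEVELS="${LEVELS:-20%,80%}"   # black point, white point passed to -level

# ImageMagick 7 provides `magick`; ImageMagick 6 only provides `convert`.
im_cmd() {
    if command -v magick >/dev/null 2>&1; then
        echo magick
    else
        echo convert
    fi
}

# clean_page IN OUT: convert to grayscale, then stretch the contrast.
# With DRY_RUN=1 the command line is printed instead of being executed.
clean_page() {
    set -- "$(im_cmd)" "$1" -colorspace Gray -level "$LEVELS" "$2"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Example: clean every extracted page into an existing clean/ directory.
# for f in pages/*.png; do clean_page "$f" "clean/$(basename "$f")"; done
```

This keeps the output in grayscale. Adding ImageMagick's -threshold option after -level would give pure black and white instead, at the cost of rougher letter edges.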
#9
Running with scissors
Posts: 1,581
Karma: 14328510
Join Date: Nov 2019
Device: none
Quote:
I've also done it using the txt file, and depending on the quality of the scan and the original book, it can be a painful amount of work.
#10
Wizard
Posts: 2,856
Karma: 10700629
Join Date: May 2016
Location: Canada
Device: Onyx Nova
I suggest you try KOReader. It can OCR and reflow pages on the fly, and it also has a contrast setting.
On another note, does anyone know a PDF tool that can OCR text that curves up at the end of a line because the edge of the book page was not flat when scanned?
#11
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Convert the PDF into PNG or TIFF images, run Scan Tailor on them, then go back to PDF.
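On Linux, the two conversion steps around Scan Tailor could be sketched like this. The thread does not name tools for them, so pdftoppm (from poppler-utils) and img2pdf are assumptions, as are the function names, the 300 DPI default, the example paths, and the DRY_RUN switch.

```shell
#!/bin/sh
# Sketch of the round trip: PDF -> PNG pages -> (Scan Tailor, by hand) -> PDF.
# Assumes pdftoppm (poppler-utils) and img2pdf are installed.

# run CMD...: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# pdf_to_png IN.pdf OUTDIR [DPI]: render each page to OUTDIR/page-<number>.png.
pdf_to_png() {
    run mkdir -p "$2" && run pdftoppm -r "${3:-300}" -png "$1" "$2/page"
}

# images_to_pdf OUT.pdf IMAGE...: wrap the images, in the order given, into one PDF.
images_to_pdf() {
    out=$1
    shift
    run img2pdf "$@" -o "$out"
}

# Example (paths are placeholders; use whatever folder Scan Tailor wrote to):
# pdf_to_png book.pdf pages
# ... process pages/ in Scan Tailor ...
# images_to_pdf book_clean.pdf pages/out/*.tif
```

The resulting PDF contains images only. Any text layer from the original PDF is gone, so OCR has to be run again if the text should be searchable or highlightable.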
#12
Unicycle Daredevil
Posts: 13,944
Karma: 185432100
Join Date: Jan 2011
Location: Planet of the Pudding Brains
Device: Aura HD (R.I.P. After six years the USB socket died.) tolino shine 3
Quote:
Quote:
#13
Fuzzball, the purple cat
Posts: 1,296
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
Quote:
k2pdfopt -mode fitwidth -bpc 2 -n- -ls- -ac example1.pdf
If you want to try it on just a few pages first, add something like:
-p 1-40
Example conversion of pages 30-39 is attached.
The only thing is that the file size of the converted PDF will be even bigger because the original is actually very well compressed (fitting 900 bitmapped pages into 30 MB is no small trick: it uses JPEG 2000 JPX compression, whereas k2pdfopt converts it to .png lossless compression, which is not as compact). I used -bpc 2 to get the converted file size down a little.
Last edited by willus; 02-27-2020 at 10:21 PM.
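To make the "try a few pages first" step repeatable, the command above can be wrapped in a small shell function. This is only a sketch: the function name k2_trial and the DRY_RUN switch are made up for this example, and the k2pdfopt options are exactly the ones quoted in the post.

```shell
#!/bin/sh
# Sketch: run k2pdfopt with the options from the post, optionally on a page range.
# k2_trial and DRY_RUN are hypothetical helpers, not part of k2pdfopt.

# k2_trial FILE.pdf [PAGES]   e.g. k2_trial example1.pdf 30-39
k2_trial() {
    file=$1
    pages=${2:-}
    set -- k2pdfopt -mode fitwidth -bpc 2 -n- -ls- -ac
    if [ -n "$pages" ]; then
        set -- "$@" -p "$pages"   # restrict the conversion to these pages
    fi
    set -- "$@" "$file"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"        # show the command without running it
    else
        "$@"
    fi
}
```

Calling k2_trial example1.pdf 30-39 runs the conversion on those pages only, which makes it cheap to check the result on the device before converting all 900 pages with k2_trial example1.pdf.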
#14
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Quote:
I grabbed this book and ran it through Scan Tailor Advanced + Finereader 12.
1. Finereader 12 did a MUCH better job with colored PDF's "yellowed pages", and had no issues creating a B&W version. I attached it as [Finereader][BW]. (You can see how much better 12 converts compared to 8.)
Side Note: I manually erased markings from the first few pages, so they look pure white... just ignore that in your comparisons.
Note: Alternatively, you could've fed color images into Scan Tailor directly (it has 3 different methods to convert to B&W/Grayscale + you can mess with the thresholds).
2. I exported the B&W PDF into PNGs, then ran that through Scan Tailor Advanced. I spent about an hour going through the various stages, and Scan Tailor did a FANTASTIC job at automatically picking all correct boxes. The page edges are nearly all gone. I would say 95%+ I didn't have to touch at all.
Side Note: Despeckling + Outputting has gotten so much faster/better compared to how it used to be. And I only had to use Despeckling on a handful of pages to remove the occasional stray dots. (Being able to see the before/afters marked with red is an enormous help. This is one step where GUI beats the pants off of pure commandline.)
3. I took the Scan Tailor images, and reimported them into Finereader 12, ran OCR, and output as:
PDF = [ScanTailor][Finereader][BW].pdf. (30 MBs is too large to attach, so here's a download.)
EPUB = [Finereader].epub. You can compare the text, and see how much more accurate 12 is compared to Archive.org's "EPUB". (Most importantly, the headers+page numbers are nearly all automatically removed and not clogging the text.)
4. I took Finereader's EPUB and ran it through my usual "Finereader cleanup Regex": Attached it as [Finereader][CodeCleanup].epub.
Comparison Images
Archive.org Color PDF + Finereader B&W + Scan Tailor Cleanup:
Last edited by Tex2002ans; 02-28-2020 at 03:14 AM.
#15
Connoisseur
Posts: 63
Karma: 43710
Join Date: Jun 2008
Device: zaurus->palm->iPad->Sony PRS-T1,T2,T3->KoboForma&Likebook Ares->Palma2
Quote:
And one more question: since I like to highlight things in my PDFs, is the text layer the same as before, or does k2pdfopt do its own OCR?
All the best,
Ctop