02-28-2020, 06:52 AM | #16 | |
Connoisseur
Posts: 57
Karma: 43710
Join Date: Jun 2008
Device: zaurus->palm->iPad->Sony PRS-T1,T2,T3->Kobo Forma&Likebook Ares
|
Quote:
Looking forward to your blog:-) All the best, Ctop |
|
02-28-2020, 05:33 PM | #17 | ||
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Quote:
No, not that I remember (probably a good post to stick on the blog. I'll add it to my notes).

I haven't changed my "Finereader Regex" in years... it's really just taking ~12 of Finereader's inline styles:

Code:
<span style="font-style:italic;">
<span style="font-weight:bold;">
<span style="font-variant:small-caps;">
<span style="font-weight:bold;font-variant:small-caps;">
[...]

Code:
<span class="italics">
<span class="bold">
<span class="smallcaps">
<span class="smallcaps"> (smallcaps + bold is always an OCR/formatting error)
[...]

Code:
<i>
<b>

2017 "Converting PDF/HTML to ereader formats"
2016 "Delete paragraphs in scanned books (S & R with regexes)"

It gives me a very clean base to work from, and then I can focus on the actual formatting/markup issues.

Me too, me too. I need to kick my butt into gear and get that blog up and running. (That's one of my New Year's resolutions!) Then I could just say "Here's everything I ever wrote on ImageMagick".

In January, I used it to clean up 3000+ pages of journal articles. There were scanning artifacts along the top/right edges, plus dirt/smudges in the bottom right corners:

So I used ImageMagick to:
1. Crop ### pixels from the edges.
2. Fill ### pixels with white.

But as I said, every PDF is going to bring unique challenges... a handful of random pages turned out like this:

and required further intervention. (Open the actual image and take a look. The MobileRead thumbnail looks okay, but you'll see the innards are actually multiple overlapping transparent boxes.)

And on other pages, the headers/footers were too close to the edge, so my solution made text disappear. Without comparing the before/after closely, I never would've known certain text was clipped/missing. Again, this is why I stress that a GUI is helpful.

Side Note: Here was another ImageMagick thread from a few months ago: Converting pdf to png images, where I showed how to remove speckles + crop (the original's white margins were absolutely immense).

Last edited by Tex2002ans; 02-28-2020 at 05:59 PM. |
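For anyone who wants to try the "Finereader Regex" idea outside an editor's search-and-replace box, here is a rough Python sketch. It is not the actual regex set from the post (that isn't shown in full); it only covers the four inline styles listed above and collapses the style, class, and tag steps into one pass. The names STYLE_TO_TAG, STYLE_TO_CLASS, and clean_spans are made up for illustration, and the pattern assumes the spans are not nested inside each other.

```python
import re

# Hypothetical mapping for the styles quoted in the post.
# Italic and bold end up as plain tags.
STYLE_TO_TAG = {
    "font-style:italic;": "i",
    "font-weight:bold;": "b",
}

# Small-caps keeps a span, but with a class instead of an inline style.
# Per the post, smallcaps + bold is treated as plain smallcaps.
STYLE_TO_CLASS = {
    "font-variant:small-caps;": "smallcaps",
    "font-weight:bold;font-variant:small-caps;": "smallcaps",
}

# Assumption: spans are NOT nested, so a non-greedy match up to the
# next </span> finds the right closing tag.
SPAN_RE = re.compile(r'<span style="([^"]*)">(.*?)</span>', re.DOTALL)


def clean_spans(html):
    """Replace known inline-style spans; leave unknown styles untouched."""

    def repl(match):
        style, inner = match.group(1), match.group(2)
        if style in STYLE_TO_TAG:
            tag = STYLE_TO_TAG[style]
            return f"<{tag}>{inner}</{tag}>"
        if style in STYLE_TO_CLASS:
            return f'<span class="{STYLE_TO_CLASS[style]}">{inner}</span>'
        return match.group(0)

    return SPAN_RE.sub(repl, html)


sample = (
    '<p><span style="font-style:italic;">Title</span> by '
    '<span style="font-weight:bold;font-variant:small-caps;">Name</span></p>'
)
print(clean_spans(sample))
```

Anything with a style that isn't in the two tables passes through unchanged, so unexpected OCR output still shows up for manual review instead of being silently rewritten.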
||
02-28-2020, 09:48 PM | #18 | |
Fuzzball, the purple cat
Posts: 1,273
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
Quote:
By default, k2pdfopt keeps the OCR layer from the source PDF, but it can also do its own OCR. Last edited by willus; 02-28-2020 at 09:52 PM. |
|
02-29-2020, 07:00 PM | #19 | |
Fuzzball, the purple cat
Posts: 1,273
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
Quote:
k2pdfopt -mode tm -c- -rt 0 -cmax -8 -bpc 2 -n- -dr 2 -ac yellowed.pdf -o processed.pdf

The "-mode tm" does the one-output-page-per-source-page bit. The "-cmax -8" applies a fixed contrast adjustment to each page (the default auto-adjust isn't aggressive enough for this document; you could probably stand to go even higher than 8). The "-dr 2" bumps up the output resolution by a factor of two over what it would normally be for a Kindle Paperwhite. Example input and output attached.

Last edited by willus; 02-29-2020 at 07:11 PM. |
|
02-29-2020, 07:07 PM | #20 | |
Fuzzball, the purple cat
Posts: 1,273
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
Quote:
BTW, k2pdfopt can write out PNG files for each page rather than a PDF:

k2pdfopt -mode tm -c- -rt 0 -cmax -8 -n- -dr 2 -ac yellowed.pdf -o out.png

PS. Gave you (Tex2002ans) a shout-out on my PDF Conversion web page.

Last edited by willus; 03-01-2020 at 12:58 PM. |
|
02-29-2020, 07:19 PM | #21 | |
Grand Sorcerer
Posts: 5,278
Karma: 98804578
Join Date: Apr 2011
Device: pb360
|
Quote:
|
|
03-01-2020, 12:57 PM | #22 |
Fuzzball, the purple cat
Posts: 1,273
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
|
03-01-2020, 01:35 PM | #23 | |
Grand Sorcerer
Posts: 5,278
Karma: 98804578
Join Date: Apr 2011
Device: pb360
|
Quote:
https://www.mobileread.com/forums/sh...d.php?t=178155

Some scripts for working with the mask images are in the first attachment.

The images (of pages of text) in at least some archive.org PDF files are combinations of 2 PBM images and a PGM image. One of the PBM images is the mask. I discovered this when I ran a utility to extract images from a PDF. I have no idea how the PDF standard addresses this, or how PDF libraries and utilities make use of the mask images.

Last edited by j.p.s; 03-01-2020 at 01:38 PM. |
|
03-05-2020, 12:22 PM | #24 |
Wizard
Posts: 2,827
Karma: 10700629
Join Date: May 2016
Location: Canada
Device: Onyx Nova
|
|
03-05-2020, 04:05 PM | #25 | |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
They've been around for a long time, similar to MobileRead, and focus on a lot of book scanning/cleanup. Tons of good information there. And here's the "Scan Tailor Advanced" topic where 4lex4 posts. Last edited by Tex2002ans; 03-05-2020 at 04:07 PM. |
|
|