#16
Connoisseur
Posts: 63
Karma: 43710
Join Date: Jun 2008
Device: zaurus->palm->iPad->Sony PRS-T1,T2,T3->KoboForma&Likebook Ares->Palma2
Quote:
Looking forward to your blog :-)
All the best, Ctop
#17
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
No, not that I remember (probably a good post to stick on the blog; I'll add it to my notes). I haven't changed my "Finereader Regex" in years... it really just takes ~12 of Finereader's inline styles:
Code:
<span style="font-style:italic;">
<span style="font-weight:bold;">
<span style="font-variant:small-caps;">
<span style="font-weight:bold;font-variant:small-caps;">
[...]
Code:
<span class="italics">
<span class="bold">
<span class="smallcaps">
<span class="smallcaps"> (smallcaps + bold is always an OCR/formatting error)
[...]
Code:
<i>
<b>
2017 "Converting PDF/HTML to ereader formats"
2016 "Delete paragraphs in scanned books (S & R with regexes)"
It gives me a very clean base to work from, and then I can focus on the actual formatting/markup issues.
Me too, me too. I need to kick my butt into gear and get that blog up and running. (That's one of my New Year's resolutions!) Then I could just say "Here's everything I ever wrote on ImageMagick".
In January, I used it to clean up 3000+ pages of journal articles. There were scanning artifacts along the top/right edges, plus dirt/smudges in the bottom right corners. So I used ImageMagick to:
1. Crop ### pixels from the edges.
2. Fill ### pixels with white.
But as I said, every PDF is going to bring unique challenges... a handful of random pages turned out like the attached image and required further intervention. (Open the actual image and take a look. The MobileRead thumbnail looks okay, but you'll see the innards are actually multiple overlapping transparent boxes.)
And on other pages, the headers/footers were too close to the edge, so my solution made text disappear. Without comparing the before/after closely, I would never have known certain text was clipped/missing. Again, that's why I stress that a GUI is helpful.
Side Note: Here was another ImageMagick thread from a few months ago: "Converting pdf to png images", where I showed how to remove speckles + crop (the original's white margins were absolutely immense).
Last edited by Tex2002ans; 02-28-2020 at 06:59 PM.
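For anyone wanting to try the style-to-class cleanup described above, here is a rough Python sketch. It is not the poster's actual "Finereader Regex" (that isn't shown in the thread); the function names and the decision to convert only spans without nested tags are assumptions made for illustration. Only the four style strings and class names quoted in the post are used.

```python
import re

# Style strings and class names as quoted in the post; anything else is left alone.
STYLE_TO_CLASS = {
    "font-style:italic;": "italics",
    "font-weight:bold;": "bold",
    "font-variant:small-caps;": "smallcaps",
    # Per the post, smallcaps + bold is treated as an OCR/formatting error.
    "font-weight:bold;font-variant:small-caps;": "smallcaps",
}

SPAN_STYLE = re.compile(r'<span style="([^"]*)">')

def styles_to_classes(html):
    """Swap known inline-style spans for class-based spans (hypothetical helper)."""
    def swap(match):
        css_class = STYLE_TO_CLASS.get(match.group(1))
        if css_class is None:
            return match.group(0)  # unknown style: keep the original tag
        return '<span class="%s">' % css_class
    return SPAN_STYLE.sub(swap, html)

# Assumption: only spans with no nested tags are converted, so the closing
# </span> that gets rewritten always belongs to the matched opening tag.
SIMPLE_SPAN = re.compile(r'<span class="(italics|bold)">([^<]*)</span>')
CLASS_TO_TAG = {"italics": "i", "bold": "b"}

def classes_to_tags(html):
    """Turn simple italics/bold class spans into <i>/<b> (hypothetical helper)."""
    def swap(match):
        tag = CLASS_TO_TAG[match.group(1)]
        return "<%s>%s</%s>" % (tag, match.group(2), tag)
    return SIMPLE_SPAN.sub(swap, html)

sample = (
    '<p><span style="font-style:italic;">Title</span> and '
    '<span style="font-weight:bold;font-variant:small-caps;">Name</span></p>'
)
print(classes_to_tags(styles_to_classes(sample)))
# prints: <p><i>Title</i> and <span class="smallcaps">Name</span></p>
```

Spans that contain other tags are skipped by the second pass on purpose, since a plain regex can't reliably pair nested closing tags.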
#18
Fuzzball, the purple cat
Posts: 1,312
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
Quote:
By default, k2pdfopt keeps the OCR layer from the source PDF, but it can also do its own OCR.
Last edited by willus; 02-28-2020 at 10:52 PM.
#19
Fuzzball, the purple cat
Posts: 1,312
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
Quote:
k2pdfopt -mode tm -c- -rt 0 -cmax -8 -bpc 2 -n- -dr 2 -ac yellowed.pdf -o processed.pdf
The "-mode tm" handles the one-output-page-per-source-page part. The "-cmax -8" applies a fixed contrast adjustment to each page (the default auto adjustment isn't aggressive enough for this document; you could probably stand to go even higher than 8). The "-dr 2" bumps up the output resolution by a factor of two over what it would normally be for a Kindle Paperwhite. Example input and output attached.
Last edited by willus; 02-29-2020 at 08:11 PM.
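If you want to try a few contrast or resolution values without retyping the line, a small Python wrapper can rebuild the same command. This is only a sketch: the function name and its parameters are made up for illustration, and every flag is copied verbatim from the command above without being interpreted.

```python
import shlex

def build_command(src, dst, cmax=-8, dr=2):
    """Rebuild the k2pdfopt command from the post (hypothetical helper).

    Only the -cmax and -dr values are parameters; all other flags are
    passed through exactly as posted, in the same order.
    """
    return [
        "k2pdfopt", "-mode", "tm", "-c-", "-rt", "0",
        "-cmax", str(cmax), "-bpc", "2", "-n-",
        "-dr", str(dr), "-ac", src, "-o", dst,
    ]

print(shlex.join(build_command("yellowed.pdf", "processed.pdf")))
# prints: k2pdfopt -mode tm -c- -rt 0 -cmax -8 -bpc 2 -n- -dr 2 -ac yellowed.pdf -o processed.pdf
```

The returned list can be handed to `subprocess.run(..., check=True)` if k2pdfopt is on your PATH. Depending on its settings k2pdfopt may wait for interactive input, which this sketch does not handle.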
#20
Fuzzball, the purple cat
Posts: 1,312
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
Quote:
BTW, k2pdfopt can write out a PNG file for each page rather than a PDF:
k2pdfopt -mode tm -c- -rt 0 -cmax -8 -n- -dr 2 -ac yellowed.pdf -o out.png
PS. Gave you (Tex2002ans) a shout-out on my PDF Conversion web page.
Last edited by willus; 03-01-2020 at 01:58 PM.
#21
Grand Sorcerer
Posts: 5,847
Karma: 105494725
Join Date: Apr 2011
Device: pb360
#22
Fuzzball, the purple cat
Posts: 1,312
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
#23
Grand Sorcerer
Posts: 5,847
Karma: 105494725
Join Date: Apr 2011
Device: pb360
Quote:
https://www.mobileread.com/forums/sh...d.php?t=178155
Some scripts for working with the mask images are in the first attachment.
The images (of pages of text) in at least some archive.org PDF files are combinations of 2 PBM images and a PGM image. One of the PBM images is the mask. I discovered this when I ran a utility to extract images from a PDF; I have no idea how the PDF standard addresses this or how PDF libraries and utilities make use of the mask images.
Last edited by j.p.s; 03-01-2020 at 02:38 PM.
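To make the mask idea concrete, here is a toy Python sketch of one common way a 1-bit mask can combine two layers: where the mask bit is set, take the pixel from one layer, otherwise take it from the other. Whether the archive.org files use their mask this way is an assumption, not something confirmed in this thread. The function names, the choice of which layer counts as "foreground", and the requirement that all layers have the same dimensions are all made up for illustration.

```python
def pbm_bits_to_gray(bits):
    """Convert PBM-style bits (1 = black, 0 = white) to 0-255 gray values."""
    return [[0 if bit else 255 for bit in row] for row in bits]

def composite(mask, foreground, background):
    """Pick the foreground pixel where the mask bit is 1, else the background.

    Assumption: all three layers are lists of rows with identical
    dimensions. Real files may store layers at different resolutions,
    which this sketch does not handle.
    """
    if not (len(mask) == len(foreground) == len(background)):
        raise ValueError("layers must have the same number of rows")
    result = []
    for m_row, f_row, b_row in zip(mask, foreground, background):
        if not (len(m_row) == len(f_row) == len(b_row)):
            raise ValueError("layers must have the same row width")
        result.append([f if m else b for m, f, b in zip(m_row, f_row, b_row)])
    return result

mask = [[1, 0], [0, 1]]                # hypothetical 2x2 mask (PBM-like)
fg_bits = [[1, 1], [0, 0]]             # hypothetical bilevel layer (PBM-like)
background = [[200, 210], [220, 230]]  # hypothetical gray layer (PGM-like)

page = composite(mask, pbm_bits_to_gray(fg_bits), background)
print(page)
# prints: [[0, 210], [220, 255]]
```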
#24
Wizard
Posts: 2,874
Karma: 10700629
Join Date: May 2016
Location: Canada
Device: Onyx Nova
#25
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
They've been around for a long time, similar to MobileRead, and focus on a lot of book scanning/cleanup. Tons of good information there.
And here's the "Scan Tailor Advanced" topic where 4lex4 posts.
Last edited by Tex2002ans; 03-05-2020 at 05:07 PM.
#26
Junior Member
Posts: 1
Karma: 10
Join Date: Apr 2025
Device: PW4
Quote:
I tried Scan Tailor Advanced and it is great. Here is the best tutorial I found (from Tefl-Dude) on how to use it and get to the final product; it also uses two more apps, all free. I'll leave it here in case someone needs it:
https://www.youtube.com/watch?v=IM1EqJ3MCII
You can also download the most recent versions from here:
Windows: https://github.com/ScanTailor-Advanc...anced/releases
Mac OS: https://github.com/yb85/scantailor-a...d-osx/releases