Old 02-28-2020, 06:52 AM   #16
ctop
Connoisseur
Posts: 57
Karma: 43710
Join Date: Jun 2008
Device: zaurus->palm->iPad->Sony PRS-T1,T2,T3->Kobo Forma&Likebook Ares
Quote:
Originally Posted by Tex2002ans View Post

But if you want to take steps in making the PDF a proper ebook:

I grabbed this book and ran it through Scan Tailor Advanced + Finereader 12.

[...]

You can compare the text, and see how much more accurate 12 is compared to Archive.org's "EPUB". (Most importantly, the headers+page numbers are nearly all automatically removed and not clogging the text.)

4. I took Finereader's EPUB and ran it through my usual "Finereader cleanup Regex":

Attached it as [Finereader][CodeCleanup].epub.
Thank you, amazing work. This is now really a pleasure to read on my Ares. My takeaway is that it really pays to invest the time to use Scan Tailor. The removal of the page headers is especially great. Did you describe the regexes you are using somewhere?

Looking forward to your blog:-)

All the best,

Ctop
Old 02-28-2020, 05:33 PM   #17
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by ctop View Post
Thank you, amazing work. This is now really a pleasure to read on my Ares.
No problem.

Quote:
Originally Posted by ctop View Post
My takeaway is that it really pays to invest the time to use Scan Tailor. The removal of the page headers is especially great.
It's pretty good.

Quote:
Originally Posted by ctop View Post
Did you describe the regexes you are using somewhere?
No, not that I remember. (Probably a good post to stick on the blog. I'll add it to my notes.)

I haven't changed my "Finereader Regex" in years... it's really just taking ~12 of Finereader's inline styles:

Code:
<span style="font-style:italic;">
<span style="font-weight:bold;">
<span style="font-variant:small-caps;">
<span style="font-weight:bold;font-variant:small-caps;">
[...]
and changing them to CSS classes:

Code:
<span class="italics">
<span class="bold">
<span class="smallcaps">
<span class="smallcaps"> (smallcaps + bold is always an OCR/formatting error)
[...]
Then I just do an extra step to convert to:

Code:
<i>
<b>
I discussed some more of the process in:

2017 "Converting PDF/HTML to ereader formats"
2016 "Delete paragraphs in scanned books (S & R with regexes)"

It gives me a very clean base to work from, and then I can focus on the actual formatting/markup issues.
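
For anyone who wants to try the same idea, here is a minimal Python sketch of the two steps described above (inline styles to classes, then classes to plain tags). It is not the actual "Finereader Regex": the function names are made up for illustration, only the four styles shown above are mapped, and the rest of the ~12 styles are left out because they are not listed here.

```python
import re

# Inline style -> class name, following the mapping described in the post.
# Only the four styles shown there are included; the others are elided.
STYLE_TO_CLASS = {
    "font-style:italic;": "italics",
    "font-weight:bold;": "bold",
    "font-variant:small-caps;": "smallcaps",
    # smallcaps + bold is treated as an OCR error, so it collapses to smallcaps.
    "font-weight:bold;font-variant:small-caps;": "smallcaps",
}

# Class -> plain tag for the extra step (hypothetical choice of which classes convert).
CLASS_TO_TAG = {"italics": "i", "bold": "b"}

SPAN_STYLE = re.compile(r'<span style="([^"]*)">')


def styles_to_classes(html: str) -> str:
    """Replace known inline-style spans with class-based spans."""
    def swap(match: re.Match) -> str:
        css_class = STYLE_TO_CLASS.get(match.group(1))
        if css_class is None:
            return match.group(0)  # unknown style: leave it untouched
        return f'<span class="{css_class}">'
    return SPAN_STYLE.sub(swap, html)


def classes_to_tags(html: str) -> str:
    """Turn italics/bold spans into <i>/<b>, innermost spans first."""
    for css_class, tag in CLASS_TO_TAG.items():
        # Only match spans with no other span inside, so the closing tag pairs up.
        pattern = re.compile(
            rf'<span class="{css_class}">((?:(?!</?span\b).)*)</span>', re.DOTALL
        )
        html = pattern.sub(rf"<{tag}>\1</{tag}>", html)
    return html


if __name__ == "__main__":
    sample = '<p><span style="font-style:italic;">Title</span> text</p>'
    print(classes_to_tags(styles_to_classes(sample)))
```

Regexes on HTML are fragile, so a sketch like this is only reasonable because Finereader's output is very uniform. Spans that wrap other spans are skipped rather than guessed at.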

Quote:
Originally Posted by ctop View Post
Looking forward to your blog:-)
Me too, me too.

I need to kick my butt into gear and get that blog up and running. (That's one of my New Year's resolutions!)

Then I could just say "Here's everything I ever wrote on ImageMagick".

In January, I used it to clean up 3000+ pages of journal articles. There were scanning artifacts along the top/right edges, plus dirt/smudges in the bottom right corners:

[Attached image: 2_4_3-11[Orig].png]

So I used ImageMagick to:

1. Crop ### pixels from the edges.
2. Fill ### pixels with white.
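
The two steps above could be sketched roughly like this. This is not the command actually used: the pixel counts are placeholders (the real values are not given above), the function name is made up, and the sketch only builds an ImageMagick argument list without running anything. It assumes the `magick` binary (ImageMagick 7; on version 6 the binary is usually `convert`) and uses `-chop` with `-gravity` to trim the top and right edges, then `-draw` to paint a white square over the bottom right corner. Worth checking against your own ImageMagick version before trusting it.

```python
def build_cleanup_command(src, dst, width, height, top, right, corner,
                          binary="magick"):
    """Return an ImageMagick argument list; nothing is executed here.

    width/height: size of the source page in pixels (caller must supply).
    top/right:    pixels to trim from the top and right edges.
    corner:       side of the white square painted in the bottom right corner.
    """
    new_w = width - right
    new_h = height - top
    if new_w <= 0 or new_h <= 0 or corner > min(new_w, new_h):
        raise ValueError("crop/fill sizes do not fit inside the page")

    # Top-left corner of the white square, in the cropped image's coordinates.
    x0, y0 = new_w - corner, new_h - corner
    return [
        binary, src,
        "-gravity", "North", "-chop", f"0x{top}",    # trim rows from the top
        "-gravity", "East", "-chop", f"{right}x0",   # trim columns from the right
        "+gravity", "+repage",                        # reset gravity and canvas offsets
        "-fill", "white",
        "-draw", f"rectangle {x0},{y0} {new_w - 1},{new_h - 1}",
        dst,
    ]


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    cmd = build_cleanup_command("page.png", "page-crop.png",
                                width=1000, height=1500,
                                top=40, right=30, corner=100)
    print(" ".join(cmd))
```

To actually run it, pass the list to `subprocess.run(cmd, check=True)`. Since fixed pixel counts can clip real text on some pages, it is worth eyeballing before/after images for at least a sample of pages.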

[Attached image: 2_4_3-11-crop.png]

but as I said, every PDF is going to bring unique challenges... a handful of random pages turned out like this:

[Attached image: 2_4_8-7[Orig].png]

and required further intervention. (Open the actual image and take a look. The MobileRead thumbnail looks okay, but you'll see the innards are actually multiple overlapping transparent boxes.)

And on other pages, the headers/footers were too close to the edge, so my solution clipped text. Without comparing the before/after closely, I would never have known certain text was clipped/missing. Again, this is why I stress that a GUI is helpful.

Side Note: Here was another ImageMagick thread from a few months ago:

Converting pdf to png images

where I showed how to remove speckles + crop (the original's white margins were absolutely immense).

Last edited by Tex2002ans; 02-28-2020 at 05:59 PM.
Old 02-28-2020, 09:48 PM   #18
willus
Fuzzball, the purple cat
Posts: 1,273
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
Quote:
Originally Posted by ctop View Post
Wow, this looks really great, exactly what I had in mind! Awesome! One question though, the file you created has the page breaks at different places than the original, which is astonishing. What is the reason for this?

And one more question, since I like to highlight things in my PDFs, is the text layer the same as before, or does k2pdfopt do its own OCR?

All the best,

Ctop
The default behavior of k2pdfopt in "fitwidth" mode is to concatenate pages as it fits them into the converted PDF, and it disregards page breaks in the source document. You can add the -bp option to force a page break in the converted document wherever there is a page break in the source. There are other options that are better if you prefer to have a 1-to-1 source page to converted page correlation. The k2pdfopt options are documented here.

By default, k2pdfopt keeps the OCR layer from the source PDF, but it can also do its own OCR.

Last edited by willus; 02-28-2020 at 09:52 PM.
Old 02-29-2020, 07:00 PM   #19
willus
Fuzzball, the purple cat
Quote:
Originally Posted by willus View Post
... There are other options that are better if you prefer to have a 1-to-1 source page to converted page correlation. The k2pdfopt options are documented here.
I am realizing that in the original post for this thread, there were two different versions of the PDF--the black and white one (which I processed with my original post in this thread), and the yellowed version. To process the yellowed version, I needed to jack up the contrast quite a bit. I've tweaked this set of options to give one output page per input page--simply cropping each input page:

k2pdfopt -mode tm -c- -rt 0 -cmax -8 -bpc 2 -n- -dr 2 -ac yellowed.pdf -o processed.pdf

The "-mode tm" does the one page per page bit. The "-cmax -8" applies a fixed contrast adjustment to each page (the default auto adjust isn't aggressive enough for this document--you could probably stand to go even higher than 8). The "-dr 2" bumps up the output resolution by a factor of two over what it would normally be for a Kindle Paperwhite. Example input and output attached.
Attached Files
File Type: pdf yellowed.pdf (421.1 KB, 329 views)
File Type: pdf processed.pdf (1.11 MB, 325 views)
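
If you want to run the same options over several files, a small wrapper along these lines might help. It is only a sketch: the function name and its parameters are made up for illustration, and it reuses exactly the flags from the command above without adding any others. It builds the argument list and leaves running it to the caller.

```python
def k2pdfopt_trim_args(src, dst, contrast=8, res_factor=2):
    """Build the k2pdfopt command from the post as an argument list.

    contrast:   magnitude of the fixed contrast adjustment (passed as -cmax -N).
    res_factor: value passed to -dr, as used in the post.
    """
    return [
        "k2pdfopt",
        "-mode", "tm",             # one output page per input page (per the post)
        "-c-",
        "-rt", "0",
        "-cmax", f"-{contrast}",   # fixed contrast adjustment on each page
        "-bpc", "2",
        "-n-",
        "-dr", str(res_factor),    # bump up the output resolution
        "-ac",
        src,
        "-o", dst,
    ]


if __name__ == "__main__":
    print(" ".join(k2pdfopt_trim_args("yellowed.pdf", "processed.pdf")))
```

With the defaults this reproduces the command line shown above. Raising `contrast` is the knob to try first on a yellowed scan; pass the list to `subprocess.run` to execute it.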

Last edited by willus; 02-29-2020 at 07:11 PM.
Old 02-29-2020, 07:07 PM   #20
willus
Fuzzball, the purple cat
Quote:
Originally Posted by Tex2002ans View Post
Fantastic work as always. Yes, if you wanted to keep it in PDF form... your tool is always best. :P
Hey, the OP asked for a command-line Linux tool to trim a PDF. Rarely have I had a request so well matched to k2pdfopt!

BTW, k2pdfopt can drop out PNG files for each page rather than a PDF:

k2pdfopt -mode tm -c- -rt 0 -cmax -8 -n- -dr 2 -ac yellowed.pdf -o out.png

PS. Gave you (Tex2002ans) a shout out on my PDF Conversion web page.
[Attached image: out0001.png]

Last edited by willus; 03-01-2020 at 12:58 PM.
Old 02-29-2020, 07:19 PM   #21
j.p.s
Grand Sorcerer
Posts: 5,278
Karma: 98804578
Join Date: Apr 2011
Device: pb360
Quote:
Originally Posted by willus View Post
I am realizing that in the original post for this thread, there were two different versions of the PDF--the black and white one (which I processed with my original post in this thread), and the yellowed version. To process the yellowed version, I needed to jack up the contrast quite a bit. I've tweaked this set of options to give one output page per input page--simply cropping each input page:

k2pdfopt -mode tm -c- -rt 0 -cmax -8 -bpc 2 -n- -dr 2 -ac yellowed.pdf -o processed.pdf

The "-mode tm" does the one page per page bit. The "-cmax -8" applies a fixed contrast adjustment to each page (the default auto adjust isn't aggressive enough for this document--you could probably stand to go even higher than 8). The "-dr 2" bumps up the output resolution by a factor of two over what it would normally be for a Kindle Paperwhite. Example input and output attached.
One form of archive.org book file has a mask image for each page that can be used to make white regions of the page completely white. Have you ever used those to help clean up the page?
Old 03-01-2020, 12:57 PM   #22
willus
Fuzzball, the purple cat
Quote:
Originally Posted by j.p.s View Post
One form of archive.org book file has a mask image for each page that can be used to make white regions of the page completely white. Have you ever used those to help clean up the page?
Interesting. No--I had not heard of this before.
Old 03-01-2020, 01:35 PM   #23
j.p.s
Grand Sorcerer
Quote:
Originally Posted by j.p.s View Post
One form of archive.org book file has a mask image for each page that can be used to make white regions of the page completely white. Have you ever used those to help clean up the page?
Quote:
Originally Posted by willus View Post
Interesting. No--I had not heard of this before.
It's been a couple of years since I've worked with them, so I'm fuzzy on the details. I mentioned it in passing in post #3 in the thread
https://www.mobileread.com/forums/sh...d.php?t=178155
Some scripts working with the mask images are in the first attachment.

The images (of pages of text) in at least some archive.org PDF files are combinations of 2 PBM images and a PGM image. One of the PBM images is the mask. I discovered this when I ran a utility to extract images from a PDF and have no idea how the PDF standard addresses this or how PDF libraries and utilities make use of the mask images.
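
To illustrate the general idea of using a mask to whiten the background, here is a minimal pure-Python sketch. It is not one of the scripts from the attachment mentioned above. It works on plain nested lists rather than real PBM/PGM files, the function name is made up, and the mask polarity (1 = keep the pixel, 0 = background) is an assumption; the actual archive.org masks may well be the other way around.

```python
WHITE = 255  # assumes 8-bit grayscale, where 255 is pure white


def whiten_background(gray, mask, foreground=1):
    """Return a copy of `gray` with every non-foreground pixel set to white.

    gray: rows of grayscale values (0-255), e.g. decoded from a PGM image.
    mask: rows of 0/1 bits of the same size, e.g. decoded from a PBM image.
    foreground: the mask value that marks pixels to keep (an assumption).
    """
    if len(gray) != len(mask) or any(
        len(grow) != len(mrow) for grow, mrow in zip(gray, mask)
    ):
        raise ValueError("gray and mask must have the same dimensions")

    return [
        [pixel if bit == foreground else WHITE for pixel, bit in zip(grow, mrow)]
        for grow, mrow in zip(gray, mask)
    ]


if __name__ == "__main__":
    page = [[200, 30], [240, 10]]   # yellowish background, dark text
    bits = [[0, 1], [0, 1]]         # right column marked as text
    print(whiten_background(page, bits))
```

If the result comes out inverted (text erased, background kept), call it with `foreground=0` instead.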

Last edited by j.p.s; 03-01-2020 at 01:38 PM.
Old 03-05-2020, 12:22 PM   #24
Pajamaman
Wizard
Posts: 2,827
Karma: 10700629
Join Date: May 2016
Location: Canada
Device: Onyx Nova
Quote:
Originally Posted by Tex2002ans View Post
You have to dewarp the images. Scan Tailor Advanced can do that.

Convert the PDF into PNG or TIFF images, run Scan Tailor on them, then go back to PDF.
Gracias

This Scan Tailor looks handy. Surprised I never heard of it before.
Old 03-05-2020, 04:05 PM   #25
Tex2002ans
Wizard
Quote:
Originally Posted by Pajamaman View Post
This Scan Tailor looks handy. Surprised I never heard of it before.
The DIY Book Scanner forum is where a lot of discussion/support happens.

They've been around for a long time, similar to MobileRead, and focus on a lot of book scanning/cleanup. Tons of good information there.

And here's the "Scan Tailor Advanced" topic where 4lex4 posts.

Last edited by Tex2002ans; 03-05-2020 at 04:07 PM.