Quote:
Originally Posted by ctop
Thank you amazing work. This is now really a pleasure to read on my Ares.
|
No problem.
Quote:
Originally Posted by ctop
My takeaway is that it really pays to invest the time to use Scantailor. Especially the removal of the page headers is great.
|
It's pretty good.
Quote:
Originally Posted by ctop
Did you describe the regexes you are using somewhere?
|
No, not that I remember. (It's probably a good post to stick on the blog. I'll add it to my notes.)
I haven't changed my "Finereader Regex" in years... it's really just taking ~12 of Finereader's inline styles:
Code:
<span style="font-style:italic;">
<span style="font-weight:bold;">
<span style="font-variant:small-caps;">
<span style="font-weight:bold;font-variant:small-caps;">
[...]
and changing to CSS:
Code:
<span class="italics">
<span class="bold">
<span class="smallcaps">
<span class="smallcaps"> (smallcaps + bold is always an OCR/formatting error)
[...]
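As a rough sketch of that kind of search and replace in Python: the actual regexes aren't shown in this post, so the mapping below only covers the four styles listed above, and the names `STYLE_TO_CLASS` and `styles_to_classes` are made up for illustration.

```python
import re

# Hypothetical mapping, limited to the four inline styles listed above.
# The remaining styles are elided in the post and are not guessed here.
STYLE_TO_CLASS = {
    "font-style:italic;": "italics",
    "font-weight:bold;": "bold",
    "font-variant:small-caps;": "smallcaps",
    # smallcaps + bold is treated as an OCR/formatting error, so it
    # collapses to plain smallcaps.
    "font-weight:bold;font-variant:small-caps;": "smallcaps",
}

# Matches an opening span tag whose only attribute is an inline style.
SPAN_STYLE = re.compile(r'<span style="([^"]*)">')


def styles_to_classes(html):
    """Swap known inline styles for class names; leave unknown ones alone."""
    def swap(match):
        css_class = STYLE_TO_CLASS.get(match.group(1))
        if css_class is None:
            return match.group(0)  # unknown style: keep the tag untouched
        return '<span class="{}">'.format(css_class)

    return SPAN_STYLE.sub(swap, html)
```

Unknown styles pass through unchanged, so anything outside the mapping still shows up in the output and can be reviewed by hand.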
Then I just do an extra step to convert to:
I discussed some more of the process in:
2017 "Converting PDF/HTML to ereader formats"
2016 "Delete paragraphs in scanned books (S & R with regexes)"
It gives me a very clean base to work from, and then I can focus on the actual formatting/markup issues.
Quote:
Originally Posted by ctop
Looking forward to your blog:-)
|
Me too, me too.
I need to kick my butt into gear and get that blog up and running.

(That's one of my New Year's resolutions!)
Then I could just say "Here's everything I ever wrote on ImageMagick".
In January, I used it to clean up 3000+ pages of journal articles. There were scanning artifacts along the top/right edges, plus dirt/smudges in the bottom right corners:
So I used ImageMagick to:
1. Crop ### pixels from the edges.
2. Fill ### pixels with white.
But, as I said, every PDF is going to bring unique challenges... a handful of random pages turned out like this:
and required further intervention. (Open the actual image and take a look. The MobileRead thumbnail looks okay, but you'll see the innards are actually multiple overlapping transparent boxes.)
And on other pages, the headers/footers were too close to the edge, so my solution clipped actual text. Without comparing the before/after closely, I would've never known certain text was clipped/missing. Again, this is why I stress that a GUI is helpful.
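The crop and fill steps above can be sketched as a small Python helper that only builds the ImageMagick argument list (it doesn't run anything). The exact commands aren't given in this post, so this is one possible way to do it: the function name, the parameters, and the pixel values are placeholders, and it assumes ImageMagick 7's `magick` executable (on version 6 the first item would be `convert`).

```python
def build_cleanup_command(src, dst, width, height, top, right, corner_w, corner_h):
    """Return an ImageMagick argument list; nothing is executed here.

    width/height are the page size in pixels before cropping.
    top/right are how many pixels to chop from those edges.
    corner_w/corner_h are the size of the bottom-right box to paint white.
    All values are hypothetical and need tuning per scan.
    """
    # Page size after the top and right strips are removed.
    new_w = width - right
    new_h = height - top

    # Bottom-right box, in the coordinates of the cropped page.
    x0 = new_w - corner_w
    y0 = new_h - corner_h
    x1 = new_w - 1
    y1 = new_h - 1

    return [
        "magick", src,
        "-gravity", "North", "-chop", "0x{}".format(top),    # remove top strip
        "-gravity", "East", "-chop", "{}x0".format(right),   # remove right strip
        "+gravity", "+repage",                                # reset gravity/canvas
        "-fill", "white",
        "-draw", "rectangle {},{} {},{}".format(x0, y0, x1, y1),
        dst,
    ]
```

The list can be handed to `subprocess.run(cmd, check=True)`. Writing to a separate output file keeps the original around for the before/after comparison, which is the only way to catch pages where real text sat inside the cropped strip.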
Side Note: Here was another ImageMagick thread from a few months ago:
Converting pdf to png images
where I showed how to remove speckles + crop (the original's white margins were absolutely immense).