Old 10-13-2014, 10:03 AM   #1
Frenzie
Connoisseur
Posts: 64
Karma: 61370
Join Date: Oct 2014
Location: Antwerp
Device: Kobo Aura H2O
ReadablePDF

I wrote a little shell script to help post-process scanned PDFs. I actually wrote it purely for improving usability on a regular computer, but it has the side-effect of making such scans very usable on ereaders as well.

It was written for Linux, but it should probably also work on BSD, Mac, and Windows with Cygwin provided you install and compile some packages.

https://github.com/Frenzie/readablepdf

All of the hard work is done by ScanTailor. This script automates some of the pre- and post-processing you'd need to work with ScanTailor.
Old 10-16-2014, 09:42 AM   #2
willus
Fuzzball, the purple cat
Posts: 607
Karma: 2556287
Join Date: Jun 2011
Location: California
Device: Kindle 2, iPad
Looks interesting. Can you post some examples?
Old 10-16-2014, 04:47 PM   #3
Frenzie
Sure thing. Here's an example of the kind of scanned article I have to deal with and the output after pulling it through the script.* In this case the 600 DPI is actually rather nice; typically it's 300 DPI and sometimes an abysmal 150 or 200 DPI.

How long the (manual) ScanTailor step takes depends on a combination of the quality of the scan, the success of its automatic algorithms and your goals. Getting something significantly more usable (my goal) can usually be done rather quickly; obtaining a proper digital representation of the work (margins, location of content) will obviously require significantly more work.

OCR was not a goal when I wrote this. It's just that it'd be really stupid not to add it in automatically. That being said, the ScanTailor processing makes for some really impressive results.

Not counting the time I spent writing this post, I'd say it takes about a minute or two provided autodetection went well. (Usually articles are a little longer than seven pages.) Note that I mean a minute of my time; your computer will be busy for at least several minutes.

* Like this:
Code:
$ readablepdf -l nld Lievens\ V6903000247_20140402125113015-001.pdf
That being said, OCR results are impressive even if the language is set to the default English.

Last edited by Frenzie; 10-16-2014 at 04:49 PM. Reason: Added parenthetical note about length of articles.
Old 12-30-2014, 10:15 PM   #4
x9kf2r
Junior Member
Posts: 4
Karma: 320
Join Date: Jan 2011
Device: Nook STR
Wow, that looks incredible. I've never had quite such good luck with ScanTailor or, really, any OCR on Linux. Unfortunately, I get lots of errors on Arch... Will have to keep playing with it. Either way, thanks!
Old 01-02-2015, 01:11 PM   #5
Frenzie
Perhaps there should be some kind of flag for DjVu output using e.g. djvubind. I don't think I mentioned it above, but easy sharing of the results is the main reason I bothered with jbig2enc and pdfbeads. Or is there something else you're having trouble with?
Old 01-02-2015, 09:06 PM   #6
x9kf2r
Tesseract doesn't seem to pick up the proper language unless I rename /usr/share/tessdata/eng.traineddata to */osd.traineddata. Putting "export TESSDATA_PREFIX" in .bashrc doesn't appear to have any effect as far as I can tell. I've also tried placing the language argument within your script, feeding it by command line, and in ~/.config/readablepdf.conf, all to the same effect.

My current problem, however, is with pdfbeads, though I don't think I follow the error:

Code:
/usr/lib/ruby/2.1.0/rubygems/core_ext/kernel_require.rb:55:in `require': cannot load such file -- iconv (LoadError)
	from /usr/lib/ruby/2.1.0/rubygems/core_ext/kernel_require.rb:55:in `require'
	from /usr/lib/ruby/gems/2.1.0/gems/pdfbeads-1.1.1/bin/pdfbeads:35:in `<top (required)>'
	from /usr/bin/pdfbeads:23:in `load'
	from /usr/bin/pdfbeads:23:in `<main>'
Thanks again. Any help is appreciated.
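
For anyone running into the same language problem, here is a minimal sketch for checking where a language's traineddata file actually lives. The function name find_traineddata and the candidate directories in the example are assumptions for illustration; they are not part of readablepdf or Tesseract. Note that a bare "export TESSDATA_PREFIX" only exports a variable that already has a value, so the variable needs a path assigned to it, and depending on the Tesseract version that path has to be either the tessdata directory itself or its parent.

```shell
#!/bin/sh
# Hypothetical helper (not part of readablepdf or Tesseract):
# print the first directory that contains <lang>.traineddata.
# Usage: find_traineddata LANG DIR...
find_traineddata() {
    lang=$1
    shift
    for dir in "$@"; do
        if [ -f "$dir/$lang.traineddata" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Example: look for the Dutch data in a few common locations (assumed paths).
# find_traineddata nld /usr/share/tessdata /usr/share/tesseract-ocr/tessdata \
#     || echo "nld.traineddata not found" >&2
#
# Tesseract can also report which languages it sees:
# tesseract --list-langs
```

If the file turns up in a directory Tesseract is not searching, pointing TESSDATA_PREFIX at that location (or its parent) is probably a cleaner fix than renaming eng.traineddata.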
Old 01-09-2015, 05:32 AM   #7
Frenzie
A simple (sudo) gem install iconv should do the trick. pdfbeads has some minor issues like that, unfortunately. I'd have preferred to investigate them at the source, but RubyForge closed down last year and there doesn't seem to be a new location for it as far as I've been able to tell with a quick search.

PS Also see what I wrote here.
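
Some background on the error: iconv was dropped from Ruby's standard library in Ruby 2.0, which is why a Ruby 2.1 install cannot load it until the gem is added. A minimal sketch for checking this before running pdfbeads follows. The function name ruby_has_lib and the RUBY override variable are hypothetical, introduced only for this example.

```shell
#!/bin/sh
# Hypothetical helper: succeed if Ruby can load the named library.
# RUBY may be set to point at a specific interpreter (defaults to "ruby").
ruby_has_lib() {
    "${RUBY:-ruby}" -e 'require ARGV[0]' "$1" >/dev/null 2>&1
}

# Example use before calling pdfbeads:
# if ! ruby_has_lib iconv; then
#     echo "iconv is missing; try: gem install iconv" >&2
# fi
```

The check only reports the problem; installing the gem is left to the user because it may need sudo.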
Old 01-16-2015, 03:15 PM   #8
x9kf2r
Thanks, and sorry for the slow reply (dissertation deadline eclipsed the rest of life for a moment). Once I got the dependency issues cleared away, the results are impressive when it works. Pages come out looking like what one finds on Google Books or on Amazon in keeping with the examples posted above. Size tends to be reduced to 25% or less of original. It works best with double or single page books that have been scanned via copier. Overall, very nice!

A few things I've noticed:
* Autorotate (or leaving the rotate tag off) seems to have a mind of its own, rotating unnecessarily or too much or too little.
* PDFs that contain any OCR already frequently cause problems. The temp folder fills with image fragments (e.g., thousands of close-ups of copier debris or blurry negatives of pages). I usually kill the process after an hour or so, so I'm not sure what comes out the other end.
* A couple random PDFs were reassembled out of order. Not sure what happened; glad I caught it though.
* ScanTailor hammers my computer (3 year i3 with 4 GB ram).
* Grayscale or color images are converted to b/w.

Thanks again. Great to have this script.
Old 01-23-2015, 05:34 AM   #9
Frenzie
Quote:
Originally Posted by x9kf2r
Thanks, and sorry for the slow reply (dissertation deadline eclipsed the rest of life for a moment).
We all have such things in our lives.

Quote:
* Autorotate (or leaving the rotate tag off) seems to have a mind of its own, rotating unnecessarily or too much or too little.
Leaving the rotate flag off shouldn't do any rotating at all, but of course PDFs can have orientation settings separate from the contained images. That autorotate is a bit flawed is to be expected.
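
To see whether a PDF carries such an orientation setting, poppler's pdfinfo prints a "Page rot" field. A minimal sketch, assuming poppler-utils is installed; the function name page_rotation and the file name input.pdf are placeholders and not part of readablepdf:

```shell
#!/bin/sh
# Hypothetical helper: read `pdfinfo` output on stdin and print the
# page rotation value(s) in degrees.
page_rotation() {
    awk -F': *' '/^Page.*rot:/ { print $2 }'
}

# Example (requires poppler-utils; input.pdf is a placeholder name):
# pdfinfo input.pdf | page_rotation
```

A non-zero value here would suggest that the rotation seen in a viewer comes from the PDF's page settings and not from the scanned images themselves.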

Quote:
* PDFs that contain any OCR already frequently cause problems. The temp folder fills with image fragments (e.g., thousands of close-ups of copier debris or blurry negatives of pages). I usually kill the process after an hour or so, so I'm not sure what comes out the other end.
PDFs that were already processed in some manner aren't really something I can think up any heuristics for. In that case I'd regard the script as executable documentation, meaning in this case that you can simply grab the relevant commands from the script while performing some manual cleanup in between.
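
As a rough outline of what such a manual run might look like, the sketch below prints a sequence of commands without executing them. It is not taken from readablepdf; the function name print_manual_steps and all file and directory names are placeholders, so check the script itself for the commands and options it really uses.

```shell
#!/bin/sh
# Sketch only: prints the commands for a manual run instead of executing them.
# Not taken from readablepdf; all file and directory names are placeholders.
# Usage: print_manual_steps FILE.pdf [LANG]
print_manual_steps() {
    pdf=$1
    lang=${2:-eng}
    cat <<EOF
pdfimages -tiff "$pdf" work/page
# ...clean up work/ by hand, then process the images in the ScanTailor GUI...
for f in work/out/*.tif; do tesseract "\$f" "\${f%.tif}" -l $lang hocr; done
(cd work/out && pdfbeads *.tif > ../../out.pdf)
EOF
}

# Example:
# print_manual_steps scan.pdf nld
```

The manual cleanup goes between the extraction and ScanTailor steps, for instance deleting the stray image fragments. pdfbeads matches hOCR files to images by base name, so it is worth checking which file extension your Tesseract version writes and which one your pdfbeads version expects.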

Quote:
* A couple random PDFs were reassembled out of order. Not sure what happened; glad I caught it though.
That's worrisome and really shouldn't happen. Have you been able to ascertain at what step of the process things go wrong?

Quote:
* ScanTailor hammers my computer (3 year i3 with 4 GB ram).
Not a word I'd use unless it interfered with normal operations. I doubt my nearly six-year-old Phenom II X4 955 is any faster.

Quote:
* Grayscale or color images are converted to b/w.
That can be adjusted in ScanTailor in the output step.
All times are GMT -4.