#1
Wizard
Posts: 1,745
Karma: 730681
Join Date: Oct 2014
Location: Antwerp
Device: Kobo Aura H2O
ReadablePDF
I wrote a little shell script to help post-process scanned PDFs. I actually wrote it purely for improving usability on a regular computer, but it has the side-effect of making such scans very usable on ereaders as well.
It was written for Linux, but it should probably also work on BSD, Mac, and Windows with Cygwin, provided you install and compile some packages.
https://github.com/Frenzie/readablepdf
All of the hard work is done by ScanTailor. This script automates some of the pre- and post-processing you'd need to work with ScanTailor.
#2
Fuzzball, the purple cat
Posts: 1,302
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
Looks interesting. Can you post some examples?
#3
Wizard
Sure thing. Here's an example of the kind of scanned article I have to deal with and the output after pulling it through the script.* In this case the 600 DPI is actually rather nice; typically it's 300 DPI and sometimes an abysmal 150 or 200 DPI.
How long the (manual) ScanTailor step takes depends on a combination of the quality of the scan, the success of its automatic algorithms and your goals. Getting something significantly more usable (my goal) can usually be done rather quickly; obtaining a proper digital representation of the work (margins, location of content) will obviously require significantly more work.
OCR was not a goal when I wrote this. It's just that it'd be really stupid not to add it in automatically. That being said, the ScanTailor processing makes for some really impressive results.
Not counting the time I spent writing this post, I'd say it takes about a minute or two provided autodetection went well. (Usually articles are a little longer than seven pages.) Note that I mean a minute of my time; your computer will be busy for at least several minutes.
* Like this:
Code:
$ readablepdf -l nld Lievens\ V6903000247_20140402125113015-001.pdf
Last edited by Frenzie; 10-16-2014 at 03:49 PM. Reason: Added parenthetical note about length of articles.
#4
Junior Member
Posts: 5
Karma: 320
Join Date: Jan 2011
Device: Nook STR
Wow, that looks incredible. I've never had quite such good luck with ScanTailor or, really, any OCR on Linux. Unfortunately, I get lots of errors on Arch... Will have to keep playing with it. Either way, thanks!
#5
Wizard
Perhaps there should be some kind of flag for DjVu output using e.g. djvubind. I don't think I mentioned it above, but easy sharing of the results is the main reason I bothered with jbig2enc and pdfbeads. Or is there something else you're having trouble with?
#6
Junior Member
Tesseract doesn't seem to pick up the proper language unless I rename /usr/share/tessdata/eng.traineddata to */osd.traineddata. Putting "export TESSDATA_PREFIX" in .bashrc doesn't appear to have any effect as far as I can tell. I've also tried placing the language argument within your script, feeding it by command line, and in ~/.config/readablepdf.conf, all to the same effect.
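A quick way to narrow this kind of problem down is to check whether the language file Tesseract needs is actually where it will look. The following is only a minimal sketch: the function name `find_traineddata`, the variable names, and the candidate directories are assumptions for illustration, not part of readablepdf. Note also that a bare `export TESSDATA_PREFIX` only exports a variable that already has a value; it does not set one.

```shell
#!/bin/sh
# find_traineddata LANG DIR...: print the first DIR/LANG.traineddata that exists.
# (Hypothetical helper, not part of readablepdf or tesseract.)
find_traineddata() {
    lang_code=$1
    shift
    for dir in "$@"; do
        if [ -f "$dir/$lang_code.traineddata" ]; then
            printf '%s\n' "$dir/$lang_code.traineddata"
            return 0
        fi
    done
    return 1
}

# The directories below are assumptions; adjust them for your distribution.
# TESSDATA_PREFIX is tried both as the parent of tessdata/ and as the
# tessdata directory itself, since its meaning has differed between versions.
wanted=nld
prefix=${TESSDATA_PREFIX:-/usr/share}
if ! find_traineddata "$wanted" "$prefix/tessdata" "$prefix" /usr/share/tessdata; then
    echo "No $wanted.traineddata found; 'tesseract -l $wanted' will likely fail" >&2
fi
```

If the file turns up in a directory Tesseract isn't using, pointing TESSDATA_PREFIX at it (with an actual value) may be enough. Where your build supports it, `tesseract --list-langs` shows what Tesseract itself can see.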
My current problem, however, is with pdfbeads. Though I don't think I follow the error:
Code:
/usr/lib/ruby/2.1.0/rubygems/core_ext/kernel_require.rb:55:in `require': cannot load such file -- iconv (LoadError)
    from /usr/lib/ruby/2.1.0/rubygems/core_ext/kernel_require.rb:55:in `require'
    from /usr/lib/ruby/gems/2.1.0/gems/pdfbeads-1.1.1/bin/pdfbeads:35:in `<top (required)>'
    from /usr/bin/pdfbeads:23:in `load'
    from /usr/bin/pdfbeads:23:in `<main>'
#7
Wizard
A simple (sudo) gem install iconv should do the trick. pdfbeads has some minor issues like that, unfortunately. I'd have preferred to investigate them at the source, but Rubyforge closed down last year and there doesn't seem to be a new location for it as far as I've been able to tell with a quick search.
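To confirm whether the missing library is really the problem before (or after) installing it, you can ask Ruby directly. This is a rough sketch; the helper name `ruby_can_require` is made up for illustration, and it assumes the `ruby` on your PATH is the same interpreter that pdfbeads runs under, which may not hold on every system.

```shell
#!/bin/sh
# ruby_can_require LIB: succeed only if ruby is installed and can load LIB.
# (Hypothetical helper; returns 2 when no ruby is found on PATH.)
ruby_can_require() {
    command -v ruby >/dev/null 2>&1 || return 2
    # Pass the library name via ARGV rather than interpolating it into code.
    ruby -e 'require ARGV[0]' "$1" >/dev/null 2>&1
}

if ruby_can_require iconv; then
    echo "iconv loads fine"
else
    echo "iconv not loadable; try: gem install iconv" >&2
fi
```

If this reports success but pdfbeads still fails with the same LoadError, the gem was probably installed for a different Ruby than the one pdfbeads uses.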
PS Also see what I wrote here.
#8
Junior Member
Thanks, and sorry for the slow reply (dissertation deadline eclipsed the rest of life for a moment). Once I got the dependency issues cleared away, the results are impressive when it works. Pages come out looking like what one finds on Google Books or on Amazon in keeping with the examples posted above. Size tends to be reduced to 25% or less of original. It works best with double or single page books that have been scanned via copier. Overall, very nice!
A few things I've noticed:
* Autorotate (or leaving the rotate tag off) seems to have a mind of its own, rotating unnecessarily or too much or too little.
* PDFs that contain any OCR already frequently cause problems. The temp folder fills with image fragments (e.g., thousands of close-ups of copier debris or blurry negatives of pages). I usually kill the process after an hour or so, so I'm not sure what comes out the other end.
* A couple random PDFs were reassembled out of order. Not sure what happened; glad I caught it though.
* ScanTailor hammers my computer (3 year i3 with 4 GB ram).
* Grayscale or color images are converted to b/w.
Thanks again. Great to have this script.
#9
Wizard
#10
Junior Member
An update
I figure it's worth posting an update, since I've probably run over a hundred PDFs through this in the last few weeks.
From a class a few years ago, I ended up with around 15 PDFs on the order of 50-130 MB each. These were full-color scans of text at some absurd resolution, with all sorts of scanning artifacts, variously skewed and warped, facing pages, two columns per page with long marginal glosses in a different type when citing medieval texts, and with my professor's own annoying marginalia. Over the years, I've tried running these through ABBYY, Nitro PDF, gscan2pdf, pdfsandwich, and a number of other Linux utilities with the normal back ends. Nothing has done very well, so these PDFs were high on the list of things to jettison when next cleaning out my HDD.
That said, with this script, they look neat and clean. The OCR is very accurate for the English text, except where there are long passages of italics; not too terrible with the Latin; downright comic with the Greek. I might try playing around more with the language setting to see if I can improve this, but the results are quite good for my needs. As to size: each is now between about 1 and 3 MB. I'm honestly amazed by the quality.
Do you mean to run the script through the image extraction step where it asks to verify the ScanTailor file, then to edit that file before continuing the script? When I've tried that, the script seems to ignore any changes. I must be missing something.
Anyhow, thanks again. I do quite appreciate this script.
#11
Wizard
Alternatively, I suppose you could insert something like this after the question, which should have the same results (untested). However, given that you can perform a quick visual inspection of the results in ScanTailor, I don't think there'd be any advantage to doing so.
Code:
scantailor-cli "${TMP_DIR}/${BASENAME_SAFE}.ScanTailor" "${TMP_DIR}"
#12
hub
Posts: 715
Karma: 2151032
Join Date: Jan 2012
Location: Iranian in Canada
Device: K3G, DXG, Kobo mini
This looks very interesting to me. Thanks for bringing it to my attention.
#13
Wizard
Cheers.