ReadablePDF

Frenzie · 10-13-2014, 09:03 AM

I wrote a little shell script to help post-process scanned PDFs. I actually wrote it purely for improving usability on a regular computer, but it has the side-effect of making such scans very usable on ereaders as well.

It was written for Linux, but it should probably also work on BSD, Mac, and Windows with Cygwin provided you install and compile some packages.

https://github.com/Frenzie/readablepdf

All of the hard work is done by ScanTailor. This script automates some of the pre- and post-processing you'd need to work with ScanTailor.

willus · 10-16-2014, 08:42 AM

Looks interesting. Can you post some examples?

Frenzie · 10-16-2014, 03:47 PM

Sure thing. Here's an example of the kind of scanned article I have to deal with and the output after pulling it through the script.* In this case the 600 DPI is actually rather nice; typically it's 300 DPI and sometimes an abysmal 150 or 200 DPI.

How long the (manual) ScanTailor step takes depends on a combination of the quality of the scan, the success of its automatic algorithms and your goals. Getting something significantly more usable (my goal) can usually be done rather quickly; obtaining a proper digital representation of the work (margins, location of content) will obviously require significantly more work.

OCR was not a goal when I wrote this. It's just that it'd be really stupid not to add it in automatically. That being said, the ScanTailor processing makes for some really impressive results.

Not counting the time I spent writing this post, I'd say it takes about a minute or two provided autodetection went well. (Usually articles are a little longer than seven pages.) Note that I mean a minute of my time; your computer will be busy for at least several minutes.

* Like this:

Code:

$ readablepdf -l nld Lievens\ V6903000247_20140402125113015-001.pdf

That being said, OCR results are impressive even if the language is set to the default English.
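For several scans in the same language, the invocation above could be wrapped in a small loop. This is only a sketch: batch_readablepdf is a made-up name, it uses nothing beyond the -l flag shown above, and each run will presumably still pause for its own manual ScanTailor step.

```shell
# Hypothetical helper: run readablepdf on every PDF in a directory.
# Usage: batch_readablepdf LANG [DIR]   (DIR defaults to the current directory)
batch_readablepdf() {
    lang=$1
    dir=${2:-.}
    for pdf in "$dir"/*.pdf; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$pdf" ] || continue
        # Quoting "$pdf" keeps file names with spaces in one piece.
        readablepdf -l "$lang" "$pdf"
    done
}
```

Something like batch_readablepdf nld ~/scans would then process each file in turn. The quoting does the same job as the backslash-escaped space in the example above.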

x9kf2r · 12-30-2014, 09:15 PM

Wow, that looks incredible. I've never had quite such good luck with ScanTailor or, really, any OCR on Linux. Unfortunately, I get lots of errors on Arch... Will have to keep playing with it. Either way, thanks!

Frenzie · 01-02-2015, 12:11 PM

Perhaps there should be some kind of flag for DjVu output using e.g. djvubind. I don't think I mentioned it above, but easy sharing of the results is the main reason I bothered with jbig2enc and pdfbeads. Or is there something else you're having trouble with?

x9kf2r · 01-02-2015, 08:06 PM

Tesseract doesn't seem to pick up the proper language unless I rename /usr/share/tessdata/eng.traineddata to */osd.traineddata. Putting "export TESSDATA_PREFIX" in .bashrc doesn't appear to have any effect as far as I can tell. I've also tried placing the language argument within your script, feeding it by command line, and in ~/.config/readablepdf.conf, all to the same effect.
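A quick way to see which data files are actually present is sketched below. The function names are made up for illustration, and the directory is simply the one mentioned above. Note that osd.traineddata is Tesseract's orientation and script detection data rather than a language, so it may be that file which was missing. Also, export TESSDATA_PREFIX only has an effect if the variable is given a value, and whether that value should be the tessdata directory itself or its parent has differed between Tesseract versions, so check the documentation for yours.

```shell
# has_traineddata DIR LANG: succeed if DIR holds LANG.traineddata.
has_traineddata() {
    [ -f "$1/$2.traineddata" ]
}

# report_tessdata DIR LANG...: print one status line per language code.
report_tessdata() {
    dir=$1
    shift
    for code in "$@"; do
        if has_traineddata "$dir" "$code"; then
            echo "$code: found"
        else
            echo "$code: missing"
        fi
    done
}
```

For example, report_tessdata /usr/share/tessdata eng nld osd. Running tesseract --list-langs shows what Tesseract itself can see, which is the more authoritative check.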

My current problem, however, is with pdfbeads. I don't think I follow the error, though:

Code:

/usr/lib/ruby/2.1.0/rubygems/core_ext/kernel_require.rb:55:in `require': cannot load such file -- iconv (LoadError)
	from /usr/lib/ruby/2.1.0/rubygems/core_ext/kernel_require.rb:55:in `require'
	from /usr/lib/ruby/gems/2.1.0/gems/pdfbeads-1.1.1/bin/pdfbeads:35:in `<top (required)>'
	from /usr/bin/pdfbeads:23:in `load'
	from /usr/bin/pdfbeads:23:in `<main>'

Thanks again. Any help is appreciated.

Frenzie · 01-09-2015, 04:32 AM

A simple (sudo) gem install iconv should do the trick. pdfbeads has some minor issues like that, unfortunately. I'd have preferred to investigate them at the source, but Rubyforge closed down last year and there doesn't seem to be a new location for it as far as I've been able to tell with a quick search.

PS Also see what I wrote here.
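To check for this kind of missing library before starting a long run, something along these lines might help. It is a sketch with made-up function names. It assumes the ruby on your PATH is the same one pdfbeads runs under, and it only prints the gem command rather than running it.

```shell
# ruby_can_require LIB: succeed if `require "LIB"` works in the default ruby.
ruby_can_require() {
    ruby -e 'require ARGV[0]' "$1" >/dev/null 2>&1
}

# ensure_ruby_lib LIB: report the library as ready, or print the gem command to try.
ensure_ruby_lib() {
    if ruby_can_require "$1"; then
        echo "$1: ok"
    else
        echo "$1: missing, try: gem install $1"
        return 1
    fi
}
```

Calling ensure_ruby_lib iconv reproduces the check that failed in the traceback above without going through pdfbeads.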

x9kf2r · 01-16-2015, 02:15 PM

Thanks, and sorry for the slow reply (dissertation deadline eclipsed the rest of life for a moment). Once I got the dependency issues cleared away, the results are impressive when it works. Pages come out looking like what one finds on Google Books or on Amazon in keeping with the examples posted above. Size tends to be reduced to 25% or less of original. It works best with double or single page books that have been scanned via copier. Overall, very nice!

A few things I've noticed:
* Autorotate (or leaving the rotate tag off) seems to have a mind of its own, rotating unnecessarily or too much or too little.
* PDFs that contain any OCR already frequently cause problems. The temp folder fills with image fragments (e.g., thousands of close-ups of copier debris or blurry negatives of pages). I usually kill the process after an hour or so, so I'm not sure what comes out the other end.
* A couple random PDFs were reassembled out of order. Not sure what happened; glad I caught it though.
* ScanTailor hammers my computer (3 year i3 with 4 GB ram).
* Grayscale or color images are converted to b/w.

Thanks again. Great to have this script.

Frenzie · 01-23-2015, 04:34 AM

Quote:

Originally Posted by x9kf2r

Thanks, and sorry for the slow reply (dissertation deadline eclipsed the rest of life for a moment).

We all have such things in our lives.

Quote:

* Autorotate (or leaving the rotate tag off) seems to have a mind of its own, rotating unnecessarily or too much or too little.

Leaving the rotate flag off shouldn't do any rotating at all, but of course PDFs can have orientation settings separate from the contained images. That autorotate is a bit flawed is to be expected.

Quote:

* PDFs that contain any OCR already frequently cause problems. The temp folder fills with image fragments (e.g., thousands of close-ups of copier debris or blurry negatives of pages). I usually kill the process after an hour or so, so I'm not sure what comes out the other end.

PDFs that were already processed in some manner aren't really something I can think up any heuristics for. In that case I'd regard the script as executable documentation, meaning in this case that you can simply grab the relevant commands from the script while performing some manual cleanup in between.

Quote:

* A couple random PDFs were reassembled out of order. Not sure what happened; glad I caught it though.

That's worrisome and really shouldn't happen. Have you been able to ascertain at what step of the process things go wrong?

Quote:

* ScanTailor hammers my computer (3 year i3 with 4 GB ram).

Not a word I'd use unless it interfered with normal operations.

I doubt my nearly six-year-old Phenom II X4 955 is any faster.

Quote:

* Grayscale or color images are converted to b/w.

That can be adjusted in ScanTailor in the output step.

x9kf2r · 01-31-2015, 01:34 PM

I figure it's worth posting an update, since I've probably run over a hundred PDFs through this in the last few weeks. From a class a few years ago, I ended up with around 15 PDFs on the order of 50-130 MB each. These were full color scans of text at some absurd resolution, with all sorts of scanning artifacts, variously skewed and warped, facing pages, two columns per page with long marginal glosses in a different type when citing medieval texts, and with my professor's own annoying marginalia. Over the years, I've previously tried running these through ABBYY, Nitro PDF, gscan2pdf, pdfsandwich, and a number of other Linux utilities with the normal back ends. Nothing has done very well, so these PDFs were high on the list of things to jettison when next cleaning out my HDD. That said, with this script, they look neat and clean. The OCR is very accurate for the English text, except where there are long passages of italics; not too terrible with the Latin; downright comic with the Greek. I might try playing around more with the language setting to see if I can improve this, but the results are quite good for my needs. As to size: each is now about 1-3 MB. I'm honestly amazed at the quality.

Quote:

Originally Posted by Frenzie

PDFs that were already processed in some manner aren't really something I can think up any heuristics for. In that case I'd regard the script as executable documentation, meaning in this case that you can simply grab the relevant commands from the script while performing some manual cleanup in between.

Being code illiterate, I unfortunately can't really follow the details of your script. Though I previously thought this was due to botched OCR attempts, I'm now less sure that is the case. I've had maybe six PDFs (scanned by an inter-library loan system) in which the images the script extracts are a series of blurred images of text, negative images of the same, and blank white pages for each page of the PDF. I've tried inverting the negatives and then removing the other files from ScanTailor, but this step seems to be ignored when I continue the script.
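One possible explanation, not confirmed anywhere in this thread, is that such PDFs store each page as several layered images (for example a mask plus a blurry background) rather than as a single scan. The sketch below lists the pages that embed more than one image. It assumes pdfimages from poppler-utils is installed; layered_pages is a made-up name.

```shell
# layered_pages FILE.pdf: print "page count" for every page that embeds
# more than one image, according to `pdfimages -list` (poppler-utils).
layered_pages() {
    pdfimages -list "$1" |
        # Skip the two header lines, then count rows per page number.
        awk 'NR > 2 { n[$1]++ }
             END { for (p in n) if (n[p] > 1) print p, n[p] }' |
        sort -n
}
```

If nearly every page shows up in the output, the fragments in the temp folder are probably those layers, and the file may need different handling than a plain scan.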

Quote:

Originally Posted by Frenzie

That's worrisome and really shouldn't happen. Have you been able to ascertain at what step of the process things go wrong?

No, but I suspect it relates to that PDF's inconsistent formatting, being two facing pages primarily, except for the beginning and ends of chapters, which were just one page. Maybe one in ten PDFs will do something unexpected. Frequently such issues can be easily fixed using other utilities, e.g., the disordering, pages that weren't cropped, extra blank pages inserted. A couple times, random pages were upside down, which confuses the OCR a bit; sometimes the algorithm that divides double-page PDFs into individual pages will miss a handful of random pages. However, the only real problem I've had is that I've had a few PDFs come out missing pages, but otherwise look perfect. I haven't been able to figure out why. Because of this, I now double check that page numbers are as expected, especially with more important documents.
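The page-count check can be scripted roughly as follows. This is a sketch assuming pdfinfo from poppler-utils is installed; the function names and the FACTOR argument are made up for illustration.

```shell
# page_count FILE.pdf: print the number of pages reported by pdfinfo.
page_count() {
    pdfinfo "$1" | awk '/^Pages:/ { print $2 }'
}

# compare_pages ORIGINAL.pdf PROCESSED.pdf [FACTOR]
# FACTOR is how many output pages each scanned page should yield
# (1 by default, 2 for scans of facing pages that get split).
compare_pages() {
    before=$(page_count "$1")
    after=$(page_count "$2")
    expected=$((before * ${3:-1}))
    if [ "$after" -eq "$expected" ]; then
        echo "ok: $after pages"
    else
        echo "check: expected $expected pages, found $after"
        return 1
    fi
}
```

For a scan that mixes facing and single pages no fixed factor will fit, so the expected count still has to be worked out by hand. The check also says nothing about page order.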

Quote:

Originally Posted by Frenzie

Not a word I'd use unless it interfered with normal operations.

I doubt my nearly six-year-old Phenom II X4 955 is any faster.

Point taken. =)

Quote:

Originally Posted by Frenzie

That can be adjusted in ScanTailor in the output step.

Do you mean to run the script through the image extraction step where it asks to verify the ScanTailor file, then to edit that file before continuing the script? When I've tried that, the script seems to ignore any changes. I must be missing something.

Anyhow, thanks again. I do quite appreciate this script.

Frenzie · 02-06-2015, 04:40 AM

Quote:

Originally Posted by x9kf2r

Do you mean to run the script through the image extraction step where it asks to verify the ScanTailor file, then to edit that file before continuing the script? When I've tried that, the script seems to ignore any changes. I must be missing something.

You need to tell ScanTailor to generate new output images by pressing the play button next to output.

Alternatively I suppose you could insert something like this after the question, which should have the same results (untested). However, given that you can perform a quick visual inspection of the results in ScanTailor I don't think there'd be any advantage to doing so.

Code:

scantailor-cli "${TMP_DIR}/${BASENAME_SAFE}.ScanTailor" "${TMP_DIR}"

thatworkshop · 11-26-2015, 10:54 AM

This looks very interesting to me. Thanks for bringing it to my attention.

Frenzie · 11-27-2015, 01:40 AM

Cheers.
