Old 10-13-2014, 10:03 AM   #1
Frenzie
Addict
Posts: 304
Karma: 106200
Join Date: Oct 2014
Location: Antwerp
Device: Kobo Aura H2O
ReadablePDF

I wrote a little shell script to help post-process scanned PDFs. I actually wrote it purely for improving usability on a regular computer, but it has the side-effect of making such scans very usable on ereaders as well.

It was written for Linux, but it should probably also work on BSD, Mac, and Windows with Cygwin provided you install and compile some packages.

https://github.com/Frenzie/readablepdf

All of the hard work is done by ScanTailor. This script automates some of the pre- and post-processing you'd need to work with ScanTailor.
Old 10-16-2014, 09:42 AM   #2
willus
Fuzzball, the purple cat
Posts: 948
Karma: 6379999
Join Date: Jun 2011
Location: California
Device: Kindle 2, iPad
Looks interesting. Can you post some examples?
Old 10-16-2014, 04:47 PM   #3
Frenzie
Sure thing. Here's an example of the kind of scanned article I have to deal with and the output after pulling it through the script.* In this case the 600 DPI is actually rather nice; typically it's 300 DPI and sometimes an abysmal 150 or 200 DPI.

How long the (manual) ScanTailor step takes depends on a combination of the quality of the scan, the success of its automatic algorithms and your goals. Getting something significantly more usable (my goal) can usually be done rather quickly; obtaining a proper digital representation of the work (margins, location of content) will obviously require significantly more work.

OCR was not a goal when I wrote this. It's just that it'd be really stupid not to add it in automatically. That being said, the ScanTailor processing makes for some really impressive results.

Not counting the time I spent writing this post, I'd say it takes about a minute or two provided autodetection went well. (Usually articles are a little longer than seven pages.) Note that I mean a minute of my time; your computer will be busy for at least several minutes.

* Like this:
Code:
$ readablepdf -l nld Lievens\ V6903000247_20140402125113015-001.pdf
That being said, OCR results are impressive even if the language is set to the default English.

Last edited by Frenzie; 10-16-2014 at 04:49 PM. Reason: Added parenthetical note about length of articles.
Old 12-30-2014, 10:15 PM   #4
x9kf2r
Junior Member
Posts: 5
Karma: 320
Join Date: Jan 2011
Device: Nook STR
Wow, that looks incredible. I've never had quite such good luck with ScanTailor or, really, any OCR on Linux. Unfortunately, I get lots of errors on Arch... Will have to keep playing with it. Either way, thanks!
Old 01-02-2015, 01:11 PM   #5
Frenzie
Perhaps there should be some kind of flag for DjVu output using e.g. djvubind. I don't think I mentioned it above, but easy sharing of the results is the main reason I bothered with jbig2enc and pdfbeads. Or is there something else you're having trouble with?
Old 01-02-2015, 09:06 PM   #6
x9kf2r
Tesseract doesn't seem to pick up the proper language unless I rename /usr/share/tessdata/eng.traineddata to */osd.traineddata. Putting "export TESSDATA_PREFIX" in .bashrc doesn't appear to have any effect as far as I can tell. I've also tried placing the language argument within your script, feeding it by command line, and in ~/.config/readablepdf.conf, all to the same effect.
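
One way to narrow this down is to ask Tesseract which languages it can actually find. The sketch below is only an illustration: `lang_available` is a hypothetical helper, `nld` is just the language code used earlier in the thread, and it assumes a Tesseract build that supports `--list-langs`. Note also that `export TESSDATA_PREFIX` without a value assigns nothing; the variable has to point at a directory, and which directory is expected has differed between Tesseract versions.

```shell
# Hypothetical helper: succeeds if the language code given as $1 appears
# as a whole line in the `tesseract --list-langs` output read from stdin.
lang_available() {
    grep -qx -- "$1"
}

# Usage (some Tesseract versions print the list on stderr, hence 2>&1):
#   tesseract --list-langs 2>&1 | lang_available nld && echo "nld found"
```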

My current problem, however, is with pdfbeads, and I don't think I follow the error:

Code:
/usr/lib/ruby/2.1.0/rubygems/core_ext/kernel_require.rb:55:in `require': cannot load such file -- iconv (LoadError)
	from /usr/lib/ruby/2.1.0/rubygems/core_ext/kernel_require.rb:55:in `require'
	from /usr/lib/ruby/gems/2.1.0/gems/pdfbeads-1.1.1/bin/pdfbeads:35:in `<top (required)>'
	from /usr/bin/pdfbeads:23:in `load'
	from /usr/bin/pdfbeads:23:in `<main>'
Thanks again. Any help is appreciated.
Old 01-09-2015, 05:32 AM   #7
Frenzie
A simple (sudo) gem install iconv should do the trick. pdfbeads has some minor issues like that, unfortunately. I'd have preferred to investigate them at the source, but Rubyforge closed down last year and there doesn't seem to be a new location for it as far as I've been able to tell with a quick search.
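
To confirm whether the gem is really what's missing, something like the following could be run before and after installing it. This is a rough sketch: `check_iconv` is a made-up helper name, and it assumes the `ruby` on your PATH is the same one pdfbeads runs under, which may not hold on every system.

```shell
# Report whether Ruby can load the iconv library that pdfbeads requires.
# check_iconv is a hypothetical helper name, not part of readablepdf.
check_iconv() {
    if ! command -v ruby >/dev/null 2>&1; then
        echo "ruby: not found"
    elif ruby -e 'require "iconv"' >/dev/null 2>&1; then
        echo "iconv: ok"
    else
        echo "iconv: missing (try: gem install iconv)"
    fi
}

check_iconv
```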

PS Also see what I wrote here.
Old 01-16-2015, 03:15 PM   #8
x9kf2r
Thanks, and sorry for the slow reply (dissertation deadline eclipsed the rest of life for a moment). Once I got the dependency issues cleared away, the results are impressive when it works. Pages come out looking like what one finds on Google Books or on Amazon, in keeping with the examples posted above. Size tends to be reduced to 25% or less of the original. It works best with double- or single-page books that have been scanned via copier. Overall, very nice!

A few things I've noticed:
* Autorotate (or leaving the rotate tag off) seems to have a mind of its own, rotating unnecessarily or too much or too little.
* PDFs that contain any OCR already frequently cause problems. The temp folder fills with image fragments (e.g., thousands of close-ups of copier debris or blurry negatives of pages). I usually kill the process after an hour or so, so I'm not sure what comes out the other end.
* A couple random PDFs were reassembled out of order. Not sure what happened; glad I caught it though.
* ScanTailor hammers my computer (3-year-old i3 with 4 GB RAM).
* Grayscale or color images are converted to b/w.

Thanks again. Great to have this script.
Old 01-23-2015, 05:34 AM   #9
Frenzie
Quote:
Originally Posted by x9kf2r
Thanks, and sorry for the slow reply (dissertation deadline eclipsed the rest of life for a moment).
We all have such things in our lives.

Quote:
* Autorotate (or leaving the rotate tag off) seems to have a mind of its own, rotating unnecessarily or too much or too little.
Leaving the rotate flag off shouldn't do any rotating at all, but of course PDFs can have orientation settings separate from the contained images. That autorotate is a bit flawed is to be expected.

Quote:
* PDFs that contain any OCR already frequently cause problems. The temp folder fills with image fragments (e.g., thousands of close-ups of copier debris or blurry negatives of pages). I usually kill the process after an hour or so, so I'm not sure what comes out the other end.
I can't really think up any heuristics for PDFs that were already processed in some manner. In that case I'd regard the script as executable documentation, meaning that you can simply grab the relevant commands from the script while performing some manual cleanup in between.

Quote:
* A couple random PDFs were reassembled out of order. Not sure what happened; glad I caught it though.
That's worrisome and really shouldn't happen. Have you been able to ascertain at what step of the process things go wrong?

Quote:
* ScanTailor hammers my computer (3-year-old i3 with 4 GB RAM).
Not a word I'd use unless it interfered with normal operations. I doubt my nearly six-year-old Phenom II X4 955 is any faster.

Quote:
* Grayscale or color images are converted to b/w.
That can be adjusted in ScanTailor in the output step.
Old 01-31-2015, 02:34 PM   #10
x9kf2r
An update

I figure it's worth posting an update, since I've probably run over a hundred PDFs through this in the last few weeks. From a class a few years ago, I ended up with around 15 PDFs on the order of 50-130 MB each. These were full-color scans of text at some absurd resolution, with all sorts of scanning artifacts, variously skewed and warped, facing pages, two columns per page with long marginal glosses in a different type when citing medieval texts, and with my professor's own annoying marginalia. Over the years, I've tried running these through ABBYY, Nitro PDF, gscan2pdf, pdfsandwich, and a number of other Linux utilities with the normal back ends. Nothing has done very well, so these PDFs were high on the list of things to jettison when next cleaning out my HDD. That said, with this script, they look neat and clean. The OCR is very accurate for the English text, except where there are long passages of italics; not too terrible with the Latin; downright comic with the Greek. I might try playing around more with the language setting to see if I can improve this, but the results are quite good for my needs. As to size: each is now between about 1 and 3 MB. I'm honestly amazed by the quality.

Quote:
Originally Posted by Frenzie
I can't really think up any heuristics for PDFs that were already processed in some manner. In that case I'd regard the script as executable documentation, meaning that you can simply grab the relevant commands from the script while performing some manual cleanup in between.
Being code illiterate, I unfortunately can't really follow the details of your script. Though I previously thought this was due to botched OCR attempts, I'm now less sure that is the case. I've had maybe six PDFs (scanned by an inter-library loan system) in which the images the script extracts are a series of blurred images of text, negative images of the same, and blank white pages for each page of the PDF. I've tried inverting the negatives and then removing the other files from ScanTailor, but this step seems to be ignored when I continue the script.

Quote:
Originally Posted by Frenzie
That's worrisome and really shouldn't happen. Have you been able to ascertain at what step of the process things go wrong?
No, but I suspect it relates to that PDF's inconsistent formatting, being two facing pages primarily, except for the beginning and ends of chapters, which were just one page. Maybe one in ten PDFs will do something unexpected. Frequently such issues can be easily fixed using other utilities, e.g., the disordering, pages that weren't cropped, extra blank pages inserted. A couple times, random pages were upside down, which confuses the OCR a bit; sometimes the algorithm that divides double-page PDFs into individual pages will miss a handful of random pages. However, the only real problem I've had is that I've had a few PDFs come out missing pages, but otherwise look perfect. I haven't been able to figure out why. Because of this, I now double check that page numbers are as expected, especially with more important documents.
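
The page-number check can be partly automated. A minimal sketch, assuming `pdfinfo` from poppler-utils is installed; `page_count` and `compare_pages` are hypothetical helper names and the file names are placeholders. Scans with two facing pages per sheet are expected to gain pages once split, so a difference is a prompt to look closer, not proof that something went wrong.

```shell
# Print the page count found in `pdfinfo` output read from stdin.
page_count() {
    awk '/^Pages:/ { print $2 }'
}

# Compare the page counts of two PDFs and report both numbers.
compare_pages() {
    before=$(pdfinfo "$1" | page_count)
    after=$(pdfinfo "$2" | page_count)
    if [ "$before" = "$after" ]; then
        echo "same page count: $before"
    else
        echo "page count differs: $before -> $after"
    fi
}

# Usage (placeholder file names):
#   compare_pages original.pdf processed.pdf
```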

Quote:
Originally Posted by Frenzie
Not a word I'd use unless it interfered with normal operations. I doubt my nearly six-year-old Phenom II X4 955 is any faster.
Point taken. =)

Quote:
Originally Posted by Frenzie
That can be adjusted in ScanTailor in the output step.
Do you mean to run the script through the image extraction step where it asks to verify the ScanTailor file, then to edit that file before continuing the script? When I've tried that, the script seems to ignore any changes. I must be missing something.

Anyhow, thanks again. I do quite appreciate this script.
Old 02-06-2015, 05:40 AM   #11
Frenzie
Quote:
Originally Posted by x9kf2r
Do you mean to run the script through the image extraction step where it asks to verify the ScanTailor file, then to edit that file before continuing the script? When I've tried that, the script seems to ignore any changes. I must be missing something.
You need to tell ScanTailor to generate new output images by pressing the play button next to output.

Alternatively I suppose you could insert something like this after the question, which should have the same results (untested). However, given that you can perform a quick visual inspection of the results in ScanTailor I don't think there'd be any advantage to doing so.
Code:
scantailor-cli "${TMP_DIR}/${BASENAME_SAFE}.ScanTailor" "${TMP_DIR}"
Old 11-26-2015, 11:54 AM   #12
idoit
hub
Posts: 714
Karma: 2151032
Join Date: Jan 2012
Location: Iranian in Canada
Device: K3G, DXG, Kobo mini
This looks very interesting to me. Thanks for bringing it to me.
Old 11-27-2015, 02:40 AM   #13
Frenzie
Cheers.
All times are GMT -4.


MobileRead.com is a privately owned, operated and funded community.