I figure it's worth posting an update, since I've probably run over a hundred PDFs through this in the last few weeks. From a class a few years ago, I ended up with around 15 PDFs on the order of 50-130 MB each. These were full-color scans of text at some absurd resolution, with all sorts of scanning artifacts: variously skewed and warped, scanned as facing pages, two columns per page with long marginal glosses in a different typeface where medieval texts were cited, and with my professor's own annoying marginalia. Over the years I've tried running these through ABBYY, Nitro PDF, gscan2pdf, pdfsandwich, and a number of other Linux utilities with the usual back ends. Nothing has done very well, so these PDFs were high on the list of things to jettison the next time I cleaned out my HDD.

That said, with this script they look neat and clean. The OCR is very accurate for the English text, except where there are long passages of italics; not too terrible with the Latin; downright comic with the Greek. I might play around more with the language setting to see if I can improve this, but the results are quite good for my needs. As for size: each is now about 1-3 MB. I'm honestly amazed at the quality.
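For anyone curious: if the OCR step is tesseract (an assumption on my part; I haven't read that far into the script), its man page suggests languages can be stacked, which might help with the mixed Latin and Greek. The page image name here is just an example:

Code:
# Assumes tesseract with the eng, lat, and grc language
# packs installed (tesseract-ocr-lat, tesseract-ocr-grc).
tesseract page-0001.png page-0001 -l eng+lat+grc pdf

If I understand the documentation correctly, the order matters, with the first language weighted most heavily.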
Quote:
Originally Posted by Frenzie
PDFs that were already processed in some manner isn't really something I can think up any heuristics for. In that case I'd regard the script as executable documentation, meaning in this case that you can simply grab the relevant commands from the script while performing some manual cleanup in between.
Being code-illiterate, I unfortunately can't really follow the details of your script. Though I previously thought this was due to botched OCR attempts, I'm now less sure that's the case. I've had maybe six PDFs (scanned by an inter-library loan system) in which the images the script extracts are, for each page of the PDF, a series of blurred images of text, negatives of the same, and blank white pages. I've tried inverting the negatives and then removing the other files in ScanTailor, but those changes seem to be ignored when I continue the script.
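In case it helps to see concretely what I mean, here's roughly what I tried, assuming the extraction works the way pdfimages does (the filenames are only examples):

Code:
# Pull out every embedded image, then invert the pages that
# came out as negatives. mogrify overwrites files in place,
# so I worked on a copy of the directory.
pdfimages -png illscan.pdf page
mogrify -negate page-0042.png page-0043.png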
Quote:
Originally Posted by Frenzie
That's worrisome and really shouldn't happen. Have you been able to ascertain at what step of the process things go wrong?
No, but I suspect it relates to that PDF's inconsistent formatting: it was mostly two facing pages per image, except at the beginnings and ends of chapters, which were single pages. Maybe one in ten PDFs will do something unexpected, and such issues can usually be fixed easily with other utilities, e.g., pages out of order, pages that weren't cropped, or extra blank pages inserted. A couple of times random pages were upside down, which confuses the OCR a bit; sometimes the algorithm that divides double-page PDFs into individual pages will miss a handful of random pages. The only real problem I've had is a few PDFs that came out missing pages but otherwise looked perfect. I haven't been able to figure out why. Because of this, I now double-check that the page counts are as expected, especially with more important documents.
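For the double-checking, I just compare page counts with pdfinfo, keeping in mind that a split facing-page scan should come out with roughly twice the pages of the original:

Code:
# Compare page counts before and after processing.
pdfinfo original.pdf | grep '^Pages:'
pdfinfo cleaned.pdf | grep '^Pages:'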
Quote:
Originally Posted by Frenzie
Not a word I'd use unless it interfered with normal operations.  I doubt my nearly six-year-old Phenom II X4 955 is any faster.
Point taken. =)
Quote:
Originally Posted by Frenzie
That can be adjusted in ScanTailor in the output step.
Do you mean running the script up through the image-extraction step, where it pauses and asks you to verify the ScanTailor file, and then editing that file before continuing? When I've tried that, the script seems to ignore any changes I make. I must be missing something.
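My uneducated guess is that my edits aren't picked up because the output images were already generated by that point. If scantailor-cli is installed, maybe re-running it on the edited project would regenerate them? Entirely untested, and the file names are only examples:

Code:
# Untested guess: re-run ScanTailor's batch processing on
# the edited project so the output images are regenerated.
scantailor-cli edited_project.ScanTailor ./out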
Anyhow, thanks again. I do quite appreciate this script.