MobileRead Forums - View Single Post

tomsem · 01-20-2023, 06:22 PM

I recently got a CZUR ET18 Pro overhead scanner:

https://www.amazon.com/gp/product/B07JMTPJ8S/

Mostly I wanted to be able to scan my favorite sheet music and music instructional material so I could use it on my iPad and Mac. I was mostly able to get decent enough results with the included scanning software. It can handle the largest page sizes of the music I have, 2 pages at a time. There's no need to OCR it and make a fully navigable PDF out of it (I have an iPad app that can, to some degree, interpret sheet music and export MusicXML and MIDI).

A lot of what I've done would have been easier with external tools, and going forward with future music material, I'll be applying what I've learned. The available cropping and editing tools are rudimentary.

I have a few books lying around which are out of print or not available as ebooks, and I'd like to read them in digital form. This is promoted as a fast way to scan these, so I've been trying that out, and integrating Photoshop and Acrobat into the toolchain. I'm in the middle of finishing with the first one, about 400 pages long, plus about 48 photos mixed in.

The software takes unprocessed images, then applies rules according to the option you've specified: Single Page, Facing Pages, etc. Facing Pages will try to find the center 'seam', remove page curl, and generate an image for each page. One scan takes about a second, and you can use a foot switch, a button on your desk, the scanning software, or Auto mode to trigger capture (it has a microphone, so I think you might be able to use voice too).

Before moving to the next steps, it's prudent to review each page to make sure 1) you didn't skip pages and 2) you got a good capture of each one. Some errors can be corrected without re-scanning (e.g. failure to find the 'seam', or you need to apply greyscale or color rules rather than B&W; it generates new images from the unprocessed ones). Then you can easily insert missing pages or replace the bad captures.

In this case the photos were from the turn of the 20th century and had poor exposure (or chemical aging of the source material), and the swaths of black reflected the downward-pointing LED lights. There's another set of LEDs lower down that illuminates from a different angle without causing reflections in the image, so I turn those on before shooting pages like these.

Even with the best technique there are significant deviations in the dimensions, positioning, and rotation of the page in the image. The cropping algorithm in the Facing Pages case is appropriately conservative, though it might be good if you could make it a little less so. So each individual page has to be repositioned and cropped; some need minor rotation adjustments.

For small jobs you can get by with the built-in tools, but they're not productive for larger ones.

Hence Photoshop. There is a Script called Load Files into Stack... This places a sequence of images into a sequence of layers. Setting the target Canvas size lets me drag the images around so they're more uniformly positioned; guide lines help me position common page elements like headers and margins, and adjust rotations to make things straighter. Having only one layer visible at a time lets me work through each page and get it ready for bundling in a PDF (the book I'm working on has a lot of paragraph styles, footnotes and hand-drawn illustrations, so it would be more work to produce an ePub than to just replicate the book).

When done, I run Export Layers to Files, load the results into Acrobat, and take another pass through all of the pages. Acrobat tries (with varying degrees of success) to separate images and text, and to straighten text blocks. Blemishes can become objects that you can just delete.

Finally, generate page labels (roman, numeric, and other styles can be mixed) and create page links for the TOC and index.
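For anyone who wants to work the labels out ahead of time, here is a minimal Python sketch of mixed page labelling. It only computes the label strings; it does not touch Acrobat or the PDF. The function names, the `(start_index, style, first_number)` tuple layout, and the style names are all made up for illustration.

```python
def to_roman(n):
    """Convert a positive integer to a lowercase roman numeral."""
    if n <= 0:
        raise ValueError("roman numerals need a positive integer")
    numerals = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ]
    parts = []
    for value, symbol in numerals:
        count, n = divmod(n, value)
        parts.append(symbol * count)
    return "".join(parts)


def page_labels(ranges, total_pages):
    """Return one label per physical page of the PDF.

    ranges is a list of (start_index, style, first_number) tuples sorted by
    start_index, where start_index is the 0-based physical page at which a
    numbering style begins. style is "roman", "arabic", or "none" (for
    unlabelled pages such as a cover). This layout is an assumption of the
    sketch, not something Acrobat defines.
    """
    if not ranges or ranges[0][0] != 0:
        raise ValueError("the first range must start at physical page 0")
    labels = []
    for i, (start, style, first) in enumerate(ranges):
        # A range runs until the next one starts, or to the end of the book.
        end = ranges[i + 1][0] if i + 1 < len(ranges) else total_pages
        for offset in range(end - start):
            number = first + offset
            if style == "roman":
                labels.append(to_roman(number))
            elif style == "arabic":
                labels.append(str(number))
            else:
                labels.append("")
    return labels


# Example: 2 unlabelled pages, 4 roman front-matter pages, then the body.
print(page_labels([(0, "none", 0), (2, "roman", 1), (6, "arabic", 1)], 10))
```

The resulting list has one entry per physical page, so it can double as a lookup table from a printed page number to its position in the file.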

Ta Da! No sweat! (er, actually a lot of sweat)

Even without any change to the toolchain, I figure it will take me maybe 25% of the time for the next book project.

I'll probably create a macro to move through the list of layers and toggle visibility as it does so.

And since I can imagine writing a script that does 99% of the PDF page links I'd otherwise have to do manually, I'll be looking for one, or trying to write one.
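As a starting point, the lookup half of such a script might look like the Python sketch below. It assumes the TOC has already been typed or OCR'd into plain text lines ending in a printed page number, and that there is a list holding the printed label of every physical page. It only works out which physical page each entry should point to; actually writing the links into the PDF would need a PDF library and is left out. The regular expression and the names `TOC_LINE` and `link_targets` are made up for illustration.

```python
import re

# Heuristic: a title, then spaces or dot leaders, then an arabic or roman
# page label at the end of the line. A title-only line whose last word
# happens to consist of roman-numeral letters (e.g. "mix") would fool it.
TOC_LINE = re.compile(
    r"^(?P<title>.+?)[\s.]+(?P<label>\d+|[ivxlcdm]+)\s*$", re.IGNORECASE
)


def link_targets(toc_lines, labels):
    """Map TOC entries to 0-based physical page indices.

    labels holds the printed label of each physical page ("" if none).
    Returns (title, label, page_index) tuples; page_index is None when the
    label is not found, so those entries can be fixed by hand.
    """
    index_of = {}
    for i, label in enumerate(labels):
        if label and label not in index_of:
            index_of[label] = i  # keep the first page carrying this label

    targets = []
    for line in toc_lines:
        match = TOC_LINE.match(line.strip())
        if match is None:
            continue  # not a TOC entry; skip it
        label = match.group("label").lower()
        targets.append((match.group("title"), label, index_of.get(label)))
    return targets


labels = ["", "", "i", "ii", "iii", "iv", "1", "2", "3", "4"]
toc = ["Preface .... iii", "Chapter One   1", "World War II ..... 4", "Contents"]
for entry in link_targets(toc, labels):
    print(entry)
```

Entries that come back with `None` are the remaining few that would still need a manual link.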

And there has to be a better PDF tool than Acrobat.

It's very inconsistent about object identification. It sometimes takes a contiguous illustration and leaves parts of it on a page-wide object, so you can't freely reposition it. Sometimes the object consists of a union of all the elements on the page, lumping header/body/footer together so you can't reposition those independently.

Every time I want to create a page from a JPG I have to change the default format from PDF.

When I Replace a page, there seems to be no way to trigger its object recognition so I can make adjustments, short of OCRing the entire document.

And I don't yet know how its OCR compares with other products, including ABBYY, which the scanning software includes. If the target is ePub (or even fixed layout ePub) then Acrobat need not apply.

I'm not quite as ready to throw Photoshop under the bus. Importing and exporting layers like this is pretty slow for some reason, but at least it works.

Looking ahead, I'm also planning a screen capture based tool chain, which should be more amenable to scripting.

Book Scanning tool chains
Last edited by tomsem; 01-20-2023 at 08:04 PM.