MobileRead Forums - View Single Post - Scanning Paperbacks with Yellow Background

kbaerwald · 02-01-2012, 09:16 AM

Taking into account all your good advice, I processed one of my paperbacks with a less yellow cast: a fantasy novel with 526 pages (175,000 words), first published in 1946. For those who hate to destroy books, I have to say that I own a more precious hardcover version which I would NOT handle that way.

The cutting is done in a simple way: the cover is carefully removed and the book is divided into 100-page chunks. Each chunk is fixed on a workbench with screw clamps between two wooden bars. With a sharp knife I cut off approx. 5 mm of the spine. Done.

Looks like these two workflows are usable for me:

Target pdf

- Cutting the paperback and feeding it into the Canon P-150: 25 pages at a time into the feeder. The specs are roughly 10 double-pages per minute at 600 dpi grayscale, which is very fast. A 500-page book is done in 30 min including handling, rescans etc.
- Scanning grayscale with variable brightness; on a scale of 0-254 it is somewhere between 160 and 200, depending on the yellowness of the book. The contrast is 10-20% above average. This gives the OCR software enough room for interpretation and does not cut off too much information. Fortunately the scanner also allows for deskewing and other adjustments. It is always better to do these things at the very beginning of the workflow. The results are 600 dpi grayscale TIFs, roughly adjusted but still with a "jumping" layout.
- The TIFs are then processed in Scan Tailor (excellent programme, I have already donated for it): in the end the text is a centred b&w TIF which can be handled directly by Adobe Acrobat or FineReader.
- Reading quality is acceptable to good on my eReader.
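To illustrate why brightness and contrast matter before the page is reduced to b&w, here is a minimal sketch in Python. It is not what the scanner driver or Scan Tailor actually does: the linear formula, the 0-255 pixel range, the function names and the sample gray values are all assumptions chosen for illustration, and Scan Tailor uses its own, more refined binarization.

```python
def adjust_gray(value, brightness=0, contrast=1.0):
    """Linear brightness/contrast adjustment of one 8-bit gray value.

    Assumed model: contrast scales the distance from mid-gray (128),
    brightness is added afterwards, and the result is clamped to 0..255.
    """
    adjusted = (value - 128) * contrast + 128 + brightness
    return max(0, min(255, round(adjusted)))


def binarize(value, threshold=128):
    """Simple global threshold: white (255) at or above it, else black (0)."""
    return 255 if value >= threshold else 0


# Hypothetical sample values: yellowed paper scans as a light gray,
# printed text as a dark gray.
paper, ink = 200, 60

# 20% more contrast pushes paper and ink further apart,
# which leaves more headroom for the later b&w conversion.
paper_adj = adjust_gray(paper, contrast=1.2)
ink_adj = adjust_gray(ink, contrast=1.2)
print(paper_adj - ink_adj, ">", paper - ink)
print(binarize(paper_adj), binarize(ink_adj))
```

The wider the gap between paper and ink after adjustment, the less sensitive the result is to the exact threshold, which is the "room for interpretation" mentioned in the scanning step.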

The whole process is relatively fast and yields a readable 700-page PDF ebook within a few hours.
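The scanning time quoted in the list can be checked with a little arithmetic. The sketch below assumes that a "double-page" means one sheet scanned on both sides, so a book of N pages is N/2 sheets; the function name and parameters are made up for this example.

```python
import math


def scan_minutes(pages, sheets_per_minute=10):
    """Pure feeder time in minutes, assuming one duplex sheet = two pages."""
    sheets = math.ceil(pages / 2)
    return sheets / sheets_per_minute


# 500 pages -> 250 sheets -> 25 minutes of pure scanning,
# which fits the 30 minutes quoted once handling and rescans are added.
print(scan_minutes(500))
print(scan_minutes(526))
```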

Target epub/mobi

- The first two steps are the same, but the TIFs are then processed directly by FineReader 11.
- The OCR quality of FineReader is really good: I compared it with OmniPage and Tesseract (my favourite for German Fraktur text). The output is RTF with basic formatting (headings, body text with italics).
- I prefer MS Word for further processing: removing hidden (soft) and wrong hyphens, correcting misrecognized characters and stray carriage returns (CRs). Most of the time FineReader detects the chapter structure correctly. Other word processors would be suitable too, but I have used WfW (Word for Windows) for some years.
- The finishing is done with Jutoh (which is described elsewhere in this forum), with epub/mobi output.

This workflow takes somewhat longer because of the OCR: the 700-page example took me 4 hours of finalizing after the scanning. This depends very much on the quality and age of the source book. What comes out is a structured and easily readable eBook without much design work. Just enough for my fiction books.
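The hyphenation and line-break cleanup is done by hand in Word here, but for plain-text OCR output the same idea can be sketched with a few regular expressions in Python. This is only an illustration under assumptions: it expects plain text with "\n" line breaks and blank lines between paragraphs, the function names are invented, and it will wrongly join genuine compounds that happen to break at a line end (e.g. "well-" / "known"), so the result still needs proofreading.

```python
import re

# A lowercase letter, a hyphen, a line break, then another lowercase letter:
# most likely a word that was split at the end of a printed line.
SOFT_BREAK = re.compile(r"(?<=[a-zäöüß])-\n(?=[a-zäöüß])")

# A single line break that is not part of a blank line (paragraph gap).
SINGLE_BREAK = re.compile(r"(?<!\n)\n(?!\n)")


def join_hyphenated(text):
    """Remove end-of-line hyphens and rejoin the two halves of the word."""
    return SOFT_BREAK.sub("", text)


def unwrap_lines(text):
    """Turn line breaks inside a paragraph into spaces; keep blank lines."""
    return SINGLE_BREAK.sub(" ", text)


sample = "The wiz-\nard spoke to\nthe king.\n\nNext paragraph."
print(unwrap_lines(join_hyphenated(sample)))
```

Running the hyphen step before the unwrap step matters: otherwise the split word would end up as "wiz- ard".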

I hope to gain some more insights to share with you while processing the next 300 paperbacks.

Klaus
