Old 04-27-2011, 03:26 AM   #1
Ceryta
Junior Member
Ceryta began at the beginning.
 
 
Posts: 5
Karma: 10
Join Date: Apr 2011
Device: netbook
Do I have to OCR?

I have a newbie question. I am going to start scanning my paperback books soon. I have about 3,000. I don't need to be able to edit the text once I scan it. I did a couple of sample scans and the pages are readable. I am reading on a netbook and not an ereader. So do I have to OCR each of the books? Can I just use the scanned pages as my final product? I have read that the files can be very large if you don't use OCR, but how big is big? The average number of pages for my paperback books is 350-400. I will be scanning everything in black/white. All the pages are just plain text. The only images will be the front and back covers. Thank you for any help.
Old 04-27-2011, 06:31 AM   #2
Iznogood
Guru
Iznogood ought to be getting tired of karma fortunes by now.
 
Posts: 929
Karma: 15752887
Join Date: Mar 2011
Location: Norway
Device: Ipad, kindle paperwhite
Size is one reason for OCRing your books. Reflowable text is another. If you decide in the future to read on a smaller screen, reflowing the text to shorter lines or bigger fonts could be helpful.

I was once in the same situation, and I scanned my books to JPEG images and generated a .cbz file from them. The method is quite simple: add your files to a .zip or .rar archive, which compresses them to a smaller size and gives you only one file per book. Rename .zip to .cbz. If you compressed to .rar, change the extension to .cbr. Now you can read them using a Comic Book Reader program.
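
A minimal sketch of that packing step in Python, using only the standard library. The helper name make_cbz and the folder layout are assumptions for illustration, not something from this thread. It writes the archive with the .cbz extension directly instead of renaming a .zip, and it stores the pages without recompressing them.

```python
import zipfile
from pathlib import Path


def make_cbz(page_dir, cbz_path):
    """Pack every .jpg/.jpeg in page_dir into one .cbz (a plain zip archive)."""
    pages = sorted(
        p for p in Path(page_dir).iterdir()
        if p.suffix.lower() in {".jpg", ".jpeg"}
    )
    # ZIP_STORED adds the files as-is, with no second compression pass.
    with zipfile.ZipFile(cbz_path, "w", compression=zipfile.ZIP_STORED) as archive:
        for page in pages:
            # Pages are added in name order, so zero-padded names
            # (page_001.jpg, page_002.jpg, ...) keep them in sequence.
            archive.write(page, arcname=page.name)
    return len(pages)
```

Calling make_cbz("scans/my_book", "my_book.cbz") would produce one file per book, assuming the scans for that book sit in their own folder.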

Of course, it is also possible to read your image files in an image viewer. It is also possible to convert them to .pdf and read them there. PDF files need some computing power to process and load, and big PDFs are difficult to handle on mobile platforms, but perhaps it could be an option on a netbook?

If the need for OCRing arises in the future, it is possible to use the images as input to an OCR program and get reflowable text out of it. Whether OCR is necessary today, or whether reading from images is "good enough", depends on your screen size, the storage on your netbook and of course your reading preferences, and only you can answer those questions.

Last edited by Iznogood; 04-27-2011 at 06:49 AM.
Old 04-27-2011, 11:48 AM   #3
Jellby
frumious Bandersnatch
Jellby ought to be getting tired of karma fortunes by now.
 
Posts: 6,315
Karma: 4963983
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
Quote:
Originally Posted by norway1456 View Post
I scanned my books to jpeg images and generated a cbz-file of the. The method is quite simple: add your files to a zip- or .rar archive, thus compressing them to smaller size
JPEG images are already compressed. Trying to compress them again is going to give very little gain, and could even result in larger files (the same happens if you try to compress MP3 files or DIVX movies).
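
A rough way to see this, sketched in Python with zlib standing in for the zip step. The sample data is made-up text, not a real JPEG; the assumption is only that, like a JPEG, the output of a first compression pass has little redundancy left for a second pass to remove.

```python
import random
import zlib

# Made-up, redundant sample data (random words from a tiny vocabulary).
rng = random.Random(0)
vocab = ["scan", "page", "book", "text", "image", "file", "reader", "cover"]
raw = " ".join(rng.choice(vocab) for _ in range(20000)).encode("ascii")

once = zlib.compress(raw, 9)    # first pass: removes most of the redundancy
twice = zlib.compress(once, 9)  # second pass: input is already compressed

print("raw bytes:          ", len(raw))
print("compressed once:    ", len(once))
print("compressed twice:   ", len(twice))
```

The second figure should be far below the first, while the third stays close to the second and can even come out slightly larger.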

Quote:
and only get one file pr. book. [...] Now you can read them using a Comic Book Reader program.
These are valid and excellent reasons, however.
Old 04-27-2011, 12:06 PM   #4
Ceryta
Junior Member
Ceryta began at the beginning.
 
 
Posts: 5
Karma: 10
Join Date: Apr 2011
Device: netbook
I was gonna scan to pdf. I have a lot of ebooks already in this format. It works fine when reading on my netbook. The sample scans I took also look good, very clear and readable. So would scanned pdf files be too large? If they are, can I make them smaller without OCR? Thanks
Old 04-27-2011, 12:42 PM   #5
DDHarriman
Guru
DDHarriman can extract oil from cheese
 
Posts: 854
Karma: 1200
Join Date: Feb 2008
Location: Almada, Portugal
Device: Cybook Gen3, Sony PRS 505, Kindle DXG and Samsung Galaxy Note
Hello

My advice:

1 - scan in black and white and test OCR on it.
[Remember that most of the work and time goes into proofreading (and correcting) the OCR result, then redoing all the formatting until you have a document that resembles the original and from which you could create a new book (in any format)];

2 - if your OCR results are good enough that you think you might one day want to do OCR and proofreading from these scans, treat the PDFs you are making now as both your use files and your base files. Use them on your netbook;

3 - if not (or for the books where the black and white scan did not give you enough quality for OCR), scan in gray or color and/or raise the resolution (400 dpi or even 600 dpi) until you get good OCR results - these PDFs are now your base files. From these PDFs make black and white PDF files - these are now your use files; read them on your netbook;

4 - make backup copies of all your base files.
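
To get a feel for what the resolution and color choices in step 3 cost in storage, here is a rough back-of-the-envelope sketch in Python. The page size (4.25 x 7 inches), the 300 dpi baseline and the 375-page book are assumptions chosen for illustration. The figures are raw, uncompressed sizes, so real PDF files will normally be smaller once the scanner software compresses the pages.

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size of one scanned page, in bytes."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


PAGE_W, PAGE_H = 4.25, 7.0  # assumed paperback page size in inches
PAGES = 375                 # assumed length, midpoint of 350-400 pages

for dpi in (300, 400, 600):
    for label, bpp in (("black and white", 1), ("grayscale", 8)):
        book_mb = raw_page_bytes(PAGE_W, PAGE_H, dpi, bpp) * PAGES / 1_000_000
        print(f"{dpi} dpi, {label}: about {book_mb:.0f} MB per book before compression")
```

Two things follow directly from the arithmetic: 8-bit grayscale is eight times the raw size of 1-bit black and white, and doubling the resolution quadruples the pixel count.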

Conclusion:

a) you are making PDF files to read now;
b) you are putting aside (backing up) base files so that in the future, if you want (or if OCR technology improves to the point of producing near-perfect results with almost no human intervention), you can run OCR without repeating the whole scanning process.

Best regards,
Old 05-04-2011, 06:28 AM   #6
DSpider
Evangelist
DSpider ought to be getting tired of karma fortunes by now.
 
Posts: 427
Karma: 326969
Join Date: Nov 2009
Location: Romania
Device: PW2 2014

You don't have to OCR. But if you'd like to search, highlight, reference and so on, it would be ideal. Otherwise you could use Scan Tailor on the scans, pack them in a PDF and you'd be done. But OCR-ing usually results in a much higher quality output - and quality trumps quantity every time.

Pros:
- cleaner text (free from printing flaws)
- lower filesize
- faster rendering and page flipping on portable devices (which are usually slower)
- fully search-able
- highlighting text is possible
- dictionary look-up
- reflow-able text (ePUB, MOBI, etc.)
- body fonts can be replaced if the user wants to
- www and email links are click-able
- footnotes can be added to the end of the document instead of getting in your face
- in-document references (for instance you could simply click "See page 91")
- text-to-speech (for the visually impaired)

...and maybe more.

Cons:
- proof-reading takes time
- layout takes time
- vectorizing the cover takes time (optional)
- font matching takes time (again, optional) - that's if the font is even available. If not, you'd have to edit a similar font which would take even more time (at least until you get the hang of it)

Is it worth it? Oh yeah. Like I said, quality trumps quantity. Always. Especially if it's a good book, it's worth it. It's always more of a pleasure to read a book with smooth text than one with jagged, partial, half characters.


Think about it. Out of those 3000 books, which are the top, say, 30 you'd like to keep? The rest I would probably just archive with Scan Tailor (grayscale), keeping the correct layout, etc. Also, while black and white TIFFs can reduce filesize dramatically (especially in .djvu format), they could prove difficult to OCR in the future, as most OCR software has filters that were tweaked to work better with grayscale images. B&W TIFFs can sometimes remove details that would help OCR differentiate tl from a d, for example.
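
The detail loss from black and white conversion can be sketched in a few lines of Python. The pixel values below are made up, and the fixed threshold of 128 is a simplifying assumption; real scanning software may pick its threshold differently. The point is only that two shapes which differ in grayscale can become identical once every pixel is forced to pure black or pure white.

```python
def binarize(rows, threshold=128):
    """Map 8-bit gray values (0 = black, 255 = white) to pure black or white."""
    return [[0 if value < threshold else 255 for value in row] for row in rows]


# Hypothetical fragments of two letter shapes: two dark strokes with either
# a faint connecting stroke (150) or blank paper (230) between them.
joined = [[30, 150, 30],
          [30, 150, 30]]
separate = [[30, 230, 30],
            [30, 230, 30]]

print("differ in grayscale:     ", joined != separate)
print("identical after B&W scan:", binarize(joined) == binarize(separate))
```

In grayscale the faint stroke is still there for OCR software to use; after thresholding it is gone, which is the argument for keeping grayscale archive copies.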

Last edited by DSpider; 05-04-2011 at 06:45 AM.
Old 05-06-2011, 10:30 AM   #7
srhamm
MAPC grad student
srhamm began at the beginning.
 
 
Posts: 3
Karma: 10
Join Date: Apr 2011
Location: Georgia, USA
Device: Kindle
DSpider, you wrote: - text-to-speech (for the visually impaired)
For me, TTS was a major reason I got my Kindle. I'm not visually impaired, but maybe a little reading-lazy. I enjoy watching the screen change pages as it reads to me in real time. Also, I hook up my Kindle to play over my car speakers while driving--a poor man's audio book of sorts.
Old 05-07-2011, 12:03 PM   #8
DSpider
Evangelist
DSpider ought to be getting tired of karma fortunes by now.
 
Posts: 427
Karma: 326969
Join Date: Nov 2009
Location: Romania
Device: PW2 2014
Well, you do wear glasses.

But yeah, "TTS" implies the book has been through active proof-reading and sometimes dictionary look-up of a few (occasional) words which may or may not be read correctly.