Old 11-04-2009, 10:53 AM   #16
kennyc
The Dank Side of the Moon
Posts: 35,872
Karma: 118716293
Join Date: Sep 2009
Location: Denver, CO
Device: Kindle2; Kindle Fire
Quote:
Originally Posted by Jim Thompson View Post
Been encountering frustrations and am sharing them for the sake of anyone else as uninformed as me:

I tried purchasing a lit file (Microsoft Reader) because Calibre's documentation indicates that it is their easiest file to translate. After I paid for the lit file, I was informed that I would need to install Microsoft Publisher on my computer before it would download. Halfway through the Microsoft Publisher install, I was informed that I needed to create a Microsoft Passport account to activate the software. After doing all of that (and approving installation of ActiveX controls I don't understand or want), the lit file still would not download, and perusing the help screens indicates that Microsoft Publisher only works on a hand-held device, i.e., not on a desktop computer. I contacted tech support at the eBook publisher to ask if I would have more luck with MOBI or EPUB. That's where things stand currently. I'll update again when I learn more.
Sounds to me like your file type handling is messed up. You should be able to download and "Save to Disk" instead of opening it with an application.

Where are you purchasing/downloading from? Is the file DRM protected or open?

This is probably the wrong thread for this discussion in any case. You might want to ask in the Calibre or some other subforum.
Old 11-04-2009, 11:35 AM   #17
AnemicOak
Bookaholic
Posts: 14,391
Karma: 54969924
Join Date: Oct 2007
Location: Minnesota
Device: iPad Mini 4, AuraHD, iPhone XR +
Quote:
Originally Posted by Jim Thompson View Post
Been encountering frustrations and am sharing them for the sake of anyone else as uninformed as me:

I tried purchasing a lit file (Microsoft Reader) because Calibre's documentation indicates that it is their easiest file to translate. After I paid for the lit file, I was informed that I would need to install Microsoft Publisher on my computer before it would download. Halfway through the Microsoft Publisher install, I was informed that I needed to create a Microsoft Passport account to activate the software. After doing all of that (and approving installation of ActiveX controls I don't understand or want), the lit file still would not download, and perusing the help screens indicates that Microsoft Publisher only works on a hand-held device, i.e., not on a desktop computer. I contacted tech support at the eBook publisher to ask if I would have more luck with MOBI or EPUB. That's where things stand currently. I'll update again when I learn more.
The software needed is Microsoft Reader (only works on Windows), not Publisher (which is part of MS Office)...
http://www.microsoft.com/reader/downloads/pc.aspx

It will need to be activated and book downloads will need to be done with Internet Explorer (assuming this is a DRM'd book). For Calibre to work with the book the DRM will need to be removed first (ConvertLit).


Going with Mobi or ePub will still require removing the DRM, and each has its own software to deal with. Adobe Digital Editions (ePub) will need an Adobe ID to authorize it.
Old 11-04-2009, 11:55 AM   #18
kennyc
The Dank Side of the Moon
Posts: 35,872
Karma: 118716293
Join Date: Sep 2009
Location: Denver, CO
Device: Kindle2; Kindle Fire
I was confused; I'm still trying to get all the various formats into my mind... and it's no easy chore to get things in there... I was thinking .lit was a Sony format, but that is LRX/LRF, I guess. So it really is associated with MS.
Old 11-04-2009, 12:10 PM   #19
Jim Thompson
Member
Posts: 16
Karma: 10
Join Date: Nov 2009
Device: iPhone
Sorry. Didn't mean to drift this thread away from scanning/OCR. Just wanted to post a warning for anyone heading down my path. (And sorry about saying Publisher instead of Reader).

Thanks for other insights too, but kennyc is right, this thread should just be about scanning/OCR. When I post an update of how it worked out, I will create a new thread under Workshop that will be entitled "purchasing eBooks with intent to convert". I'm new to this forum, so if that is an inappropriate topic on this forum, let me know and I'll just drop it. My goal in posting there would not be to get new information, but just to answer any questions someone might have on that topic after reading earlier posts in this thread.

Back to scanning/OCR.
Old 11-04-2009, 12:21 PM   #20
Elfwreck
Grand Sorcerer
Posts: 5,185
Karma: 25133758
Join Date: Nov 2008
Location: SF Bay Area, California, USA
Device: Pocketbook Touch HD3 (Past: Kobo Mini, PEZ, PRS-505, Clié)
Quote:
Originally Posted by Rootman View Post
I hesitate to add that converting a physical book to etext may also violate the publisher's and author's copyrights. I do not believe that the "fair use" doctrine applies to text scanned from a copyrighted source.

Not knowing what country you are in and what texts you want, of course, YMMV.
In the US, format-shifting for personal use is almost certainly fair use. And I only say "almost" because there are no court cases to base that on, because nobody's been stupid enough to try to sue their customers for it. Nobody's been sued for making cassette tapes of their own albums for their own use, either. Nor for reading a book aloud and recording it for their child.

Format-shifting for research purposes is somewhat *more* acceptable; that moves the purpose of copying into the "transformative" realm. Note that Google's cache function was ruled to be non-infringing.

Trying to keep from hijacking this into yet another copyright debate... hmm.

Jim should keep in mind that no OCR software is perfect, especially with books with nonstandard language. (Legal, medical, scientific, religious, sf/fantasy, etc.) With an autofeed scanner, the OCR checking becomes the most time-consuming part of conversion. Even with a flatbed scanner, the OCR checking is the most *annoying* part of conversion, because you can scan while talking or listening to music.
Old 11-04-2009, 08:59 PM   #21
Jim Thompson
Member
Posts: 16
Karma: 10
Join Date: Nov 2009
Device: iPhone
Quote:
Originally Posted by Elfwreck View Post
the OCR checking becomes the most time-consuming part of conversion
Any other thoughts on the best OCR software/scanner combination? I had been thinking a reliable ADF might be the most challenging part, but I now suspect reliable OCR is key. Which has the fewest errors and makes correcting errors easiest? For example, if the OCR doesn't know if a word is "fire" or "flre", it might be nice if it could guess that it's "fire" by use of a dictionary. Any suggestions on ways to reduce the "annoyance" factor?
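
For anyone curious what the "fire" vs. "flre" guess might look like outside a commercial OCR package, here is a minimal Python sketch. It is not how FineReader or OmniPage work internally; the confusion table, the tiny word list, and the function names are all made up for illustration. The idea is to swap characters that OCR commonly confuses (l, i, 1 and o, 0) and accept a correction only when exactly one variant is a dictionary word.

```python
from itertools import product

# Characters that OCR tends to confuse (illustrative, not exhaustive).
CONFUSIONS = {"l": "li1", "i": "il1", "1": "1li", "0": "0o", "o": "o0"}

# Hypothetical word list; a real run would load a full dictionary file.
DICTIONARY = {"fire", "file", "flare", "the", "was", "still", "burning"}


def candidates(word):
    """Yield every spelling obtained by swapping confusable characters."""
    options = [CONFUSIONS.get(ch, ch) for ch in word]
    for combo in product(*options):
        yield "".join(combo)


def correct_word(word, dictionary=DICTIONARY):
    """Return a known word unchanged; otherwise the single dictionary
    variant if there is exactly one, else the original word."""
    lower = word.lower()
    if lower in dictionary:
        return word
    hits = sorted({c for c in candidates(lower) if c in dictionary})
    return hits[0] if len(hits) == 1 else word


def correct_text(text):
    """Correct a whitespace-separated string word by word."""
    return " ".join(correct_word(w) for w in text.split())


print(correct_text("The flre was stlll burning"))
```

Ambiguous or unknown words are left alone so they can still be flagged for a human. The sketch ignores punctuation, returns corrections in lower case, and would get slow on long words full of confusable letters.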
Old 11-05-2009, 01:18 PM   #22
Jim Thompson
Member
Posts: 16
Karma: 10
Join Date: Nov 2009
Device: iPhone
I have learned (I think):

1. A PCMag review (I think it was in '08) suggested ABBYY or OmniPage, depending upon one's needs.

2. People tend to prefer the interface and support of ABBYY.

3. OmniPage may be better for people who want more customized automation or larger batches. For example, it will permit creating a zone template that will remove standard headers/footers (page #s) from every document in a batch fed through the ADF -- whereas that would have to be done a page at a time in ABBYY.

4. ABBYY seems to have a relationship with Fujitsu, so if buying a scanner and software, there might be less frustration with getting them to work well together.

I'm looking into OmniPage for my purposes. Trying to determine if I need the professional version and which ADF scanners seem most used with it. I'll report again when I'm smarter, or if you have insight, please enlighten me.
Old 11-05-2009, 01:46 PM   #23
Elfwreck
Grand Sorcerer
Posts: 5,185
Karma: 25133758
Join Date: Nov 2008
Location: SF Bay Area, California, USA
Device: Pocketbook Touch HD3 (Past: Kobo Mini, PEZ, PRS-505, Clié)
I have no idea about software/scanner combos; I never have any choice about scanners. (I've been working in digital imaging centers for 10 years; I get to know a lot of scanners, but they're not chosen based on how well they interface with the OCR software. Other aspects are always more important. And how well the hardware & software work together for batches of 10,000 pages is not particularly relevant for scanning a boxful of novels.)

I've worked a lot with FineReader; I'm barely aware of OmniPage. What I've heard is that FR is overall better, but that's because of features that don't matter to everyone.

FR will allow zoning templates of a sort; it'll let you create a page-zone template and apply it to as many pages as you like. It won't "ignore headers/footers," but if the pages are the same size & shape, the zoning template will do that.

FR allows customized zoning, which is important if you're working with pages with complex layouts. I have no idea how customizable OP's zoning is.

I gather that FR's OCR accuracy has increased since I've gotten it; I'm working with 7.0 and considering whether (or when) to upgrade. It does allow you to add words to its custom dictionary, so it'll learn to recognize names and obscure terms. I'd expect OmniPage to have something like that as well.

In regards to scanner choice: look at reviews at Newegg and other places. Watch out for comments like "skipped pages" and "doublefeeds." Ignore comments about pages-per-minute; they all scan slower than advertised. Find out if the scanning software allows you to set custom page sizes, because it's really annoying to scan paperback books onto letter-sized pages with lots of whitespace.
Old 11-05-2009, 02:07 PM   #24
DDHarriman
Guru
Posts: 860
Karma: 4380
Join Date: Feb 2008
Location: Almada, Portugal
Device: Cybook Gen3, Sony PRS 505, Kindle DXG and Samsung Galaxy Note
Hi Jim

I’m repeating what someone else said before; what you need is:
1 - Scanner - OpticBook 3600;
2 - Software - FineReader Pro or OmniPage Pro.

Forget the “speed” of a feeder; you need control from the beginning, so thinking the workflow through correctly is the key.

The scanner - you get full control of the book, directly from the book itself. No need to cut the binding, etc…
Also, you can cut the extra stuff (headers/page numbers) directly when scanning - if they are clearly separated in space from the rest of the text - just by taking care to make the scanning area small enough not to pick up that “extra info” when scanning… just by doing this you will see the time spent recovered.
I cannot call enough attention to the book part of this scanner - it’s designed for books, so it’s optimized for them! - in scanning speed, in coping with curved-spine problems (on a normal scanner you get black/grey there), etc…

The software - get the latest version of the one you choose - 10 for FineReader and 17 for OmniPage. I have found that over the years every new version has been an outstanding quantum leap over the last one.
My advice is, get FineReader for just one reason: it’s normally 1/2 to 1/3 cheaper than OmniPage.
But note, both are state of the art.

Finally, it will take time.
It does not matter how many shortcuts one takes: converting from analog text (paper) to digital, if one wants to do it well, with the result correctly formatted, etc… takes time.
Managers of professional digitizing projects with that final objective always allocate at least some 60 to 75% of the time for proofreading and formatting.
So… the secret is to get the best OCR results - meaning the fewest errors and formatting problems - and thus to take time to experiment with the correct scanning resolution, contrast, etc… before even beginning production… all the time one spends here pays back fully in time not lost later.

Hope I was of some help, best regards,

Last edited by DDHarriman; 11-05-2009 at 02:21 PM.
Old 11-05-2009, 03:19 PM   #25
Jim Thompson
Member
Posts: 16
Karma: 10
Join Date: Nov 2009
Device: iPhone
Quote:
Originally Posted by Elfwreck View Post
...how well the hardware & software work together for batches of 10,000 pages is not particularly relevant for scanning a boxful of novels.) ... FR will allow zoning templates of a sort; it'll let you create a page-zone template and apply it to as many pages as you like. It won't "ignore headers/footers," but if the pages are the same size & shape, the zoning template will do that..
I talked to ABBYY technical support this morning. The guy put me on hold and asked his colleagues. Then he said that I would need to touch each page to achieve the kind of zoning that I want: to automatically cut off the header/page number of every page in a novel, so that a series of words that began on page 4 and ended on page 5 could be searched as text and found as words in a single sentence. Do you think my misunderstanding was with him or with you? It's very important to me that I transform thousands of pages into searchable text without having to look at each page.

Hadn't previously seen Newegg. Thanks for that tip and the tip about "skipped pages" and "doublefeeds."
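
To make the goal above concrete (drop the header and page number so that a sentence running from page 4 onto page 5 can be found as one sentence), here is a rough Python sketch. It assumes the OCR output is already available as one plain-text string per page, that the running header is the first non-blank line, and that the page number is a digits-only last line. The function names and sample text are invented for illustration and do not come from any OCR product.

```python
import re

# A footer line that contains nothing but a page number.
PAGE_NUMBER = re.compile(r"^\s*\d+\s*$")


def body_lines(page_text, headers):
    """Return a page's non-blank lines without running header or page number."""
    lines = [ln for ln in page_text.splitlines() if ln.strip()]
    if lines and lines[0].strip() in headers:
        lines = lines[1:]
    if lines and PAGE_NUMBER.match(lines[-1]):
        lines = lines[:-1]
    return lines


def join_pages(pages, headers):
    """Concatenate page bodies into one searchable string."""
    pieces = []
    for page in pages:
        pieces.extend(line.strip() for line in body_lines(page, headers))
    text = " ".join(pieces)
    # Rejoin words split by a hyphen at a line or page break: "eve- ning" -> "evening".
    return re.sub(r"(\w)- (\w)", r"\1\2", text)


pages = [
    "MY NOVEL\nIt was a dark and stormy eve-\n4",
    "MY NOVEL\nning when the rain began.\n5",
]
print(join_pages(pages, {"MY NOVEL"}))
```

The de-hyphenation rule is deliberately naive: a word that really does carry a hyphen at the end of a line (say "well-" followed by "known") would be joined incorrectly, so a dictionary check would be a sensible refinement.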
Old 11-05-2009, 03:35 PM   #26
Jim Thompson
Member
Posts: 16
Karma: 10
Join Date: Nov 2009
Device: iPhone
Quote:
Originally Posted by DDHarriman View Post
I’m repeating what someone else said before; what you need is:
1 - Scanner - OpticBook 3600;
2 - Software - FineReader Pro or OmniPage Pro.

Forget the “speed” of a feeder; you need control from the beginning, so thinking the workflow through correctly is the key.
Thanks, DDHarriman. The reason I didn't pursue that suggestion is that it is just too discouraging to think I need to turn each page manually. I've heard ADFs and OCR are pretty reliable these days (especially since I don't have pictures, tables, columns, etc. to mess with), so I'd much rather use the more automated approach. Of course, I may be unrealistic in my expectations, but if I have to sacrifice, I will sacrifice on accuracy before I sacrifice on operator time. If accuracy is too bad, I will either give up or outsource to someone with specialized equipment and cheap labor.


Quote:
Originally Posted by DDHarriman View Post
get the latest version of the one you choose...I have found that over the years every new version has been an outstanding quantum leap over the last one. ... the secret is to get the best OCR results - meaning the fewest errors and formatting problems - and thus to take time to experiment with the correct scanning resolution, contrast, etc… before even beginning production… all the time one spends here pays back fully in time not lost later
Thank you. I really appreciate those insights! I often think of something a mentor/friend told me many years ago: "Experience is the most valuable commodity. And he who buys it second hand is wisest." I sincerely appreciate your sharing your experience so I don't have to glean these insights the hard way.
Old 11-05-2009, 04:45 PM   #27
Elfwreck
Grand Sorcerer
Posts: 5,185
Karma: 25133758
Join Date: Nov 2008
Location: SF Bay Area, California, USA
Device: Pocketbook Touch HD3 (Past: Kobo Mini, PEZ, PRS-505, Clié)
Quote:
Originally Posted by Jim Thompson View Post
I talked to ABBYY technical support this morning. The guy put me on hold and asked his colleagues. Then he said that I would need to touch each page to achieve the kind of zoning that I want: to automatically cut off the header/page number of every page in a novel, so that a series of words that began on page 4 and ended on page 5 could be searched as text and found as words in a single sentence. Do you think my misunderstanding was with him or with you? It's very important to me that I transform thousands of pages into searchable text without having to look at each page.
My guess: He's wrong; he, and his colleagues, have very likely not zoned thousands of pages and looked for shortcuts to make the process faster. Technically, he's correct; there is no auto-cutoff of headers/footers. However, there are ways to avoid zoning them.

*IF* the pages are the same size & shape, you can open one of them, zone it, save the zoned blocks, and apply those to all pages. I do this often. And then I scroll through the thumbnail view of the pages to see if there are any specific pages needing different zoning--first pages of chapters, or double-column index, or table of contents.

If the pages aren't the same size & shape because the scanning wasn't done with that in mind, I will auto-zone all the pages, flip through them one at a time, and manually adjust the zones, either deleting the tiny box around the header, or dragging it down to just the main body content. This takes longer, but still not as long as manually zoning each page.

The zoning, however it's done, is much quicker and less frustrating than the OCR correction. FineReader's OCR is very good--but checking it requires stopping at each *suspect*, most of which are done correctly & just need to be confirmed.
Old 11-05-2009, 05:24 PM   #28
Jim Thompson
Member
Posts: 16
Karma: 10
Join Date: Nov 2009
Device: iPhone
So, do I apply the zoning once per batch? And where text is unusual on a page (such as the first page of a chapter without a page header), will the zoning still treat the top of the page as a standard block (in this case empty) which gets ignored the same as pages that do have headers?
Old 11-05-2009, 06:29 PM   #29
Elfwreck
Grand Sorcerer
Posts: 5,185
Karma: 25133758
Join Date: Nov 2008
Location: SF Bay Area, California, USA
Device: Pocketbook Touch HD3 (Past: Kobo Mini, PEZ, PRS-505, Clié)
You pick a typical page in the batch, and zone that page the way you want it to read. For me, this means "draw a text box around the main body text, and extend it about a quarter inch into empty space all the way around." Or however much empty space is possible.

Then save out the zoning blocks. Then select all the other pages (except for weird ones), and apply the blocks to those pages. (Then go back & zone the weird ones. Anything with columns, pictures, or tables.) The blocks will appear in the same place on all the pages--so if they skipped the headers/footers on the template page, they won't catch them on other pages.

It won't matter if the zoning includes empty space, so if start-of-chapters start halfway down the page, you won't need to rezone those. If, however, start-of-chapters have an actual different layout from the other pages, they'd have to be done separately.
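
The "zone one page, apply the blocks to the rest" idea described above can be illustrated with a small Python sketch. This is not FineReader's block format or API; the `Zone` and `Word` classes, the inch-based coordinates, and the sample pages are assumptions made up for the example. One rectangle is drawn around the body text, and on every page only the words whose centre falls inside it are kept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """A rectangular text block, measured in inches from the top-left corner."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x, y):
        return self.left <= x <= self.right and self.top <= y <= self.bottom


@dataclass(frozen=True)
class Word:
    text: str
    x: float  # centre of the word, inches from the left edge
    y: float  # centre of the word, inches from the top edge


def apply_zone(pages, zone, skip=()):
    """Return {page index: text inside the zone} for every page not in `skip`."""
    result = {}
    for index, words in enumerate(pages):
        if index in skip:
            continue  # "weird" pages are left for manual zoning
        result[index] = " ".join(w.text for w in words if zone.contains(w.x, w.y))
    return result


body = Zone(left=0.5, top=1.0, right=5.0, bottom=7.5)
pages = [
    [Word("HEADER", 2.5, 0.5), Word("Body", 1.0, 2.0), Word("text", 1.6, 2.0), Word("12", 2.5, 8.0)],
    [Word("Chapter", 1.0, 4.0), Word("Two", 1.8, 4.0), Word("13", 2.5, 8.0)],
]
print(apply_zone(pages, body))
```

The second sample page starts halfway down, like the first page of a chapter, and the same zone still works because the empty space inside it contributes nothing.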
Old 11-06-2009, 01:08 AM   #30
Jim Thompson
Member
Posts: 16
Karma: 10
Join Date: Nov 2009
Device: iPhone
Thanks. I think I'm getting it. For my purposes, I would usually have about 200 pages -- all with the same zoning. I realize there may be exceptions, but I think a single zoning approach will work for all of most novels.

The most common scenario might be: 1. header, 2. text, 3. footer (page #). Some of the pages won't have a header or a footer, but they will tend to have nothing I need printed that high or low; i.e., in that zone block.

About page 4 might be the first that has the standard header/footer. So after scanning the 200 pages, I would use ABBYY Professional to select the 4th page, and I would use the mouse to draw 3 zones: 1) header, 2) text, 3) footer. I would tell ABBYY that I want to ignore (eliminate from OCR) the header and footer zones. Then I would save those zone blocks. Then I would want to select all 200 pages (is that one click/operation?) and apply the zone blocks to all 200 pages (again just one click/operation?). Is that how it would work?

Sorry to be so inexperienced.

I looked at some novels to write this and see a challenge. On some novels there is as little as 1/8" between the header and the body of the text. What's worse, the pages are not physically consistent. When I measure the inches from the top of the physical piece of paper to the printed header, it may vary by 1/8" or so. Also, the variation can be page-by-page (as opposed to 50 pages one way and 50 pages another). Ouch! Those books will be a challenge if I can't get them printed in a different format. Do you have any suggestions for easy block zoning with that kind of book or will those require eyeballing each page???
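
One possible way around the shifting header position, offered only as a sketch: OCR the whole page and then remove the running header by its content instead of by where it sits. The Python below assumes one plain-text string per page and treats any first line that opens a good share of the pages (ignoring page numbers) as a running header. The function names, the 30% threshold, and the sample pages are illustrative assumptions, not a feature of ABBYY or OmniPage.

```python
import re
from collections import Counter


def normalise(line):
    """Remove digits so "MY NOVEL 12" and "MY NOVEL 14" compare equal."""
    return re.sub(r"\d+", "", line).strip()


def find_running_headers(pages, min_share=0.3):
    """Return first-line texts that open at least `min_share` of the pages."""
    counts = Counter()
    for page in pages:
        lines = [ln for ln in page.splitlines() if ln.strip()]
        if lines:
            counts[normalise(lines[0])] += 1
    threshold = min_share * len(pages)
    return {text for text, n in counts.items() if text and n >= threshold}


def strip_running_header(page, headers):
    """Delete the first non-blank line if it is a known running header."""
    lines = page.splitlines()
    for i, line in enumerate(lines):
        if line.strip():
            if normalise(line) in headers:
                del lines[i]
            break
    return "\n".join(lines)


pages = [
    "MY NOVEL 12\nbody one",
    "JANE AUTHOR\nbody two",
    "MY NOVEL 14\nbody three",
    "JANE AUTHOR\nbody four",
    "Chapter Two\nbody five",
]
headers = find_running_headers(pages)
print([strip_running_header(p, headers) for p in pages])
```

The low threshold is there because many novels alternate two headers (author on one side, title on the other). A chapter title that happens to open many pages would be misread as a header, so the detected set is worth a quick look before stripping.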