MobileRead Forums - View Single Post - Best way to copy text from a PDF or MOBI?

Tex2002ans · 10-02-2013, 09:44 PM

Quote:

Originally Posted by mb2u

All my prospective conversions are non-fiction.

Glorious!

If you are serious about OCRing and getting high-quality work out there, I would not mind teaching you everything I know. (I am available over AIM/YIM/MSN/Skype/email.)

While you can OCR purely for your own personal benefit, that benefit rarely outweighs the cost (I spend about 8-15 hours just to get a great EPUB, and when you are just starting out, you might spend 40+ hours on a book).

In my opinion, you should try to tackle works that are in the public domain, or books that are released under a CC (Creative Commons) license. After finishing your OCR and making a clean EPUB, you can then post it on MobileRead/elsewhere so that the ENTIRE WORLD can benefit from your conversion (instead of just you).

Archive.org has scans of a massive number of public domain books. Or, if you are interested in some "training materials", I have a bunch of journal articles that need OCR (~13 pages each).

I believe tackling the easy/short stuff first would have built up my skills/familiarity with the tools way faster, and it definitely keeps the motivation up (it makes you feel like you are actually ACCOMPLISHING SOMETHING).

When I first jumped into OCR, I decided it would be a good idea to tackle all the hard stuff first... I wish I hadn't done that!

When I used to tackle large books that were complex/way out of my league, I would spend an entire week on one and feel like I had gotten nowhere!

Quote:

Originally Posted by mb2u

I know what you mean....it would destroy the flow of the story correcting errors in fiction. It would demolish it!

The few fiction books that I actually wanted to read (that were PDF only)... I pretty much just had to feed them through OCR, export the text, split it into chapters really fast, and run a few basic cleanup regexes. Then I read through each book in Sigil and fixed the errors as I came across them while reading. Took forever, but nothing was spoiled.
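
To give a rough idea of what "split chapters and run a few basic cleanup regex" can look like, here is a minimal Python sketch. The patterns, the function names (`split_chapters`, `clean_ocr_text`), and the assumption that chapter headings look like "Chapter 1" are all illustrative guesses, not the exact regexes used on those books; real OCR output usually needs patterns tuned per book.

```python
import re
from typing import List


def split_chapters(text: str) -> List[str]:
    """Split raw OCR text before every line starting with 'Chapter <number>'.

    The heading pattern is an assumption; adjust it to the book at hand.
    """
    parts = re.split(r"(?m)^(?=Chapter \d+)", text)
    return [p.strip() for p in parts if p.strip()]


def clean_ocr_text(text: str) -> str:
    """Apply a few basic cleanup regexes to one chunk of OCR text."""
    # Rejoin words hyphenated across a line break: "stor-\nmy" -> "stormy".
    # Caveat: this also merges genuinely hyphenated words split at a line end.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single line breaks inside a paragraph into spaces,
    # leaving blank lines (paragraph breaks) untouched.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces or tabs into one space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()


raw = (
    "Chapter 1\n\nIt was a  dark and stor-\nmy night.\nThe rain fell.\n\n"
    "Chapter 2\n\nMorning came."
)
chapters = [clean_ocr_text(c) for c in split_chapters(raw)]
for chapter in chapters:
    print(chapter)
    print("-----")
```

Splitting first and cleaning each chapter afterwards keeps the chunks small, which makes it easier to spot a regex that did something unexpected before fixing the remaining errors by hand in Sigil.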