02-06-2013, 08:31 PM   #2
theducks
Grand Sorcerer
Quote:
Originally Posted by Txomin
Good folk.

I have been given the task of extracting the text out of thousands of ebooks. Some of these are available in several formats (epub, lit, mobi, pdf, etc.).

For the purpose of extracting the text (unicode):

1. Which source format is the best to extract from?
2. Which source format would be fastest to extract from?

Preliminary experiments so far point to epub/mobi as the best and pdf as the worst in terms of accuracy.

Does anyone have any experience with this?

Thank you all in advance.
Your experimental results sound about right.

I know you can open (explode) an EPUB pretty easily (just use Tweak, which is built in). HTML is also accessible.
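
For a batch job over thousands of books, the same "explode" step can be scripted, since an EPUB is a ZIP archive of (X)HTML files. What follows is only a minimal sketch using the Python standard library, not what Tweak does internally. The names `epub_to_text` and `_TextCollector` are made up for illustration, and the sketch assumes UTF-8 content and reads files in name order rather than the reading order given by the book's spine.

```python
import zipfile
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Hypothetical helper: collects text, skipping <head>, <script>, <style>."""

    SKIP = {"script", "style", "head"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text found outside the skipped elements.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def epub_to_text(source):
    """Return the text of every (X)HTML file in an EPUB (path or file object).

    Assumptions: content is UTF-8, and name order is good enough;
    the true reading order lives in the OPF spine, which is ignored here.
    """
    chunks = []
    with zipfile.ZipFile(source) as book:
        for name in sorted(book.namelist()):
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                collector = _TextCollector()
                raw = book.read(name)
                collector.feed(raw.decode("utf-8", errors="replace"))
                chunks.append(" ".join(collector.parts))
    return "\n\n".join(chunks)
```

Calling `epub_to_text("book.epub")` (the file name is a placeholder) returns one Unicode string per book, with a blank line between the files it found. DRM-protected files will not open this way.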

If you HAVE Acrobat, the PDF might not be so bad.