02-06-2013, 08:31 PM   #2
theducks
Grand Sorcerer
Quote:
Originally Posted by Txomin
Good folk.

I have been given the task of extracting the text from thousands of ebooks. Some of these are available in several formats (epub, lit, mobi, pdf, etc.).

For the purpose of extracting the text (Unicode):

1. Which source format is the best to extract from?
2. Which source format would be fastest to extract from?

Preliminary experiments so far point to epub/mobi as the best and pdf as the worst in terms of accuracy.

Does anyone have any experience with this?

Thank you all in advance.
Your preliminary findings sound about right to me.

I know you can open (explode) an EPUB pretty easily (just use Tweak: it's built in). HTML is also accessible.
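For a batch of thousands you probably won't want to explode each book by hand. An EPUB is a ZIP archive with (X)HTML files inside, so a rough sketch using only the Python standard library might look like the one below. The function name `epub_to_text` and the helper class are made up for illustration. It assumes the content files are UTF-8 and reads them in archive order rather than the reading order declared in the OPF spine, so treat it as a starting point, not a finished extractor.

```python
import zipfile
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def epub_to_text(path):
    """Return the text of every (X)HTML file found in an EPUB.

    Assumptions (not from the original post): files are taken in
    archive order, not OPF spine order, and are decoded as UTF-8.
    """
    chunks = []
    with zipfile.ZipFile(path) as book:
        for name in book.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                parser = _TextCollector()
                parser.feed(book.read(name).decode("utf-8", errors="replace"))
                parser.close()
                chunks.append(" ".join(parser.parts))
    return "\n\n".join(chunks)
```

Calling `epub_to_text("some_book.epub")` (a hypothetical file name) returns one string with a blank line between content files. DRM-protected books will not work this way, and if reading order matters you would need to parse the OPF file first.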

If you HAVE Acrobat, the PDF might not be so bad.