Originally Posted by Txomin
I have been given the task of extracting the text out of thousands of ebooks. Some of these are available in several formats (epub, lit, mobi, pdf, etc).
For the purpose of extracting the text (unicode):
1. Which source format is the best to extract from?
2. Which source format would be fastest to extract from?
Preliminary experiments so far point to epub/mobi as the best and pdf as the worst in terms of accuracy.
Does anyone have any experience with this?
Thank you all in advance.
Your experiment results sound about right.
I know you can open (explode) an EPUB pretty easily (just use Tweak; it's built in). HTML is also accessible.
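If you would rather script it for thousands of files, here is a rough sketch of the same idea in Python, using only the standard library. It relies on the fact that an EPUB is a ZIP archive with (X)HTML files inside. The function names (`html_to_text`, `epub_to_text`) are just mine for illustration. It assumes the books are DRM-free and UTF-8 encoded, and it reads the files in archive name order, which is not necessarily the reading order (for that you would have to parse the spine in the OPF file).

```python
import zipfile
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(markup):
    """Strip tags from one HTML/XHTML string, one text run per line."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return "\n".join(parser.parts)


def epub_to_text(path):
    """Return the text of every (X)HTML file in the EPUB, in name order.

    Assumption: files are UTF-8; undecodable bytes are replaced.
    """
    chunks = []
    with zipfile.ZipFile(path) as book:
        for name in sorted(book.namelist()):
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                markup = book.read(name).decode("utf-8", errors="replace")
                chunks.append(html_to_text(markup))
    return "\n\n".join(chunks)
```

You would call it as `text = epub_to_text("some_book.epub")` (the filename is a placeholder) and write the result out however you like.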
If you HAVE Acrobat, the PDFs might not be so bad.