Old 02-06-2013, 07:57 PM   #1
Txomin
Best format to extract text from: speed vs. accuracy

Good folk.

I have been given the task of extracting the text from thousands of ebooks. Some of them are available in several formats (epub, lit, mobi, pdf, etc.).

For the purpose of extracting the text (as Unicode):

1. Which source format gives the most accurate extraction?
2. Which source format is the fastest to extract from?

My preliminary experiments point to epub/mobi as the most accurate and pdf as the least accurate.
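
To illustrate why epub tends to be easy to extract from: an epub file is a ZIP archive of XHTML documents, so the text can be pulled out with a ZIP reader and an HTML parser. Below is a minimal Python sketch under that assumption. It is not the method used in the experiments above; the names epub_to_text and _TextCollector are made up for illustration, and the sketch ignores DRM and reads files in archive order rather than the reading order given in the package's OPF spine.

```python
import zipfile
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Hypothetical helper: collects text, skipping <head>, <script>, <style>."""

    SKIP = {"head", "script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def epub_to_text(source):
    """Return the text of every (X)HTML file in an epub, one line per file.

    `source` is a path or a binary file-like object.
    Assumption: files are UTF-8 and archive order is good enough;
    a real tool should follow the OPF spine instead.
    """
    chunks = []
    with zipfile.ZipFile(source) as zf:
        for name in zf.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                parser = _TextCollector()
                parser.feed(zf.read(name).decode("utf-8", errors="replace"))
                parser.close()
                # Collapse runs of whitespace left over from the markup.
                chunks.append(" ".join("".join(parser.parts).split()))
    return "\n".join(chunks)
```

Calling epub_to_text("book.epub") (a placeholder path) returns a single Unicode string. Timing this over a sample of files in each format would be one way to compare speed as well as accuracy.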

Does anyone have any experience with this?

Thank you all in advance.