Best ebook format to extract text from: speed vs. accuracy
Good folk.
I have been given the task of extracting the text from thousands of ebooks. Some of these are available in several formats (epub, lit, mobi, pdf, etc.).
For the purpose of extracting the text (unicode):
1. Which source format is the best to extract from?
2. Which source format would be fastest to extract from?
My preliminary experiments point to epub/mobi as the most accurate and pdf as the least accurate.
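To make the epub case concrete, here is a minimal sketch of the kind of extraction I mean, using only the Python standard library. It relies on an epub being a ZIP archive of (X)HTML files. The names `epub_to_text` and `_TextCollector` are made up for illustration, and the sketch assumes UTF-8 content and reads files in archive-name order rather than the reading order declared in the OPF spine, so it is a rough baseline and not a finished extractor.

```python
import zipfile
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping <script>/<style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def epub_to_text(source):
    """Return the text of every (X)HTML member of an epub.

    `source` is a file path or a binary file-like object.
    Assumption: members are UTF-8; undecodable bytes are replaced.
    Assumption: archive-name order is close enough to reading order.
    """
    chunks = []
    with zipfile.ZipFile(source) as zf:
        for name in sorted(zf.namelist()):
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                parser = _TextCollector()
                parser.feed(zf.read(name).decode("utf-8", errors="replace"))
                parser.close()
                chunks.append("\n".join(parser.parts))
    return "\n\n".join(chunks)
```

Calling `epub_to_text("book.epub")` (the file name is a placeholder) returns one Unicode string per book. There is no comparably simple route for pdf, since a pdf stores positioned glyphs rather than marked-up text, which may explain part of the accuracy gap.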
Does anyone have any experience with this?
Thank you all in advance.