#1
Junior Member
Posts: 3
Karma: 10
Join Date: Feb 2013
Device: None
Best format to extract text from: speed vs. accuracy
I have been given the task of extracting the text out of thousands of ebooks. Some of these are available in several formats (epub, lit, mobi, pdf, etc.). For the purpose of extracting the text (Unicode):
1. Which source format is the best to extract from?
2. Which source format would be fastest to extract from?
Preliminary experiments so far point to epub/mobi as the best and pdf as the worst in terms of accuracy. Does anyone have any experience with this? Thank you all in advance.
#2
Staff to 4 Cats
Posts: 10,697
Karma: 2485850
Join Date: Aug 2009
Location: The (original) Silicon Valley, USA
Device: Galaxy Tab 2,Black Astak PEz, K4NT(now Wifes)
I know you can open (explode) an EPUB pretty easily (just use Tweak, which is built in). HTML is also accessible. If you have Acrobat, the PDF might not be so bad.
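Since an EPUB is a ZIP archive of (X)HTML files, the "explode it and read the HTML" route can be sketched with nothing but the Python standard library. This is a rough illustration, not what Tweak does internally: `epub_to_text` is a made-up helper name, files are read in archive-name order rather than the reading order given in the book's OPF spine, and DRM-protected books will not work.

```python
import zipfile
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def epub_to_text(source):
    """Return the text of every (X)HTML file in an EPUB.

    Assumption: archive-name order is close enough to reading order.
    `source` may be a path or a binary file-like object.
    """
    chunks = []
    with zipfile.ZipFile(source) as zf:
        for name in sorted(zf.namelist()):
            if name.lower().endswith((".html", ".xhtml", ".htm")):
                collector = _TextCollector()
                collector.feed(zf.read(name).decode("utf-8", errors="replace"))
                collector.close()
                chunks.append(" ".join(collector.parts))
    return "\n\n".join(chunks)
```

Calling `epub_to_text("book.epub")` on an unprotected file should give one blank-line-separated chunk of plain text per content file.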
__________________
Using: Ubuntu(32 bit):Oneric,Precise and XPpro SP3, W7HP(64)- - Libre Office w/Writer2EPUB
#3
Junior Member
Posts: 3
Karma: 10
Join Date: Feb 2013
Device: None
What is Tweak? I've been playing with ebook-convert.
#4
Staff to 4 Cats
Posts: 10,697
Karma: 2485850
Join Date: Aug 2009
Location: The (original) Silicon Valley, USA
Device: Galaxy Tab 2,Black Astak PEz, K4NT(now Wifes)
This tool lets you unpack the book's pieces to make (small) edits, then put them back together when done, maintaining the original structure. For bigger edits (adding or removing chapters...), Sigil is easier for the novice-to-intermediate user.
#5
Guru
Posts: 628
Karma: 433530
Join Date: Mar 2012
Location: NSW Australia
Device: None - see signature
AFAIK, what you see in the EPUB viewer is what you'll get in the TXT output file, but without any formatting, styling, or images. The important settings are the TXT Output settings.
Given that EPUB is Calibre's native format, I would anticipate it might be faster.
If you don't have access to PDF editing software such as Acrobat or Nitro to do the conversions, then you could try
I suggest you steer clear of the "Free PDF to ..." converters unless you get a specific recommendation, as in the case of MobiCreator.
BR
__________________
Windows 7Pro 64bit Desktop, Mint UL portable with Firefox & EpubReader
Last edited by BetterRed; 02-06-2013 at 10:29 PM.
#6
Mobile Reader Geek
Posts: 34,205
Karma: 13801264
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Sony Reader PRS-650, iPad
Calibre can convert to TXT. Just dump your eBooks (but not the PDFs) into Calibre and batch convert to TXT. You can leave it running overnight, so you don't have to care which format is faster: it will just do the work while you are not at the computer. I don't know the maximum you can queue at one time, but you could do it with Calibre.
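For anyone who prefers the command line to the GUI queue, the same overnight batch can be sketched around calibre's `ebook-convert` tool mentioned earlier in the thread. This assumes `ebook-convert` is on the PATH, that copies of the same book share a file name and differ only in extension, and that the preference order (EPUB, then MOBI, LIT, and PDF last) reflects the accuracy impressions reported in the first post. `pick_sources` and `convert_all` are illustrative names, not part of calibre.

```python
import subprocess
from pathlib import Path

# Assumed preference order, based on the thread: EPUB/MOBI first, PDF last.
PREFERRED = [".epub", ".mobi", ".lit", ".pdf"]


def pick_sources(folder):
    """For each book (grouped by file stem), pick the most preferred format."""
    best = {}
    for path in sorted(Path(folder).iterdir()):
        ext = path.suffix.lower()
        if ext not in PREFERRED:
            continue
        current = best.get(path.stem)
        if current is None or PREFERRED.index(ext) < PREFERRED.index(current.suffix.lower()):
            best[path.stem] = path
    return [best[stem] for stem in sorted(best)]


def convert_all(folder, out_dir):
    """Run ebook-convert once per selected book and return the sources that failed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for src in pick_sources(folder):
        dst = out_dir / (src.stem + ".txt")
        # ebook-convert infers the output format from the .txt extension.
        result = subprocess.run(["ebook-convert", str(src), str(dst)])
        if result.returncode != 0:
            failed.append(src)
    return failed
```

Something like `failed = convert_all("books", "txt_out")` could then run unattended, with the returned list showing which books need a second look (for example by falling back to another format).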
#7
Junior Member
Posts: 3
Karma: 10
Join Date: Feb 2013
Device: None
Thank you all for the answers and leads.