|  02-06-2013, 07:57 PM | #1 | 
| Junior Member  Posts: 3 Karma: 10 Join Date: Feb 2013 Device: None | 
				
				Best format to extract text from: speed vs accuracy
			 
			
			Good folk, I have been given the task of extracting the text from thousands of ebooks. Some of these are available in several formats (epub, lit, mobi, pdf, etc.). For the purpose of extracting the text (Unicode): 1. Which source format is the best to extract from? 2. Which source format would be fastest to extract from? Preliminary experiments so far point to epub/mobi as the best and pdf as the worst in terms of accuracy. Does anyone have any experience with this? Thank you all in advance. | 
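
When a title exists in several formats, one way to act on the experiments above is to choose a single source file per title before extracting anything. The following is only a minimal Python sketch: the function name `pick_sources` is hypothetical, it assumes copies of the same title share a file stem, and the ranking of `lit` between mobi and pdf is a guess (only "epub/mobi best, pdf worst" comes from the post).

```python
from collections import defaultdict
from pathlib import PurePath

# Assumed preference order, best first. Only "epub/mobi best, pdf worst"
# comes from the post; the position of "lit" is a guess.
PREFERENCE = ["epub", "mobi", "lit", "pdf"]


def pick_sources(paths):
    """Group files by title (path without extension) and keep the
    most-preferred format for each title. Unknown extensions are ignored."""
    by_title = defaultdict(list)
    for p in map(PurePath, paths):
        ext = p.suffix.lower().lstrip(".")
        if ext in PREFERENCE:
            by_title[p.with_suffix("")].append((PREFERENCE.index(ext), p))
    return {
        str(title): str(min(candidates, key=lambda c: c[0])[1])
        for title, candidates in by_title.items()
    }
```

Reordering `PREFERENCE` is all it takes to rerun the comparison with a different ranking.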
|   |   | 
|  02-06-2013, 08:31 PM | #2 | |
| Well trained by Cats  Posts: 31,240 Karma: 61360164 Join Date: Aug 2009 Location: The Central Coast of California Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A | 
  I know you can open (explode) an EPUB pretty easily (just use Tweak: it's built in). HTML is also accessible. If you HAVE Acrobat, the PDF might not be so bad. | |
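
"Exploding" an EPUB works because the file is a ZIP archive of (X)HTML documents, so the text can also be pulled out with nothing but the Python standard library. This is a rough sketch rather than how Tweak or Calibre does it: `epub_to_text` is a hypothetical name, and the documents are read in archive name order, which is an assumption (a real EPUB declares its reading order in the OPF spine, and DRM-protected files will not work).

```python
import zipfile
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect visible text, skipping <head>, <script> and <style>."""

    SKIP = {"head", "script", "style"}
    BLOCK = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "br" and not self._skip_depth:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            if self._skip_depth:
                self._skip_depth -= 1
        elif tag in self.BLOCK and not self._skip_depth:
            self.parts.append("\n")  # end of a block element ends the line

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

    def text(self):
        # Collapse runs of whitespace inside each line and drop empty lines.
        lines = (" ".join(l.split()) for l in "".join(self.parts).splitlines())
        return "\n".join(l for l in lines if l)


def epub_to_text(path_or_file):
    """Return the text of every (X)HTML document in the archive,
    in archive name order (an assumption, see above)."""
    chunks = []
    with zipfile.ZipFile(path_or_file) as zf:
        for name in sorted(zf.namelist()):
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                parser = _TextCollector()
                parser.feed(zf.read(name).decode("utf-8", errors="replace"))
                parser.close()
                chunks.append(parser.text())
    return "\n\n".join(chunks)
```

The result is a Unicode `str`; formatting and images are dropped, which matches what a plain TXT export would give.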
|   |   | 
|  02-06-2013, 08:38 PM | #3 | |
| Junior Member  Posts: 3 Karma: 10 Join Date: Feb 2013 Device: None | 
 What is Tweak? I've been playing with ebook-convert. | |
|   |   | 
|  02-06-2013, 09:45 PM | #4 | |
| Well trained by Cats  Posts: 31,240 Karma: 61360164 Join Date: Aug 2009 Location: The Central Coast of California Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A | 
 This tool lets you unpack the book's pieces to make (small) edits, then put them back together when done, maintaining the original structure. For bigger edits (adding/removing chapters, ...), Sigil is easier for the novice-to-intermediate user. | |
|   |   | 
|  02-06-2013, 10:24 PM | #5 | |
| null operator (he/him)  Posts: 22,005 Karma: 30277294 Join Date: Mar 2012 Location: Sydney Australia Device: none | 
 AFAIK what you see in the EPUB Viewer is what you'll get in the TXT output file, but without any formatting/styling or images. The important settings are the TXT Output settings. Given that EPUB is Calibre's native format, I would anticipate it might be faster. If you don't have access to PDF editing software like Acrobat, Nitro, etc. to do the conversions, then you could try 
 I suggest you steer clear of the "Free PDF to ..." converters unless you get a specific recommendation - as in the case of MobiCreator. BR Last edited by BetterRed; 02-06-2013 at 10:29 PM. | |
|   |   | 
|  02-06-2013, 10:41 PM | #6 | 
| Resident Curmudgeon            Posts: 80,665 Karma: 150249619 Join Date: Nov 2006 Location: Roslindale, Massachusetts Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3 | 
			
			Calibre can convert to TXT. Just dump your eBooks (not the PDFs) into Calibre and batch convert to TXT. You can leave it running overnight. You don't have to care which is faster, as it will just do it while you are not at the computer. I don't know the maximum you can queue at one time, but you could do it with Calibre.
		 | 
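
For anyone who prefers a script to the GUI queue, the same overnight batch can be driven through Calibre's `ebook-convert` command-line tool, which picks the output format from the output file's extension. This is a minimal Python sketch under assumptions: `build_command` and `convert_all` are hypothetical helper names, `ebook-convert` is assumed to be on the PATH, and no TXT output options are passed.

```python
import subprocess
from pathlib import Path


def build_command(src, out_dir, exe="ebook-convert"):
    """Return the argv that asks ebook-convert to write <out_dir>/<stem>.txt.
    The .txt extension is what selects TXT output."""
    dst = Path(out_dir) / (Path(src).stem + ".txt")
    return [exe, str(src), str(dst)]


def convert_all(sources, out_dir, run=subprocess.run):
    """Convert each source in turn and return the sources that failed.
    `run` is injectable so the loop can be exercised without Calibre."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    failed = []
    for src in sources:
        result = run(build_command(src, out_dir), capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(str(src))  # keep going; review failures afterwards
    return failed
```

A call such as `convert_all(sorted(Path("library").rglob("*.epub")), "txt_out")` (both folder names are placeholders) runs the books one after another and leaves a list of failures to inspect in the morning.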
|   |   | 
|  02-07-2013, 12:54 AM | #7 | 
| Junior Member  Posts: 3 Karma: 10 Join Date: Feb 2013 Device: None | 
			
			Thank you all for the answers and leads.
		 | 
|   |   | 
|  | 