#1
| 
			
			
			
Junior Member
Posts: 3
Karma: 10
Join Date: Feb 2013
Device: None
	
	
	
		
		
			
			 
				
Best format to extract text from: speed vs accuracy
			 
			
			
			Good folk. 
		
	
		
		
		
		
		
		
		
		
		
		
	
I have been given the task of extracting the text out of thousands of ebooks. Some of these are available in several formats (epub, lit, mobi, pdf, etc.). For the purpose of extracting the text (Unicode):

1. Which source format is the best to extract from?
2. Which source format would be fastest to extract from?

Preliminary experiments so far point to epub/mobi as the best and pdf as the worst in terms of accuracy. Does anyone have any experience with this? Thank you all in advance.
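One way to act on the preliminary result above, when a title exists in several formats, is to rank the formats and take the best one available. A minimal sketch, assuming the order epub, mobi, lit, pdf: only "epub/mobi best, pdf worst" comes from the experiments above, the position of lit is a guess, and `PREFERENCE` and `pick_source` are made-up names.

```python
from pathlib import Path

# Assumed ranking: epub/mobi first and pdf last follow the experiments above;
# the position of lit is a guess.
PREFERENCE = [".epub", ".mobi", ".lit", ".pdf"]


def pick_source(files):
    """Return the most preferred file among one book's formats, or None."""
    rank = {ext: i for i, ext in enumerate(PREFERENCE)}
    candidates = [Path(f) for f in files if Path(f).suffix.lower() in rank]
    if not candidates:
        return None
    # Lowest rank index wins; extension matching is case-insensitive.
    return min(candidates, key=lambda p: rank[p.suffix.lower()])
```

For example, `pick_source(["a.pdf", "a.mobi"])` would choose the mobi file and skip the PDF.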
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
#2
| 
			
			
			
Well trained by Cats
Posts: 31,267
Karma: 61916422
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2, Kobo Aura2v1, K4NT (Fixed: New Bat.), Galaxy Tab A
	
	
	
		
		
		
		
	
I know you can open (explode) an EPUB pretty easily (just use Tweak: it's built in). HTML is also accessible. If you HAVE Acrobat, the PDF might not be so bad.
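To illustrate why EPUB is easy to open up: an EPUB is a ZIP container holding (X)HTML files, so the text can be pulled out with nothing but the Python standard library. This is only a rough sketch, not what Tweak does internally. It reads the files in name order instead of following the book's spine, keeps any text found in `<head>` (such as the title), and joins text fragments with spaces, which can split a word that is broken up by inline tags. `epub_text` is a hypothetical helper name.

```python
import zipfile
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def epub_text(path):
    """Return the text of every (X)HTML file in the EPUB, in name order."""
    chunks = []
    with zipfile.ZipFile(path) as zf:
        for name in sorted(zf.namelist()):
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                parser = _TextCollector()
                parser.feed(zf.read(name).decode("utf-8", errors="replace"))
                parser.close()
                chunks.append(" ".join(parser.parts))
    return "\n\n".join(chunks)
```

A DRM-protected book will not yield readable text this way, and real books may need the spine order from the OPF file to get chapters in the right sequence.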
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
    
#3
| 
			
			
			
			 Junior Member 
			
	
	
	
		
		
		
		
	
What is Tweak? I've been playing with ebook-convert.
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
#4
| 
			
			
			
			 Well trained by Cats 
			
	
	
	
		
		
		
		
	
This tool lets you unpack the book's pieces to make (small) edits, then put them back together when done, maintaining the original structure. For bigger edits (add/remove chapters...), Sigil is easier for the novice-intermediate.
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
#5
| 
			
			
			
null operator (he/him)
Posts: 22,018
Karma: 30277294
Join Date: Mar 2012
Location: Sydney Australia
Device: none
	
	
	
		
		
		
		
	
AFAIK, what you see in the EPUB Viewer is what you'll get in the TXT output file, but without any formatting/styling or images. The important settings are the TXT Output settings.

Given that EPUB is Calibre's native format, I would anticipate it might be faster.

If you don't have access to PDF editing software like Acrobat, Nitro, etc. to do the conversions, then you could try

I suggest you steer clear of the "Free PDF to ..." converters unless you get a specific recommendation, as in the case of MobiCreator.

BR

Last edited by BetterRed; 02-06-2013 at 11:29 PM.
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
    
#6
| 
			
			
			
Resident Curmudgeon
Posts: 80,782
Karma: 150249619
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
	
	
	
		
		
		
		
		 
			
Calibre can convert to TXT. Just dump your eBooks (everything except the PDFs) into Calibre and batch convert to TXT. You can leave it running overnight, so you don't have to care which format is faster: it will just work through the queue while you are away from the computer. I don't know the maximum you can queue at one time, but you could do it with Calibre.
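For anyone who would rather script the batch than queue it in the GUI, the same conversion can be driven through calibre's `ebook-convert` command-line tool, which picks the output format from the output file's extension. A rough sketch, assuming `ebook-convert` is on the PATH; the function names and the directory layout are hypothetical, and PDFs are left out as suggested above.

```python
import subprocess
from pathlib import Path


def build_command(src, out_dir):
    """Build the ebook-convert call for one book; .txt selects TXT output."""
    src = Path(src)
    dst = Path(out_dir) / (src.stem + ".txt")
    return ["ebook-convert", str(src), str(dst)]


def convert_all(in_dir, out_dir, exts=(".epub", ".mobi", ".lit")):
    """Convert every matching book under in_dir; return the files that failed."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failed = []
    for src in sorted(Path(in_dir).rglob("*")):
        if src.suffix.lower() not in exts:
            continue
        # One process per book: slow but simple, and one bad file
        # does not stop the rest of the run.
        result = subprocess.run(build_command(src, out),
                                capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(src)
    return failed
```

Calling `convert_all("library", "txt_out")` (both directory names are placeholders) would leave one `.txt` per book in `txt_out`. Note that two books with the same file name in different formats would overwrite each other's output, so pick one source format per title first.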
		 
		
	
		
		
		
		
		
		
		
		
		
		
	
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
#7
| 
			
			
			
			 Junior Member 
			
	
	
	
		
		
		
		
		 
			
			Thank you all for the answers and leads.
		 
		
	
		
		
		
		
		
		
		
		
		
		
	
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		