| 
			
			 | 
		#1 | 
| 
			
			
			
			 Member 
			
			Posts: 15 
				Karma: 10 
				Join Date: Apr 2011 
				Location: CA 
				
				
				Device: Nook Simple Touch, 4th Gen Kindle Basic 
				
				
				 | 
	
	
	
		
		
			
			 
				
				Searching for corrupted epubs?
			 
			
			
			I have about 5,000 ePub books, and I've begun to notice that a few of them are "corrupted" (all weird characters and partial words) when I put them on my Nook and open them. I'm pretty sure they were bad files that I downloaded from Project Gutenberg or another site.  
		
	
		
		
		
		
		
		
		
		
		
		
	
	Is there a feature in Calibre that can search for messed up files like these so I can delete them?  | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#2 | |
| 
			
			
			
			 Well trained by Cats 
			
			Posts: 31,267 
				Karma: 61916422 
				Join Date: Aug 2009 
				Location: The Central Coast of California 
				
				
				Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A 
				
				
				 | 
	
	
	
		
		
		
		
		 Quote: 
	
 That sounds like bad OCR or the wrong character encoding. Neither is normal for a completed Project Gutenberg (PG) book. Are you sure you did not get a 'Proofing Project' file? (Although aren't those normally given out as partials?) PG books all have a PG scan notes page up front.  | 
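
To illustrate the encoding half of that diagnosis, here is a minimal Python sketch of how text read with the wrong character set turns into "weird characters". The sample word and the choice of Windows-1252 as the mismatched encoding are assumptions for illustration only; the actual files in question may be damaged in some other way.

```python
original = "café"

# UTF-8 bytes misread as Windows-1252: the accented letter becomes two odd characters
mojibake = original.encode("utf-8").decode("cp1252")

# Windows-1252 bytes misread as UTF-8: the invalid byte is replaced by U+FFFD
replaced = original.encode("cp1252").decode("utf-8", errors="replace")

print(mojibake)  # cafÃ©
print(replaced)  # caf�
```

The second case is where the "�" replacement character typically comes from: the software could not make sense of a byte and substituted U+FFFD for it.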
|
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#3 | 
| 
			
			
			
			 Member 
			
			Posts: 15 
				Karma: 10 
				Join Date: Apr 2011 
				Location: CA 
				
				
				Device: Nook Simple Touch, 4th Gen Kindle Basic 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			I downloaded from archive, ManyBooks, Google Books, and PG, and that was years ago. I have no idea what I might have done to get these weird files. To be honest, I think some of them may have come from Google eBooks.
		 
		
	
		
		
		
		
		
		
		
		
		
		
	
	 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#4 | |
| 
			
			
			
			 Fanatic 
			
			Posts: 515 
				Karma: 1470724 
				Join Date: Jul 2013 
				Location: Quebec CA 
				
				
				Device: android 4 (samsung tablet and asus tablet) 
				
				
				 | 
	
	
	
		
		
		
		
		 Quote: 
	
  | 
|
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#5 | |
| 
			
			
			
			 Zealot 
			
			Posts: 108 
				Karma: 810 
				Join Date: Jul 2012 
				
				
				
				Device: Kobo 
				
				
				 | 
	
	
	
		
		
			
			 
				
				Bulk Library Search for OCR Warning Indicators
			 
			Quote: 
	
 Having done that, and having used a temporary column to label the selected books that contain those OCR warnings, I can determine the number of occurrences of those characters within any one book using the "Search - Count All" feature in Sigil or in Calibre's book editor. But does anyone know of a Calibre feature that could do this in bulk: count the occurrences of such character strings in each book of that selected subset and store the number for each book in a temporary sort column in Calibre, so that the most problematic books are easier to find?  | 
|
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#6 | |
| 
			
			
			
			 Well trained by Cats 
			
			Posts: 31,267 
				Karma: 61916422 
				Join Date: Aug 2009 
				Location: The Central Coast of California 
				
				
				Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A 
				
				
				 | 
	
	
	
		
		
		
		
		 Quote: 
	
 Some operating systems display a square; others display a box containing the (UTF?) code digits.  | 
|
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#7 | |
| 
			
			
			
			 Zealot 
			
			Posts: 108 
				Karma: 810 
				Join Date: Jul 2012 
				
				
				
				Device: Kobo 
				
				
				 | 
	
	
	
		
		
		
		
		 Quote: 
	
 Theducks, in light of your experience with Calibre, I guess a second takeaway from your response is that there doesn't seem to be any current functionality in Calibre or its add-ons that would let a user make a bulk determination, as per my prior post, of the number of occurrences of such character strings in the selected subset of problem books.  | 
|
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#8 | 
| 
			
			
			
			 Zealot 
			
			Posts: 108 
				Karma: 810 
				Join Date: Jul 2012 
				
				
				
				Device: Kobo 
				
				
				 | 
	
	
	
		
		
			
			 
				
				Bulk Library Search for OCR Warning Indicators (partially solved)
			 
			
			
			It turns out that the Search ePub feature under the Quality Check add-on for Calibre includes an option to "show all occurrences".  That option at least indirectly provides a bulk search capability to identify the number of times a special OCR-error character like � appears in the various ePubs. 
		
	
		
		
		
		
		
		
		
		
		
		
	
	If the "show all occurrences" option is selected, every occurrence within the searched ePubs is listed on a separate line in the results log. (At first it may be best to limit the search to the subset of books already identified as having at least one occurrence.) That list can then be copied into something like Excel and analyzed to identify the most problematic books.  | 
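
For anyone who would rather skip the copy-into-Excel step, the same per-book tally can be sketched outside Calibre in a few lines of Python. This is only a rough sketch and not how Quality Check works internally: it assumes the ePubs are ordinary DRM-free zip archives whose text lives in UTF-8 `.xhtml`/`.html`/`.htm` files, and the function names `count_in_epub` and `scan_library` are made up for this example.

```python
import zipfile
from pathlib import Path

REPLACEMENT_CHAR = "\ufffd"  # the "�" character searched for above
TEXT_SUFFIXES = (".xhtml", ".html", ".htm")


def count_in_epub(epub_path, needle=REPLACEMENT_CHAR):
    """Return how many times `needle` occurs in the text files of one ePub."""
    total = 0
    with zipfile.ZipFile(epub_path) as zf:
        for name in zf.namelist():
            if name.lower().endswith(TEXT_SUFFIXES):
                # errors="ignore" skips undecodable bytes instead of turning
                # them into new U+FFFD characters that would inflate the count
                text = zf.read(name).decode("utf-8", errors="ignore")
                total += text.count(needle)
    return total


def scan_library(root, needle=REPLACEMENT_CHAR):
    """Return (count, path) pairs for every ePub under `root`, worst first."""
    results = []
    for path in Path(root).rglob("*.epub"):
        try:
            results.append((count_in_epub(path, needle), str(path)))
        except zipfile.BadZipFile:
            # Not a valid zip archive at all: flag it with -1 for a manual look
            results.append((-1, str(path)))
    return sorted(results, reverse=True)
```

Calling `scan_library()` on the library folder returns the books sorted with the highest counts first, and the list can be printed or written to a CSV file. Passing a different `needle` (for example `"^"`) repeats the tally for another warning character.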
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#9 | 
| 
			
			
			
			 Zealot 
			
			Posts: 108 
				Karma: 810 
				Join Date: Jul 2012 
				
				
				
				Device: Kobo 
				
				
				 | 
	
	
	
		
		
			
			 
				
				small black square too
			 
			
			
			As per the approach described above, the following is another OCR error warning symbol that, in addition to "^" and "�", can be used as a search criterion in Search ePub: "■". 
		
	
		
		
		
		
		
		
		
		
		
		
	
	As noted above, "show all occurrences" can be used to bulk-determine which ePubs may have the worst problems. It's not yet clear which version of the code-embedded white square, e.g. "", might be used as a broad search criterion.  | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#10 | 
| 
			
			
			
			 Zealot 
			
			Posts: 108 
				Karma: 810 
				Join Date: Jul 2012 
				
				
				
				Device: Kobo 
				
				
				 | 
	
	
	
		
		
			
			 
				
				small black square - NO
			 
			
			
			My mistake.   
		
	
		
		
		
		
		
		
		
		
		
		
		
			I found that, at least for the ePubs I have encountered, searching for that small black square "■" does not really help much in finding ePubs with extensive OCR errors. When I checked ePubs with a lot of occurrences (whether or not there were also "^" warnings), that black square was almost invariably being used for other display reasons and not as an OCR error indicator.

			Last edited by Rob557; 02-16-2015 at 12:51 PM.  | 
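
One way to act on that finding, sketched in Python under my own assumptions: tally each symbol separately and let only the more reliable ones ("�" and "^") drive the ranking, while still reporting "■" for a manual look. The two groupings and the function name `indicator_report` are illustrative choices, not anything built into Calibre or Quality Check, and "^" can of course also appear legitimately in some books.

```python
from collections import Counter

STRONG_INDICATORS = ("\ufffd", "^")   # "�" and "^": counted toward the score
AMBIGUOUS_INDICATORS = ("\u25a0",)    # "■": reported, but not scored


def indicator_report(text):
    """Return (score, per-symbol counts) for one book's extracted text."""
    watched = STRONG_INDICATORS + AMBIGUOUS_INDICATORS
    counts = Counter(ch for ch in text if ch in watched)
    # Only the strong indicators contribute to the ranking score
    score = sum(counts[ch] for ch in STRONG_INDICATORS)
    return score, dict(counts)
```

A book full of decorative black squares then gets a score of zero, yet the squares still show up in the per-symbol counts if someone wants to check them.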
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 