| 
			
			 | 
		#1 | 
| 
			
			
			
			 Member 
			
			Posts: 15 
				Karma: 10 
				Join Date: Dec 2003 
				
				
				
				Device: Toshiba E755 
				
				
				 | 
	
	
	
		
		
			
			 
				
				Extract PDF text and store in custom column
			 
			
			
Hi All,

I am interested in developing a plugin that will allow me to extract the text of a PDF file and store it in a custom text or comments column. I have around 8,000 PDFs (already OCRed; each has a small amount of text, usually one page or less, from scanned images) imported into calibre from a document management system. I would like to be able to search and tag documents based on the text extracted and stored in this custom column.

I was thinking of the following:
1. Enumerate the selected files (PDF types only)
2. Open the PDF file and extract its text content
3. Store the text in the column

Is there a Python library or API already included that will allow me to easily extract the text from the PDF file? I'm new to Python and will look to learn some of it this weekend; I'm fairly comfortable with Perl.

Many thanks!
Laz. | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#2 | 
| 
			
			
			
			 null operator (he/him) 
			
			Posts: 22,018 
				Karma: 30277294 
				Join Date: Mar 2012 
				Location: Sydney Australia 
				
				
				Device: none 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
My experience is that searches will be slower if the Comments column or similar long-text columns are included in 'the places to search'. AFAIK the contents aren't indexed.

But that's no reason not to go ahead with a plugin.

BR | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#3 | 
| 
			
			
			
			 creator of calibre 
			
			Posts: 45,609 
				Karma: 28549044 
				Join Date: Oct 2006 
				Location: Mumbai, India 
				
				
				Device: Various 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			Simply use calibre to bulk convert your PDF files to txt. That will extract the text from them (assuming they have actual extractable text, which is not always the case with PDF files).
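
For anyone who wants to script this rather than write a full plugin, here is a minimal sketch of the three steps from the first post, driven from outside calibre with its command-line tools `calibredb` and `ebook-convert`. Several things here are assumptions and not something stated in this thread: the custom column already exists with the hypothetical lookup name `pdftext`, `calibredb list --for-machine` returns JSON in which `formats` is a list of file paths, `set_custom` takes the column name without the leading `#`, and the converted TXT file is UTF-8. Treat it as a starting point, not a tested solution.

```python
import json
import subprocess
import tempfile
from pathlib import Path

# Hypothetical lookup name of the custom long-text column (shown as #pdftext in calibre).
COLUMN = "pdftext"


def pdf_books(list_json):
    """Yield (book_id, pdf_path) for each book that has a PDF format.

    Assumes `list_json` is the output of
    `calibredb list --fields formats --for-machine`.
    """
    for book in json.loads(list_json):
        for fmt in book.get("formats") or []:
            if fmt.lower().endswith(".pdf"):
                yield book["id"], fmt
                break  # one PDF per book is enough


def convert_command(pdf_path, txt_path):
    """Build the ebook-convert call that converts one PDF to plain text."""
    return ["ebook-convert", str(pdf_path), str(txt_path)]


def set_custom_command(column, book_id, text):
    """Build the calibredb call that stores text in a custom column."""
    return ["calibredb", "set_custom", column, str(book_id), text]


def store_pdf_text(column=COLUMN):
    """Convert every PDF in the default library and store its text."""
    listing = subprocess.run(
        ["calibredb", "list", "--fields", "formats", "--for-machine"],
        check=True, capture_output=True, text=True,
    ).stdout
    with tempfile.TemporaryDirectory() as tmp:
        for book_id, pdf_path in pdf_books(listing):
            txt_path = Path(tmp) / f"{book_id}.txt"
            subprocess.run(convert_command(pdf_path, txt_path), check=True)
            text = txt_path.read_text(encoding="utf-8", errors="replace").strip()
            if text:  # skip PDFs with no extractable text
                subprocess.run(set_custom_command(column, book_id, text), check=True)
```

Calling `store_pdf_text()` would process every book that has a PDF format. With around 8,000 documents it is probably worth trying this on a copy of the library first.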
		 
		
	
		
		
		
		
		
		
		
		
		
		
	
	 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 