Extract PDF text and store in custom column

diazlaz · 12-27-2013, 03:38 PM

Hi All,

I am interested in developing a plugin that will allow me to extract the text of a PDF file and store it in a custom text or comment column. I have around 8,000 PDFs (already OCRed; a small amount of text, usually one page or less, from scanned images) imported into calibre from a document management system. I would like to be able to search and tag documents using the text extracted and stored in this custom column.

I was thinking of the following:

1. enumerate the selected files (select PDF types)
2. open/extract the PDF text content from PDF file
3. store in column
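
The three steps above could be sketched in Python roughly as follows. This is only an outline under stated assumptions, not calibre's plugin API: `extract_text` and `store` are hypothetical callables standing in for whatever PDF extractor and column writer end up being used, and `books` is assumed to be a list of (book id, file path) pairs for the selected entries.

```python
from pathlib import Path
from typing import Callable, Iterable, Tuple


def is_pdf(path: Path) -> bool:
    """Step 1 helper: accept only files with a .pdf extension (any case)."""
    return path.suffix.lower() == ".pdf"


def extract_and_store(
    books: Iterable[Tuple[int, str]],
    extract_text: Callable[[Path], str],   # hypothetical: PDF path -> text
    store: Callable[[int, str], None],     # hypothetical: writes the column
) -> int:
    """Run the three steps and return how many books were updated."""
    stored = 0
    for book_id, raw_path in books:
        path = Path(raw_path)
        if not is_pdf(path):                # step 1: select PDF types
            continue
        text = extract_text(path).strip()   # step 2: extract the text
        if not text:                        # scanned page with no text layer
            continue
        store(book_id, text)                # step 3: store in the column
        stored += 1
    return stored
```

Keeping the extractor and the column writer as parameters means the loop can be tried out with stand-in functions before any calibre-specific code is written.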

Is there a Python library or API already included that will allow me to easily extract the PDF text from the file? I'm new to Python and will look to learn some of it this weekend; I'm fairly comfortable with Perl.

Many thanks!
Laz.

BetterRed · 12-27-2013, 04:52 PM

My experience is that searches will be slower if the Comments column or similar long text columns are included in 'the places to search' - AFAIK the contents aren't indexed.

But that's no reason not to go ahead with a plugin.

BR

kovidgoyal · 12-30-2013, 11:00 PM

Simply use calibre to bulk convert your PDF files to txt. That will extract the text from them (assuming they have actual extractable text, which is not always the case with PDF files).
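
For anyone who prefers to script the conversion rather than use the GUI, a rough sketch is below. It assumes calibre's `ebook-convert` command-line tool is on the PATH and that the PDFs sit in a plain folder (not inside the calibre library structure); the function names and folder layout are made up for illustration.

```python
import subprocess
from pathlib import Path
from typing import List


def build_convert_command(pdf: Path, out_dir: Path) -> List[str]:
    """Build an ebook-convert call; the .txt extension selects TXT output."""
    return ["ebook-convert", str(pdf), str(out_dir / (pdf.stem + ".txt"))]


def convert_folder(src_dir, out_dir, run=subprocess.run) -> List[Path]:
    """Convert every *.pdf in src_dir and return the expected .txt paths.

    `run` defaults to subprocess.run and can be swapped for a stub when
    trying the function out without calibre installed.
    """
    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    produced = []
    # Note: "*.pdf" is case-sensitive on most non-Windows systems.
    for pdf in sorted(src.glob("*.pdf")):
        cmd = build_convert_command(pdf, out)
        run(cmd, check=True)  # raises CalledProcessError if conversion fails
        produced.append(Path(cmd[-1]))
    return produced
```

As noted above, a PDF without an extractable text layer will yield little or no text. The resulting .txt contents could then be written to the custom column, for example with `calibredb set_custom`; check the calibredb documentation for the exact arguments in your version.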
